[ https://issues.apache.org/jira/browse/AVRO-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Kulp updated AVRO-1922:
------------------------------
Fix Version/s: (was: 1.9.0)
> Fixed dimension for array
> -------------------------
>
> Key: AVRO-1922
> URL: https://issues.apache.org/jira/browse/AVRO-1922
> Project: Apache Avro
> Issue Type: New Feature
> Components: spec
> Affects Versions: 1.8.1
> Reporter: Jim Pivarski
> Priority: Major
>
> This is a feature request for future versions of the Avro specification.
> We have found one kind of data structure that is hard to express in Avro:
> tensors. Although we can (and do) build matrices as {"type": "array",
> "items": {"type": "array", "items": "double"}}, this type does not specify
> that the grid of numbers is rectangular. We believe that rectangular arrays
> of numbers (or other nested types) would be a strong addition to Avro, both
> as a type system and as a serialization format. With the size of every
> dimension fixed in the schema, those sizes would not need to be repeated in
> each serialized datum.
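
To illustrate the gap described above: the nested "array" schema accepts ragged data, so a consumer that needs a rectangular grid has to verify the shape itself. The following is a minimal sketch of such a check on already-decoded Python lists; the function name `shape_if_rectangular` is hypothetical and not part of any Avro library.

```python
def shape_if_rectangular(value):
    """Return the shape of a nested list as a tuple, or None if it is ragged."""
    if not isinstance(value, list):
        return ()  # a scalar item has an empty shape
    # Collect the distinct shapes of all children at this level.
    child_shapes = {shape_if_rectangular(child) for child in value}
    if None in child_shapes or len(child_shapes) > 1:
        return None  # ragged somewhere below, or the children disagree
    inner = child_shapes.pop() if child_shapes else ()
    return (len(value),) + inner


square = [[1.0, 2.0], [3.0, 4.0]]
ragged = [[1.0, 2.0], [3.0]]  # also valid under the nested "array" schema

print(shape_if_rectangular(square))
print(shape_if_rectangular(ragged))
```

Both values conform to {"type": "array", "items": {"type": "array", "items": "double"}}, but only the first is a matrix; the check has to walk every sub-array of every datum.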
> For instance, suppose there were an extension of type "array" to specify
> dimensions:
> {"type": "array", "dimensions": [3, 3, 3, 3], "items": "double"}
> This 3-by-3-by-3-by-3 tensor (representing, for instance, the Riemann
> curvature tensor in 3-space) specifies that 81 double-precision numbers
> (3*3*3*3) are expected for each datum. With nested arrays, the size, "3",
> would have to be separately encoded 40 times (1 + 3*(1 + 3*(1 + 3))) for each
> datum, even though it would never change in a dataset of Riemann tensors.
> With a "dimensions" attribute in the schema, only the content needs to be
> serialized. Moreover, this extension can clearly be used with any other
> "items" type, to make dense tables of strings, for instance.
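
The arithmetic above can be checked with a short sketch. It assumes the current Avro binary encoding, in which an array is written as blocks, each starting with a zig-zag varint count, and ends with a zero count; it further assumes each array is written as a single block. The helper names are hypothetical, and the "dimensions" encoding is taken to carry no framing at all, as the proposal suggests.

```python
from math import prod


def zigzag_varint(n):
    """Encode an integer the way Avro encodes a long: zig-zag, then varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def nested_overhead(dims):
    """Count the arrays, and their framing bytes, in a nested-array encoding.

    Assumes one block per array: a count, then a zero count as terminator.
    """
    arrays = 0
    overhead = 0
    arrays_at_level = 1
    for d in dims:
        arrays += arrays_at_level
        framing = len(zigzag_varint(d)) + len(zigzag_varint(0))
        overhead += arrays_at_level * framing
        arrays_at_level *= d
    return arrays, overhead


dims = [3, 3, 3, 3]
arrays, overhead = nested_overhead(dims)
payload = prod(dims) * 8  # each double occupies 8 bytes

print(arrays)    # number of times the size 3 is written
print(overhead)  # framing bytes per datum with nested arrays
print(payload)   # bytes of actual content per datum
```

Under these assumptions the nested form spends 80 bytes of framing on 648 bytes of content for every datum, and the fixed-dimension form would spend none.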
> Avro has been extended in a similar way in the past. The "fixed" type is
> "bytes" with its length declared once in the schema, so the number of bytes
> need not be written for each datum. Our proposal provides a similar packing
> for structured objects, and the savings can be significant for large numbers
> of dimensions, as shown above. The advantage to
> consumers of Avro data is that we can write functions that do not need to
> check all array sizes at runtime (for operations like tensor contractions and
> products).
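
As a sketch of what this could look like on the consumer side: with the dimensions known from the schema, a reader could keep the items in one flat buffer and compute positions arithmetically, validating the length once instead of checking every sub-array. The class `FixedTensor` and the row-major ordering are assumptions made for illustration; the proposal does not specify an element order.

```python
from math import prod


def strides(dims):
    """Row-major strides: the step in the flat buffer per unit of each index."""
    result = []
    step = 1
    for d in reversed(dims):
        result.append(step)
        step *= d
    return tuple(reversed(result))


class FixedTensor:
    """A flat buffer viewed through dimensions taken from the schema."""

    def __init__(self, dims, flat):
        # The only size check: done once, not once per sub-array.
        if len(flat) != prod(dims):
            raise ValueError("expected %d items, got %d" % (prod(dims), len(flat)))
        self.dims = tuple(dims)
        self.strides = strides(dims)
        self.flat = flat

    def __getitem__(self, index):
        return self.flat[sum(i * s for i, s in zip(index, self.strides))]


def trace(t):
    """Contract the two indices of a square rank-2 tensor."""
    return sum(t[i, i] for i in range(t.dims[0]))


m = FixedTensor([3, 3], [float(k) for k in range(9)])
print(m[1, 2])
print(trace(m))
```

Because the shape comes from the schema rather than from the data, functions such as `trace` can rely on it without inspecting the datum.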
> We have searched the web and the Avro JIRA site for similar proposals and
> found none, so we're adding this proposal to JIRA in addition to this e-mail.
> Please let us know if you have any comments, or if we can provide any more
> information.
> Thank you!