Greetings!

We represent the Data Mining Group <http://dmg.org/>, a 501(c)(3) organization managing data mining standards such as the Portable Format for Analytics (PFA) <http://dmg.org/pfa/>. PFA is used by data scientists to transport and deploy predictive models in a standards-compliant way.

You may be interested to know that PFA, a JSON-based format, uses Avro as its type system <http://dmg.org/pfa/docs/avro_types/>. Although this is tangential to Avro's main goal as a serialization system, it fits well with our need to describe structured types in JSON. (We even use Avro's schema resolution to identify subtypes for covariant function arguments.)

In our development of PFA, we have found one kind of data structure that is hard to express in Avro: tensors. Although we can (and do) build matrices as

    {"type": "array", "items": {"type": "array", "items": "double"}}

this type does not specify that the grid of numbers is rectangular. We believe that rectangular arrays of numbers (or other nested types) would be a strong addition to Avro, both as a type system and as a serialization format. With the size of every dimension fixed in the schema, the sizes would not need to be repeated in each serialized datum.

For instance, suppose there were an extension of type "array" to specify dimensions:

    {"type": "array", "dimensions": [3, 3, 3, 3], "items": "double"}

This 3-by-3-by-3-by-3 tensor (representing, for instance, the Riemann curvature tensor <https://en.wikipedia.org/wiki/Riemann_curvature_tensor> in 3-space) specifies that 81 double-precision numbers (3*3*3*3) are expected for each datum. With nested arrays, the size, 3, would have to be separately encoded 40 times (1 + 3*(1 + 3*(1 + 3)), once per array) for each datum, even though it never changes in a dataset of Riemann tensors. With a "dimensions" attribute in the schema, only the content needs to be serialized. Moreover, this extension can clearly be used with any other "items" type, to make dense tables of strings, for instance.

Avro has been extended in a similar way in the past. The "fixed" type is a "bytes" without the need to specify the number of bytes for each datum. Our proposal provides a similar packing for structured objects, and the savings can be significant for large numbers of dimensions, as shown above.

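To make the counting above concrete, here is a minimal Python sketch. It is ours for illustration only: the function names nested_array_count and item_count are hypothetical and are not part of Avro or PFA. It counts how many arrays (and hence how many separately encoded sizes) a nested-array encoding needs for a given list of dimensions, and how many items would be serialized with the proposed "dimensions" attribute.

```python
from math import prod


def nested_array_count(dimensions):
    """Count the arrays in a nested-array encoding of a rectangular tensor.

    There is one outermost array, then one array per index of each
    enclosing level: 1 + d0 + d0*d1 + ... (the innermost level holds
    the items themselves, so it adds no further arrays).
    """
    count = 0
    arrays_at_level = 1
    for size in dimensions:
        count += arrays_at_level
        arrays_at_level *= size
    return count


def item_count(dimensions):
    """Count the items in a dense tensor with the given dimensions."""
    return prod(dimensions)


# The 3-by-3-by-3-by-3 example from this e-mail.
dims = [3, 3, 3, 3]
print(nested_array_count(dims))  # 40 arrays, each carrying its own size
print(item_count(dims))          # 81 doubles of actual content
```

For an ordinary 3-by-3 matrix the same functions give 4 arrays and 9 items, so the relative overhead of repeated sizes grows with the number of dimensions.
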
The advantage to PFA is that we can write functions that do not need to check all array sizes at runtime (for operations like tensor contractions and products).

We have searched the web and the Avro JIRA site for similar proposals and found none, so we're adding this proposal to JIRA (see issue 1922 <https://issues.apache.org/jira/browse/AVRO-1922>) in addition to this e-mail.

Please let us know if you have any comments, or if we can provide any more information. Thank you for your consideration!

-- Walt Wells for the Data Mining Group
