We represent the Data Mining Group, a 501(c)(3) organization managing
data mining standards such as the Portable Format for Analytics (PFA).
PFA is used by data scientists to transport and deploy predictive models
in a standards-compliant way.

You may be interested to know that PFA, a JSON-based format, uses Avro as
its type system. Although this is
tangential to Avro's main goal as a serialization system, it fits well with
our need to describe structured types in JSON. (We even use Avro's schema
resolution to identify subtypes for covariant function arguments.)

In our development of PFA, we have found one kind of data structure that is
hard to express in Avro: tensors. Although we can (and do) build matrices
as {"type": "array", "items": {"type": "array", "items": "double"}}, this
type does not specify that the grid of numbers is rectangular. We believe
that rectangular arrays of numbers (or other nested types) would be a
strong addition to Avro, both as a type system and as a serialization
format. With the size of every dimension fixed in the schema, those sizes
would not need to be repeated in each serialized datum.
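
To illustrate the gap, here is a minimal Python sketch of the check a
consumer has to run today. It is for illustration only: shape_of is a
hypothetical helper, not part of Avro or PFA. Both values below are valid
against the nested-array schema, but only one is a matrix.

```python
def shape_of(nested):
    """Return the shape of a nested list, or raise ValueError if it is ragged.

    Hypothetical helper for illustration; not an Avro or PFA function.
    """
    if not isinstance(nested, list):
        return ()          # a scalar item has an empty shape
    if not nested:
        return (0,)        # an empty array has one dimension of size zero
    first = shape_of(nested[0])
    for item in nested[1:]:
        if shape_of(item) != first:
            raise ValueError("array is not rectangular")
    return (len(nested),) + first


# Both values are arrays of arrays of doubles.
rectangular = [[1.0, 2.0], [3.0, 4.0]]
ragged = [[1.0, 2.0], [3.0]]

print(shape_of(rectangular))
try:
    shape_of(ragged)
except ValueError as err:
    print("rejected:", err)
```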

For instance, suppose there were an extension of the "array" type that
allowed:

{"type": "array", "dimensions": [3, 3, 3, 3], "items": "double"}

This 3-by-3-by-3-by-3 tensor (representing, for instance, the Riemann
curvature tensor in 3-space) specifies that 81 double-precision numbers
(3*3*3*3) are expected for each datum. With nested arrays, the size, "3,"
would have to be separately encoded 40 times (1 + 3*(1 + 3*(1 + 3))) for
each datum, even though it never changes in a dataset of Riemann tensors.
With a "dimensions" attribute in the schema, only the content needs to be
serialized. Moreover, this extension can clearly be used with any other
"items" type, to make dense tables of strings, for instance.

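The arithmetic above can be reproduced with a short Python sketch. The
function names are hypothetical and used only for illustration. The sketch
counts only the repeated size values discussed above (one per array at
every nesting level) and ignores any other per-array framing that a
serialization might add.

```python
from math import prod


def nested_array_count(dimensions):
    """Count the arrays in a nested-array encoding of the given shape.

    Each array carries its own size, so this is also the number of times
    a size value is written per datum. Hypothetical helper.
    """
    count = 0
    arrays_at_level = 1        # one outermost array
    for size in dimensions:
        count += arrays_at_level
        arrays_at_level *= size  # each array holds `size` arrays one level down
    return count


def item_count(dimensions):
    """Number of leaf items (e.g. doubles) in a tensor of the given shape."""
    return prod(dimensions)


dims = [3, 3, 3, 3]
print(nested_array_count(dims), "sizes encoded per datum with nested arrays")
print(item_count(dims), "items per datum in either representation")
```

For the shape [3, 3, 3, 3] this gives 1 + 3 + 9 + 27 = 40 sizes and 81
items, matching the figures above.
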
Avro has been extended in a similar way in the past. The "fixed" type is a
"bytes" whose length is declared in the schema, so the number of bytes
does not need to be specified for each datum. Our proposal provides a
similar packing for structured objects, and the savings can be significant
for large numbers of dimensions, as shown above. The advantage to PFA is
that we can write functions that do not need to check all array sizes at
runtime (for operations like tensor contractions and products).
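
As a rough sketch of what fixed dimensions allow, the Python function
below addresses an element of a flat sequence of items directly from the
schema's "dimensions" list. The row-major (last index varies fastest)
ordering is our assumption for this example; the proposal above does not
fix an ordering, and flat_index is a hypothetical name.

```python
def flat_index(dimensions, index):
    """Map a multi-dimensional index to an offset in a flat item sequence.

    Assumes row-major order, which is an assumption of this sketch and
    not something the proposal specifies.
    """
    if len(index) != len(dimensions):
        raise ValueError("index rank does not match dimensions")
    offset = 0
    for size, i in zip(dimensions, index):
        if not 0 <= i < size:
            raise IndexError("index out of range")
        offset = offset * size + i
    return offset


dims = [3, 3, 3, 3]
items = [0.0] * 81                 # 3*3*3*3 doubles, stored flat
items[flat_index(dims, (1, 0, 0, 0))] = 1.5
print(flat_index(dims, (2, 2, 2, 2)))
```

Because the shape comes from the schema, a function receiving such a value
only needs the flat item sequence and never has to inspect per-row sizes.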

We have searched the web and the Avro JIRA site for similar proposals and
found none, so we're adding this proposal to JIRA (see issue 1922) in
addition to this e-mail. Please let us know if you have any comments, or
if we can provide any more information.

Thank you for your consideration!
-- Walt Wells for the Data Mining Group
