The FixedSizeList type, which was added to Arrow a few months ago, is an
array where each slot contains a fixed-size sequence of values.  It is
specified as FixedSizeList<T>[N], where T is a child type and N is a signed
int32 that specifies the length of each list.

This is useful for encoding fixed-size tensors.  E.g., if each array slot
holds a 100x8x10 tensor of bytes, then I can encode the array as
FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[8]>[100].

But I'm also interested in encoding tensors where some dimension sizes are
not known in advance.  It seems to me that FixedSizeList could be extended
to support this fairly easily, by simply defining that N=-1 means "each
array slot has the same length, but that length is not known in advance."
For example, we could encode a 100x?x10 tensor as
FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[-1]>[100].
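
To make the nesting order concrete, here is a small Python sketch that builds
the type string from a tensor shape.  This is illustration only, not Arrow
code: the helper name fixed_size_list_type and the use of None for an unknown
dimension are made up for this example.

```python
from typing import Optional, Sequence


def fixed_size_list_type(value_type: str, shape: Sequence[Optional[int]]) -> str:
    """Build the nested FixedSizeList type string for a tensor shape.

    Hypothetical helper: None marks a dimension whose size is not known
    in advance and is written as N=-1, per the proposal above.
    """
    result = value_type
    # The innermost dimension wraps the value type first, so walk the
    # shape from last to first.
    for dim in reversed(shape):
        n = -1 if dim is None else dim
        result = f"FixedSizeList<{result}>[{n}]"
    return result


print(fixed_size_list_type("byte", [100, 8, 10]))
print(fixed_size_list_type("byte", [100, None, 10]))
```

The second call prints the 100x?x10 type shown above, with -1 in the middle
position.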

Since these N=-1 row-lengths are not encoded in the type, we need some way
to determine what they are.  Luckily, every Field in the schema has a
corresponding FieldNode in the message; and those FieldNodes can be used to
deduce the row lengths.  In particular, the row length must be equal to the
length of the child node divided by the length of the FixedSizeList.  E.g.,
if we have a FixedSizeList<byte>[-1] array with the values [[1, 2], [3, 4],
[5, 6]] then the message representation is:

* Length: 3, Null count: 0
* Null bitmap buffer: Not required
* Values array (byte array):
    * Length: 6,  Null count: 0
    * Null bitmap buffer: Not required
    * Value buffer: [1, 2, 3, 4, 5, 6, <unspecified padding bytes>]

So we can deduce that the row length is 6/3=2.
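
The same division can be applied at every nesting level.  The Python sketch
below takes the FieldNode lengths from the outermost array down to the leaf
values and returns one list size per level.  The function name
infer_list_sizes is hypothetical, and returning None for a zero-length parent
is an assumption of this sketch: the division is undefined there, and the
proposal as written does not say what should happen.

```python
from typing import List, Optional


def infer_list_sizes(node_lengths: List[int]) -> List[Optional[int]]:
    """Deduce per-level list sizes from FieldNode lengths.

    node_lengths runs from the outermost FixedSizeList node down to the
    leaf values node.  Each list size is child length / parent length.
    """
    sizes: List[Optional[int]] = []
    for parent, child in zip(node_lengths, node_lengths[1:]):
        if parent == 0:
            # Assumption: with no slots there is nothing to divide by,
            # so the size is reported as unknown.
            sizes.append(None)
        elif child % parent != 0:
            raise ValueError(
                f"child length {child} is not a multiple of parent length {parent}"
            )
        else:
            sizes.append(child // parent)
    return sizes


# The example above: 3 slots over 6 child values.
print(infer_list_sizes([3, 6]))  # [2]
# Two slots of a 100x?x10 tensor whose unknown dimension turns out to be 8.
print(infer_list_sizes([2, 200, 1600, 16000]))  # [100, 8, 10]
```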

It looks to me like it would be fairly easy to add support for this.  E.g.,
in the FixedSizeListArray constructor in C++, if list_type()->list_size()
is -1, then set list_size_ to values.length()/length.  There would be no
changes to the schema.fbs/message.fbs files -- we would just be assigning a
meaning to something that's currently meaningless (having
FixedSizeList.listSize=-1).
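
A rough Python model of that constructor change, to show the intended
behavior.  This is not the Arrow C++ code: the class FixedSizeListSketch and
its slot() method are invented for illustration, and rejecting a zero-length
array with list_size=-1 is an assumption of the sketch rather than part of the
proposal.

```python
from typing import List, Sequence


class FixedSizeListSketch:
    """Toy stand-in for a fixed-size list array over a flat values sequence."""

    def __init__(self, values: Sequence[int], length: int, list_size: int = -1):
        if list_size == -1:
            # Proposed rule: resolve the size as values.length() / length.
            if length == 0:
                raise ValueError("cannot deduce list size of an empty array")
            if len(values) % length != 0:
                raise ValueError("values length is not a multiple of array length")
            list_size = len(values) // length
        elif len(values) != length * list_size:
            raise ValueError("values length must equal length * list_size")
        self.values = values
        self.length = length
        self.list_size = list_size

    def slot(self, i: int) -> List[int]:
        """Return the i-th fixed-size list as a plain Python list."""
        start = i * self.list_size
        return list(self.values[start:start + self.list_size])


arr = FixedSizeListSketch([1, 2, 3, 4, 5, 6], length=3)  # list_size left as -1
print(arr.list_size)  # 2
print(arr.slot(1))    # [3, 4]
```

Once the size is resolved in the constructor, slot access works exactly as it
does for a list size that was given in the type.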

If there's support for adding this to Arrow, then I could put together a PR.

Thanks,
-Edward

P.S. Apologies if this gets posted twice -- I sent it out a couple days ago
right before subscribing to the mailing list; but I don't see it on the
archives, presumably because I wasn't subscribed yet when I sent it out.
