The intention is that each individual record could have a different size. This could be consistent within a given batch, but wouldn't need to be. For example, if I wanted to send a 3-channel image, but the image size may vary for each record, then I could use FixedSizeList<FixedSizeList<FixedSizeList<Int8>[3]>[-1]>[-1].
On Mon, Jul 29, 2019 at 1:18 PM Brian Hulette <bhule...@apache.org> wrote: > This isn't really relevant but I feel compelled to point it out - the > FixedSizeList type has actually been in the Arrow spec for a while, but it > was only implemented in JS and Java initially. It was implemented in C++ > just a few months ago. > Thanks for the clarification -- I was going based on the blame history for Layout.rst, but I guess it just didn't get officially documented there until the c++ implementation was added. -Edward > On Mon, Jul 29, 2019 at 7:01 AM Edward Loper <edlo...@google.com.invalid> > wrote: > > > The FixedSizeList type, which was added to Arrow a few months ago, is an > > array where each slot contains a fixed-size sequence of values. It is > > specified as FixedSizeList<T>[N], where T is a child type and N is a > signed > > int32 that specifies the length of each list. > > > > This is useful for encoding fixed-size tensors. E.g., if I have a > 100x8x10 > > tensor, then I can encode it as > > FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[8]>[100]. > > > > But I'm also interested in encoding tensors where some dimension sizes > are > > not known in advance. It seems to me that FixedSizeList could be > extended > > to support this fairly easily, by simply defining that N=-1 means "each > > array slot has the same length, but that length is not known in advance." > > So e.g. we could encode a 100x?x10 tensor as > > FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[-1]>[100]. > > > > Since these N=-1 row-lengths are not encoded in the type, we need some > way > > to determine what they are. Luckily, every Field in the schema has a > > corresponding FieldNode in the message; and those FieldNodes can be used > to > > deduce the row lengths. In particular, the row length must be equal to > the > > length of the child node divided by the length of the FixedSizeList. > E.g., > > if we have a FixedSizeList<byte>[-1] array with the values [[1, 2], [3, > 4], > > [5, 6]] then the message representation is: > > > > * Length: 3, Null count: 0 > > * Null bitmap buffer: Not required > > * Values array (byte array): > > * Length: 6, Null count: 0 > > * Null bitmap buffer: Not required > > * Value buffer: [1, 2, 3, 4, 5, 6, <unspecified padding bytes>] > > > > So we can deduce that the row length is 6/3=2. > > > > It looks to me like it would be fairly easy to add support for this. > E.g., > > in the FixedSizeListArray constructor in c++, if list_type()->list_size() > > is -1, then set list_size_ to values.length()/length. There would be no > > changes to the schema.fbs/message.fbs files -- we would just be > assigning a > > meaning to something that's currently meaningless (having > > FixedSizeList.listSize=-1). > > > > If there's support for adding this to Arrow, then I could put together a > > PR. > > > > Thanks, > > -Edward > > > > P.S. Apologies if this gets posted twice -- I sent it out a couple days > ago > > right before subscribing to the mailing list; but I don't see it on the > > archives, presumably because I wasn't subscribed yet when I sent it out. > > >