I'm a little confused about the proposal now. If the unknown dimension
doesn't have to be the same within a record batch, how would you be able to
deduce it with the approach you described (dividing the logical length of
the values array by the length of the record batch)?

On Wed, Jul 31, 2019 at 8:24 AM Wes McKinney <wesmck...@gmail.com> wrote:

> I agree this sounds like a good application for ExtensionType. At
> minimum, ExtensionType can be used to develop a working version of
> what you need to help guide further discussions.
>
> On Mon, Jul 29, 2019 at 2:29 PM Francois Saint-Jacques
> <fsaintjacq...@gmail.com> wrote:
> >
> > Hello,
> >
> > if each record has a different size, then I suggest to just use a
> > Struct<Dim, List<T>> where Dim is a struct (or expand in the outer
> > struct). You can probably add your own logic with the recently
> > introduced ExtensionType [1].
> >
> > François
> > [1]
> https://github.com/apache/arrow/blob/f77c3427ca801597b572fb197b92b0133269049b/cpp/src/arrow/extension_type.h
> >
> > On Mon, Jul 29, 2019 at 3:15 PM Edward Loper <edlo...@google.com.invalid>
> wrote:
> > >
> > > The intention is that each individual record could have a different
> size.
> > > This could be consistent within a given batch, but wouldn't need to be.
> > > For example, if I wanted to send a 3-channel image, but the image size
> may
> > > vary for each record, then I could use
> > > FixedSizeList<FixedSizeList<FixedSizeList<Int8>[3]>[-1]>[-1].
> > >
> > > On Mon, Jul 29, 2019 at 1:18 PM Brian Hulette <bhule...@apache.org>
> wrote:
> > >
> > > > This isn't really relevant but I feel compelled to point it out - the
> > > > FixedSizeList type has actually been in the Arrow spec for a while,
> but it
> > > > was only implemented in JS and Java initially. It was implemented in
> C++
> > > > just a few months ago.
> > > >
> > >
> > > Thanks for the clarification -- I was going based on the blame history
> for
> > > Layout.rst, but I guess it just didn't get officially documented there
> > > until the c++ implementation was added.
> > >
> > > -Edward
> > >
> > >
> > > > On Mon, Jul 29, 2019 at 7:01 AM Edward Loper
> <edlo...@google.com.invalid>
> > > > wrote:
> > > >
> > > > > The FixedSizeList type, which was added to Arrow a few months ago,
> is an
> > > > > array where each slot contains a fixed-size sequence of values.
> It is
> > > > > specified as FixedSizeList<T>[N], where T is a child type and N is
> a
> > > > signed
> > > > > int32 that specifies the length of each list.
> > > > >
> > > > > This is useful for encoding fixed-size tensors.  E.g., if I have a
> > > > 100x8x10
> > > > > tensor, then I can encode it as
> > > > > FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[8]>[100].
> > > > >
> > > > > But I'm also interested in encoding tensors where some dimension
> sizes
> > > > are
> > > > > not known in advance.  It seems to me that FixedSizeList could be
> > > > extended
> > > > > to support this fairly easily, by simply defining that N=-1 means
> "each
> > > > > array slot has the same length, but that length is not known in
> advance."
> > > > >  So e.g. we could encode a 100x?x10 tensor as
> > > > > FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[-1]>[100].
> > > > >
> > > > > Since these N=-1 row-lengths are not encoded in the type, we need
> some
> > > > way
> > > > > to determine what they are.  Luckily, every Field in the schema
> has a
> > > > > corresponding FieldNode in the message; and those FieldNodes can
> be used
> > > > to
> > > > > deduce the row lengths.  In particular, the row length must be
> equal to
> > > > the
> > > > > length of the child node divided by the length of the
> FixedSizeList.
> > > > E.g.,
> > > > > if we have a FixedSizeList<byte>[-1] array with the values [[1,
> 2], [3,
> > > > 4],
> > > > > [5, 6]] then the message representation is:
> > > > >
> > > > > * Length: 3, Null count: 0
> > > > > * Null bitmap buffer: Not required
> > > > > * Values array (byte array):
> > > > >     * Length: 6,  Null count: 0
> > > > >     * Null bitmap buffer: Not required
> > > > >     * Value buffer: [1, 2, 3, 4, 5, 6, <unspecified padding bytes>]
> > > > >
> > > > > So we can deduce that the row length is 6/3=2.
> > > > >
> > > > > It looks to me like it would be fairly easy to add support for
> this.
> > > > E.g.,
> > > > > in the FixedSizeListArray constructor in c++, if
> list_type()->list_size()
> > > > > is -1, then set list_size_ to values.length()/length.  There would
> be no
> > > > > changes to the schema.fbs/message.fbs files -- we would just be
> > > > assigning a
> > > > > meaning to something that's currently meaningless (having
> > > > > FixedSizeList.listSize=-1).
> > > > >
> > > > > If there's support for adding this to Arrow, then I could put
> together a
> > > > > PR.
> > > > >
> > > > > Thanks,
> > > > > -Edward
> > > > >
> > > > > P.S. Apologies if this gets posted twice -- I sent it out a couple
> days
> > > > ago
> > > > > right before subscribing to the mailing list; but I don't see it
> on the
> > > > > archives, presumably because I wasn't subscribed yet when I sent
> it out.
> > > > >
> > > >
>

Reply via email to