By the way, this behavior is what make supporting fixed size lists while
reading parquet tricky.

On Mon, Feb 22, 2021 at 11:35 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Thank you both for your time and for your answers. I've made a mistake;
> you are right :-)
>
> Best,
> Jorge
>
>
>
> On Sun, Feb 21, 2021 at 1:39 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Jorge,
>> I'm not sure I understand your example, but I would expect any child array
>> of a fixed size list to always have N*size of fixed size list elements.
>> So
>> for:
>> [
>>   null,
>>   [a, bc],
>>   [de, feg]
>> ]
>>
>> (i.e. FixedSizeList<Binary>(2) where length 3.  with the first element is
>> null)
>>
>> I would expect the child array to have [0, 0, 0, 1, 3, 5, 8]  as its
>> indices (a total logical length=6).
>>
>> Which I think corresponds to your second representation of the child
>> array?  C++ Validates FixedSizeLists in its validate method to meet this
>> conditions [1]
>>
>> We should probably clarify the specification.
>>
>> -Micah
>>
>> [1]
>>
>> https://github.com/apache/arrow/blob/995abdc02fed412bbd947fe41a0765036dbbe820/cpp/src/arrow/array/validate.cc#L103
>>
>>
>>
>>
>> On Sun, Feb 21, 2021 at 12:38 AM Jorge Cardoso Leitão <
>> jorgecarlei...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > We state in the spec that:
>> >
>> > A fixed size list type is specified like FixedSizeList<T>[N], where T is
>> > > any type (*primitive or nested*) and N is a 32-bit signed integer
>> > > representing the length of the lists.
>> > >
>> >
>> > (emphasis mine)
>> >
>> > Now, suppose that we have FixedSizeList<Binary>[2], i.e. a fixed type
>> whose
>> > inner is a variable sized type, as follows
>> >
>> > [
>> >     Null,
>> >     [
>> >         [[0], [1, 2]],
>> >         [[3, 4], [5]],
>> >     ]
>> > ]
>> >
>> > Looking at the offsets of the binary, two options seem possible
>> according
>> > to the spec:
>> >
>> > 1. [0, 1, 3, 5, 6]  (i.e. inner has len = 4)
>> > 2. [0, 0, 0, 1, 3, 5, 6]  (i.e. inner has len = 6)
>> >
>> > The difference in behavior emerges whenever we want to access the
>> values of
>> > the i'th slot of the fixed list, e.g. [ [[0], [1, 2]], [[3, 4], [5]] ]
>> > above.
>> >
>> > With option 1, we can't slice the inner using `[i * 2, (i + 1) * 2]`:
>> for i
>> > = 1 this would correspond to the offsets `[3, 5, 6, out of bounds]` (the
>> > result would still be wrong if this was in bounds, as it excluded the
>> > `[[0], [1, 2]]`). In this case, we need to count the number of nulls,
>> > `nulls`, up to `i` and take `[(i - nulls) * 2, (i - nulls + 1) * 2]`.
>> >
>> > If we use option 2, we can slice the binary directly using `[i * 2, (i
>> + 1)
>> > * 2]`: for i = 1, this would correspond to the offsets `[0, 1, 3, 5,
>> 6]`,
>> > which is correct.
>> >
>> > The challenge here is that there is no way to tell whether the inner
>> array
>> > fulfills this "sliceability" constraint or not. I can't find this
>> > constraint in the spec. Do we enforce it somewhere? Note that this
>> behavior
>> > only affects FixedSizeList, but it does affect all variations whose
>> inner
>> > has a variable size (List, Binary, Utf8, etc).
>> >
>> > Any ideas?
>> >
>> > Best,
>> > Jorge
>> >
>>
>

Reply via email to