By the way, this behavior is what make supporting fixed size lists while reading parquet tricky.
On Mon, Feb 22, 2021 at 11:35 AM Jorge Cardoso Leitão < jorgecarlei...@gmail.com> wrote: > Thank you both for your time and for your answers. I've made a mistake; > you are right :-) > > Best, > Jorge > > > > On Sun, Feb 21, 2021 at 1:39 PM Micah Kornfield <emkornfi...@gmail.com> > wrote: > >> Hi Jorge, >> I'm not sure I understand your example, but I would expect any child array >> of a fixed size list to always have N*size of fixed size list elements. >> So >> for: >> [ >> null, >> [a, bc], >> [de, feg] >> ] >> >> (i.e. FixedSizeList<Binary>(2) where length 3. with the first element is >> null) >> >> I would expect the child array to have [0, 0, 0, 1, 3, 5, 8] as its >> indices (a total logical length=6). >> >> Which I think corresponds to your second representation of the child >> array? C++ Validates FixedSizeLists in its validate method to meet this >> conditions [1] >> >> We should probably clarify the specification. >> >> -Micah >> >> [1] >> >> https://github.com/apache/arrow/blob/995abdc02fed412bbd947fe41a0765036dbbe820/cpp/src/arrow/array/validate.cc#L103 >> >> >> >> >> On Sun, Feb 21, 2021 at 12:38 AM Jorge Cardoso Leitão < >> jorgecarlei...@gmail.com> wrote: >> >> > Hi, >> > >> > We state in the spec that: >> > >> > A fixed size list type is specified like FixedSizeList<T>[N], where T is >> > > any type (*primitive or nested*) and N is a 32-bit signed integer >> > > representing the length of the lists. >> > > >> > >> > (emphasis mine) >> > >> > Now, suppose that we have FixedSizeList<Binary>[2], i.e. a fixed type >> whose >> > inner is a variable sized type, as follows >> > >> > [ >> > Null, >> > [ >> > [[0], [1, 2]], >> > [[3, 4], [5]], >> > ] >> > ] >> > >> > Looking at the offsets of the binary, two options seem possible >> according >> > to the spec: >> > >> > 1. [0, 1, 3, 5, 6] (i.e. inner has len = 4) >> > 2. [0, 0, 0, 1, 3, 5, 6] (i.e. inner has len = 6) >> > >> > The difference in behavior emerges whenever we want to access the >> values of >> > the i'th slot of the fixed list, e.g. [ [[0], [1, 2]], [[3, 4], [5]] ] >> > above. >> > >> > With option 1, we can't slice the inner using `[i * 2, (i + 1) * 2]`: >> for i >> > = 1 this would correspond to the offsets `[3, 5, 6, out of bounds]` (the >> > result would still be wrong if this was in bounds, as it excluded the >> > `[[0], [1, 2]]`). In this case, we need to count the number of nulls, >> > `nulls`, up to `i` and take `[(i - nulls) * 2, (i - nulls + 1) * 2]`. >> > >> > If we use option 2, we can slice the binary directly using `[i * 2, (i >> + 1) >> > * 2]`: for i = 1, this would correspond to the offsets `[0, 1, 3, 5, >> 6]`, >> > which is correct. >> > >> > The challenge here is that there is no way to tell whether the inner >> array >> > fulfills this "sliceability" constraint or not. I can't find this >> > constraint in the spec. Do we enforce it somewhere? Note that this >> behavior >> > only affects FixedSizeList, but it does affect all variations whose >> inner >> > has a variable size (List, Binary, Utf8, etc). >> > >> > Any ideas? >> > >> > Best, >> > Jorge >> > >> >