Yeah, I didn't quite follow the example either; it seems like it actually
corresponds to a FixedSizeList<FixedSizeList<Binary>[2]>[2]? Or perhaps a
FixedSizeList<List<Binary>>[2]? Assuming the former, it seems you'd need
additional fixed-size slots to account for the Null element. In Julia, you
can inspect the internal structure like this:

julia> c = [missing, ( ([0x00], [0x01, 0x02]), ([0x03, 0x04], [0x05]))]
2-element Vector{Union{Missing, Tuple{Tuple{Vector{UInt8}, Vector{UInt8}},
Tuple{Vector{UInt8}, Vector{UInt8}}}}}:
missing
((UInt8[0x00], UInt8[0x01, 0x02]), (UInt8[0x03, 0x04], UInt8[0x05]))

julia> ac = Arrow.toarrowvector(c)
2-element Arrow.FixedSizeList{Union{Missing, Tuple{Tuple{Vector{UInt8},
Vector{UInt8}}, Tuple{Vector{UInt8}, Vector{UInt8}}}},
Arrow.FixedSizeList{Tuple{Vector{UInt8}, Vector{UInt8}},
Arrow.List{Vector{UInt8}, Int32, Arrow.ToList{UInt8, false, Vector{UInt8},
Int32}}}}:
missing
((UInt8[0x00], UInt8[0x01, 0x02]), (UInt8[0x03, 0x04], UInt8[0x05]))

# binary list data (the first four bytes are placeholders for the null slot)
julia> ac.data.data.data
10-element Arrow.ToList{UInt8, false, Vector{UInt8}, Int32}:
0x00
0x00
0x00
0x00
0x00
0x01
0x02
0x03
0x04
0x05

# binary list offsets (shown as 1-based, inclusive (start, stop) pairs
# into the data above)
julia> ac.data.data.offsets
8-element Arrow.Offsets{Int32}:
(1, 1)
(2, 2)
(3, 3)
(4, 4)
(5, 5)
(6, 7)
(8, 9)
(10, 10)
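
To make the slot accounting concrete, here is a minimal sketch in plain
Julia (no Arrow.jl calls), assuming the
FixedSizeList<FixedSizeList<Binary>[2]>[2] reading and assuming that a null
parent slot still reserves its child slots. The helpers `child_range` and
`binaries` are hypothetical names of mine, and the offsets are written as
0-based Arrow offsets with empty placeholders for the null slot, rather than
the one-byte placeholders that show up in the output above.

```julia
# Hypothetical helpers for illustration only; not part of Arrow.jl.

# 1-based child range covered by 1-based parent slot i of a fixed-size list
# of width n, assuming null parent slots still reserve n child slots.
child_range(i, n) = ((i - 1) * n + 1):(i * n)

# Materialize the binary values at slots ks from 0-based Arrow offsets
# and a flat data buffer.
binaries(offsets, data, ks) = [data[(offsets[k] + 1):offsets[k + 1]] for k in ks]

# Outer list: 2 slots (the first is null), each holding 2 inner lists of
# 2 binaries, so the binary child has 2 * 2 * 2 = 8 slots. The first 4
# are empty placeholders for the null outer slot.
offsets = Int32[0, 0, 0, 0, 0, 1, 3, 5, 6]
data = UInt8[0x00, 0x01, 0x02, 0x03, 0x04, 0x05]

inner = child_range(2, 2)   # inner-list slots of outer slot 2, i.e. 3:4
# Binary slots spanned by those inner lists, i.e. 5:8
leaves = first(child_range(first(inner), 2)):last(child_range(last(inner), 2))
vals = binaries(offsets, data, leaves)
```

With the placeholders in place, `vals` holds the four binaries of the
non-null element, and no null counting is needed to find them.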

On Sun, Feb 21, 2021 at 1:38 AM Jorge Cardoso Leitão wrote:
> Hi,
>
> We state in the spec that:
>
> > A fixed size list type is specified like FixedSizeList<T>[N], where T is
> > any type (*primitive or nested*) and N is a 32-bit signed integer
> > representing the length of the lists.
> >
>
> (emphasis mine)
>
> Now, suppose that we have FixedSizeList<Binary>[2], i.e. a fixed-size list
> type whose inner type is variable-sized, as follows:
>
> [
> Null,
> [
> [[0], [1, 2]],
> [[3, 4], [5]],
> ]
> ]
>
> Looking at the offsets of the binary, two options seem possible according
> to the spec:
>
> 1. [0, 1, 3, 5, 6] (i.e. inner has len = 4)
> 2. [0, 0, 0, 1, 3, 5, 6] (i.e. inner has len = 6)
>
> The difference in behavior emerges whenever we want to access the values of
> the i'th slot of the fixed list, e.g. [ [[0], [1, 2]], [[3, 4], [5]] ]
> above.
>
> With option 1, we can't slice the inner using `[i * 2, (i + 1) * 2]`: for i
> = 1 this would correspond to the offsets `[3, 5, 6, out of bounds]` (the
> result would still be wrong even if this were in bounds, as it excludes
> `[[0], [1, 2]]`). In this case, we need to count the number of nulls,
> `nulls`, up to `i` and take `[(i - nulls) * 2, (i - nulls + 1) * 2]`.
>
> If we use option 2, we can slice the binary directly using `[i * 2, (i + 1)
> * 2]`: for i = 1, this would correspond to the offsets `[0, 1, 3, 5, 6]`,
> which is correct.
>
> The challenge here is that there is no way to tell whether the inner array
> fulfills this "sliceability" constraint or not. I can't find this
> constraint in the spec. Do we enforce it somewhere? Note that this behavior
> only affects FixedSizeList, but it does affect all variations whose inner
> has a variable size (List, Binary, Utf8, etc).
>
> Any ideas?
>
> Best,
> Jorge
>