nevi-me commented on issue #1750: URL: https://github.com/apache/arrow-rs/issues/1750#issuecomment-1153269508
I've read this thread a few times, but I'm still hazy on what a good approach is, given how long I have been out of the loop.

We had discussed with Jorge many moons ago that passing the offset and length to `Buffer` and `Bitmap` would be a good solution (as is done in arrow2, like you mention @tustvold).

I haven't written arrow code in a very long time, so I can't quite remember the details. However, I do recall struggling to figure out what happens in the scenario below.

An array is of type `struct[a]<struct[b]<struct[c]<struct[d]<int32[e]>>>>` and we slice it. What happens when we select `a.b.c`?
The trouble was that if we don't pass the offset and length down to the `ArrayData` of `a`'s children, the children would always depend on knowing `a`'s offset, which forces us to recompute the effective offset each time we access `a` or any of its children.

So in principle I favoured pushing down the offset at the time, which I suppose has led us here:

> ArrayData::slice contains a special case for StructArray where it recurses the offset into its children. However, it preserves the offset on the parent ArrayData, in order for the validity buffer to work correctly.

___

> There are longer term suggestions around handling offsets in ArrayData differently, but until then I would like to propose:
> * Remove the ArrayData::slice special-case
> * Slice child data within StructArray when constructing boxed_fields

This makes sense to implement as a solution (interim?), but yeah, perhaps first prize would be propagating offsets and value lengths to a redesigned `Buffer` and `Bitmap`.
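To make the nested-struct scenario concrete, here is a minimal sketch in plain Rust. It is a toy model and not the arrow-rs API: `Node`, `slice_parent_only`, `slice_pushdown`, `select`, `visible` and `leaf_values_with_ancestor_offsets` are all hypothetical names, each struct level is assumed to have exactly one child, and validity bitmaps are left out entirely. It only contrasts recording the slice on the parent with pushing the offset down to every child.

```rust
use std::rc::Rc;

/// Toy stand-in for one level of `struct[a]<struct[b]<...int32[e]>>`.
/// Hypothetical type, not arrow-rs `ArrayData`.
#[derive(Clone, Debug)]
struct Node {
    name: &'static str,
    offset: usize,
    len: usize,
    values: Option<Rc<Vec<i32>>>, // Some(..) only on the int32 leaf; shared, never copied
    child: Option<Box<Node>>,     // Some(..) only on struct levels
}

impl Node {
    fn leaf(name: &'static str, values: Vec<i32>) -> Node {
        let len = values.len();
        Node { name, offset: 0, len, values: Some(Rc::new(values)), child: None }
    }

    fn wrap(name: &'static str, child: Node) -> Node {
        Node { name, offset: 0, len: child.len, values: None, child: Some(Box::new(child)) }
    }

    /// Strategy 1: record the slice on this node only; children are untouched.
    fn slice_parent_only(&self, offset: usize, len: usize) -> Node {
        assert!(offset + len <= self.len);
        let mut out = self.clone();
        out.offset += offset;
        out.len = len;
        out
    }

    /// Strategy 2: push the offset and length down to every descendant.
    /// The parent keeps its own offset as well, as in the quoted special case.
    fn slice_pushdown(&self, offset: usize, len: usize) -> Node {
        assert!(offset + len <= self.len);
        let mut out = self.clone();
        out.offset += offset;
        out.len = len;
        out.child = self.child.as_ref().map(|c| Box::new(c.slice_pushdown(offset, len)));
        out
    }

    /// Follow child links by name, e.g. `a.select(&["b", "c"])` for `a.b.c`.
    fn select(&self, path: &[&str]) -> &Node {
        let mut node = self;
        for name in path {
            let child = node.child.as_deref().expect("path goes past the leaf");
            assert_eq!(child.name, *name);
            node = child;
        }
        node
    }

    /// Values a leaf exposes using only its own offset and length.
    fn visible(&self) -> Vec<i32> {
        let values = self.values.as_ref().expect("only the leaf holds values");
        values[self.offset..self.offset + self.len].to_vec()
    }
}

/// With strategy 1 the reader must walk from the root and add up every
/// ancestor's offset on each access. Applying this to a strategy 2 slice
/// would count the same offset once per level.
fn leaf_values_with_ancestor_offsets(root: &Node) -> Vec<i32> {
    let (mut node, mut offset, len) = (root, root.offset, root.len);
    while let Some(child) = node.child.as_deref() {
        offset += child.offset;
        node = child;
    }
    let values = node.values.as_ref().expect("leaf holds values");
    values[offset..offset + len].to_vec()
}

fn build() -> Node {
    let e = Node::leaf("e", (0..8).collect());
    Node::wrap("a", Node::wrap("b", Node::wrap("c", Node::wrap("d", e))))
}

fn main() {
    let a = build();

    // Strategy 1: `a.b.c` on its own knows nothing about the slice.
    let lazy = a.slice_parent_only(2, 3);
    let c = lazy.select(&["b", "c"]);
    assert_eq!((c.offset, c.len), (0, 8));
    assert_eq!(leaf_values_with_ancestor_offsets(&lazy), vec![2, 3, 4]);

    // Strategy 2: every level is self-describing after the slice.
    let eager = a.slice_pushdown(2, 3);
    let c = eager.select(&["b", "c"]);
    assert_eq!((c.offset, c.len), (2, 3));
    assert_eq!(eager.select(&["b", "c", "d", "e"]).visible(), vec![2, 3, 4]);
    println!("both strategies expose the same three leaf values");
}
```

In this toy model, pushing the offset down means a selected child such as `a.b.c` can be handed around without its ancestors, at the cost of rewriting every level when slicing. Keeping the offset only on the parent makes slicing cheap, but every reader has to know to add the ancestor offsets, and mixing the two conventions would apply the offset more than once.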
