SChakravorti21 commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1977490469
> There is a case for making `ChunkResolver` public already. Please create a
> separate issue to talk about `ChunkedArrayResolver`.
For sure, I can make a separate issue for `ChunkedArrayResolver`. My bad for
mixing these two discussions.
> But my main problem with your code (and let me be more direct this time
> with what I mean by "random access") is that you're using `Resolve`/`Value`
> (which is `O(log(num_chunks))`) on every iteration when you could be
> incrementing each `ChunkLocation` in O(1) without having to rely on the caching
> in `Resolve` to make it "O(1) most of the time" + overhead.
That makes sense; I didn't fully understand what you meant previously.
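To check my understanding of the O(1) stepping, here is a minimal sketch written against stand-in types rather than Arrow itself. `Loc`, `SkipExhausted`, `First`, `Valid`, `Next`, and `VisitAll` are hypothetical names for illustration only, not the existing or proposed `ChunkResolver` API, and the only input assumed is a list of chunk lengths.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a (chunk, offset) position; not arrow::ChunkLocation.
struct Loc {
  int64_t chunk_index = 0;
  int64_t index_in_chunk = 0;
};

// Move past chunks that are empty or already exhausted.
inline void SkipExhausted(const std::vector<int64_t>& chunk_lengths, Loc& loc) {
  const auto num_chunks = static_cast<int64_t>(chunk_lengths.size());
  while (loc.chunk_index < num_chunks &&
         loc.index_in_chunk >=
             chunk_lengths[static_cast<std::size_t>(loc.chunk_index)]) {
    ++loc.chunk_index;
    loc.index_in_chunk = 0;
  }
}

// Position of the first row, or an invalid position if there are no rows.
inline Loc First(const std::vector<int64_t>& chunk_lengths) {
  Loc loc;
  SkipExhausted(chunk_lengths, loc);
  return loc;
}

// A position is valid while it still points into some chunk.
inline bool Valid(const std::vector<int64_t>& chunk_lengths, const Loc& loc) {
  return loc.chunk_index < static_cast<int64_t>(chunk_lengths.size());
}

// Step to the next row without any binary search over chunk offsets.
inline Loc Next(const std::vector<int64_t>& chunk_lengths, Loc loc) {
  ++loc.index_in_chunk;
  SkipExhausted(chunk_lengths, loc);
  return loc;
}

// Collect every position in order, to show the shape of the loop.
inline std::vector<Loc> VisitAll(const std::vector<int64_t>& chunk_lengths) {
  std::vector<Loc> visited;
  for (Loc loc = First(chunk_lengths); Valid(chunk_lengths, loc);
       loc = Next(chunk_lengths, loc)) {
    visited.push_back(loc);
  }
  return visited;
}
```

With chunk lengths `{2, 0, 3}` this visits five positions and skips the empty chunk. Each step is O(1) amortized over the traversal, since every chunk boundary is crossed at most once.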
I think the API additions you're suggesting make sense, but I'm not sure how
someone would use them to iterate over multiple columns simultaneously. Is
there such a thing as a "typed `ChunkedArray`"? If not, how would we expect
someone to access the values inside this loop?
```cpp
ChunkResolver resolver(batches);
for (ChunkLocation loc; resolver.Valid(loc); loc = resolver.Next(loc)) {
  // What is the most efficient way to access the values for each column here?
}
```
The benefit of just iterating over the batches themselves is that we only
perform the cast from untyped `Array` to typed array (`Int64Array`,
`StringArray`, etc.) once per column per batch. That is cheaper than casting
once per data point, especially if someone prefers `dynamic_cast` for safety.
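To make the per-batch pattern concrete, here is a rough sketch using stand-in types so that it compiles without Arrow. The `Array`, `Int64Array`, and `StringArray` structs below are simplified, hypothetical mock-ups that only borrow the names and the `Value`/`GetString` accessors from the real classes; `Batch`, `Totals`, and `Scan` are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for arrow::Array and two typed subclasses.
struct Array {
  virtual ~Array() = default;
  virtual int64_t length() const = 0;
};

struct Int64Array : Array {
  std::vector<int64_t> values;
  explicit Int64Array(std::vector<int64_t> v) : values(std::move(v)) {}
  int64_t length() const override { return static_cast<int64_t>(values.size()); }
  int64_t Value(int64_t i) const { return values[static_cast<std::size_t>(i)]; }
};

struct StringArray : Array {
  std::vector<std::string> values;
  explicit StringArray(std::vector<std::string> v) : values(std::move(v)) {}
  int64_t length() const override { return static_cast<int64_t>(values.size()); }
  const std::string& GetString(int64_t i) const {
    return values[static_cast<std::size_t>(i)];
  }
};

// One untyped array per column, standing in for a record batch.
using Batch = std::vector<std::shared_ptr<Array>>;

struct Totals {
  int64_t sum = 0;    // sum of column 0
  int64_t chars = 0;  // total string length of column 1
};

Totals Scan(const std::vector<Batch>& batches) {
  Totals totals;
  for (const Batch& batch : batches) {
    // The checked downcasts happen once per column per batch,
    // outside the row loop.
    const auto* ids = dynamic_cast<const Int64Array*>(batch.at(0).get());
    const auto* names = dynamic_cast<const StringArray*>(batch.at(1).get());
    if (ids == nullptr || names == nullptr) {
      throw std::runtime_error("unexpected column type");
    }
    // The row loop only touches typed pointers; no casts per data point.
    for (int64_t i = 0; i < ids->length(); ++i) {
      totals.sum += ids->Value(i);
      totals.chars += static_cast<int64_t>(names->GetString(i).size());
    }
  }
  return totals;
}
```

The number of `dynamic_cast` calls here scales with the number of batches times the number of columns, not with the number of rows, which is the cost difference I'm worried about losing with a per-row `ChunkLocation` loop.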