SChakravorti21 commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1977490469

   > There is a case for making `ChunkResolver` public already. Please create a 
separate issue to talk about `ChunkedArrayResolver`.
   
   For sure, I can make a separate issue for `ChunkedArrayResolver`. My bad for 
mixing these two discussions.
   
   > But my main problem with your code (and let me be more direct this time 
with what I mean by "random access") is that you're using `Resolve`/`Value` 
(which is `O(log(num_chunks))`) on every iteration when you could be 
incrementing each `ChunkLocation` in O(1) without having to rely on the caching 
in `Resolve` to make it "O(1) most of the time" + overhead.
   
   That makes sense, I didn't fully understand what you meant previously.
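
A minimal sketch of the distinction, using stand-in types rather than Arrow's actual `ChunkResolver`/`ChunkLocation`. The names `Loc`, `ResolveByOffset`, `FirstLoc`, `NextLoc`, and `IsValid` are hypothetical and exist only for illustration. Random access does a binary search per lookup, while sequential iteration only bumps an index and occasionally steps to the next chunk:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a chunk location: (chunk index, index within chunk).
struct Loc {
  int64_t chunk;
  int64_t index;
};

// Random access: binary search over cumulative offsets, O(log(num_chunks)).
// `offsets` has num_chunks + 1 entries and starts at 0; `logical_index`
// must be in [0, offsets.back()).
Loc ResolveByOffset(const std::vector<int64_t>& offsets, int64_t logical_index) {
  auto it = std::upper_bound(offsets.begin(), offsets.end(), logical_index);
  int64_t chunk = static_cast<int64_t>(it - offsets.begin()) - 1;
  return Loc{chunk, logical_index - offsets[chunk]};
}

// A location is valid while it still points inside some chunk.
bool IsValid(const std::vector<int64_t>& lengths, Loc loc) {
  return loc.chunk < static_cast<int64_t>(lengths.size());
}

// Sequential advance: amortized O(1) per step, skipping empty chunks.
Loc NextLoc(const std::vector<int64_t>& lengths, Loc loc) {
  ++loc.index;
  while (IsValid(lengths, loc) && loc.index >= lengths[loc.chunk]) {
    ++loc.chunk;
    loc.index = 0;
  }
  return loc;
}

// First valid location (or an invalid one if every chunk is empty).
Loc FirstLoc(const std::vector<int64_t>& lengths) {
  return NextLoc(lengths, Loc{0, -1});
}
```

Both paths visit the same locations in the same order; the sequential one just never searches.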
   
   I think the API additions you're suggesting make sense, but I'm confused about 
how someone would use them to iterate over multiple columns simultaneously. Is 
there such a thing as a "typed `ChunkedArray`"? Otherwise how would we expect 
someone to access the values inside this loop:
   
   ```cpp
   ChunkResolver resolver(batches);
   
   for (ChunkLocation loc; resolver.Valid(loc); loc = resolver.Next(loc)) {
     // what is the most efficient way to access the values for each column here?
   }
   ```
   
   The benefit of just iterating over the batches themselves is that we only 
perform the cast from untyped `Array` to typed array (`Int64Array`, 
`StringArray`, etc.) once per column per batch. This is cheaper if someone 
prefers to use `dynamic_cast` for safety vs. doing a cast once per datapoint.
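
As a rough illustration of the per-batch pattern described above, here is a sketch that walks two columns in lockstep. It uses stand-in types, not Arrow's `Array`/`Int64Array`/`RecordBatch`: `Column`, `Int64Column`, `Batch`, and `SumOfProducts` are hypothetical names. The point is only that the checked cast happens once per column per batch, and the inner loop touches typed values directly:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical untyped base, standing in for an untyped array.
struct Column {
  virtual ~Column() = default;
};

// Hypothetical typed column, standing in for a typed array.
struct Int64Column : Column {
  std::vector<int64_t> values;
};

// A batch is a list of equally long columns.
using Batch = std::vector<std::shared_ptr<Column>>;

// Sums a[i] * b[i] over all rows of all batches.
int64_t SumOfProducts(const std::vector<Batch>& batches, std::size_t col_a,
                      std::size_t col_b) {
  int64_t total = 0;
  for (const Batch& batch : batches) {
    // One dynamic_cast per column per batch, not one per row.
    const auto* a = dynamic_cast<const Int64Column*>(batch[col_a].get());
    const auto* b = dynamic_cast<const Int64Column*>(batch[col_b].get());
    if (a == nullptr || b == nullptr) continue;  // skip batches of another type

    const std::size_t rows = std::min(a->values.size(), b->values.size());
    for (std::size_t i = 0; i < rows; ++i) {
      total += a->values[i] * b->values[i];
    }
  }
  return total;
}
```

With a location-based loop, the same typed pointers would need to be refreshed whenever the chunk index changes, which is the part the proposed API leaves open.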

