felipecrv commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1977304538

   >  it might be good enough to make ChunkResolver/ChunkedArrayResolver public
   
   There is a case for making `ChunkResolver` public already. Please create a 
separate issue to talk about `ChunkedArrayResolver`.
   
   Without knowing what `do_business_logic` is, I can't say it's impossible to 
process column by column (and I don't mean with vector instructions, just 
memory locality effects alone being beneficial). To use columnar representation 
well, the business logic has to be aware of the columnar representation.
   
   But my main problem with your code (and let me be more direct this time with 
what I mean by "random access") is that you're using `Resolve`/`Value` (which 
is `O(log(num_chunks))`) on every iteration when you could be incrementing each 
`ChunkLocation` in O(1) without having to rely on the caching in `Resolve` to 
make "O(1) most of the time" + overhead.
   
   To that, we can add these two to `ChunkResolver`:
   
   ```cpp
     /// \pre loc.chunk_index >= 0
     /// \pre loc.index_in_chunk is assumed valid if chunk_index is not the 
last one
     inline bool Valid(ChunkLocation loc) const {
       const int64_t last_chunk_index = static_cast<int64_t>(offsets_.size()) - 
1;
       return loc.chunk_index + 1 < last_chunk_index ||
              (loc.chunk_index + 1 == last_chunk_index &&
               loc.index_in_chunk < offsets_[last_chunk_index]);
     }
   
     /// \pre Valid(loc)
     inline ChunkLocation Next(ChunkLocation loc) const {
       const int64_t next_index_in_chunk = loc.index_in_chunk + 1;
       return (next_index_in_chunk < offsets_[loc.chunk_index + 1])
                  ? ChunkLocation{loc.chunk_index, next_index_in_chunk}
                  : ChunkLocation{loc.chunk_index + 1, 0};
     }
   ```
   
   Then your loops can be:
   
   ```cpp
   ChunkResolver resolver(batches);
   for (ChunkLocation loc; resolver.Valid(loc); loc = resolved.Next(loc)) {
     // re-use loc for all the typed columns since they are split on the same 
offsets
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to