lidavidm edited a comment on pull request #10145:
URL: https://github.com/apache/arrow/pull/10145#issuecomment-827784250


   > I wonder if, given a bunch of small record batches, we might sometimes 
want to coalesce across record batches. I think the current design preempts 
that. Although I think there would be more challenges than just this tool to 
tackle that problem.
   
   So overall, the usage pattern for this class is:
   
   1. `Cache()` all byte ranges you expect to read in the future, at the granularity that you expect to read them. So you'd call `Cache` for every record batch (IPC), or for every column chunk (Parquet).
   2. `WaitFor()` the ranges that you need. For IPC, this would again be one 
record batch; for Parquet, this would be one row group's worth of column 
chunks. This can be done in parallel/reentrantly and is why we need the lock in 
the lazy variant.
   3. `Read` the ranges that you need.
   
   Since all the byte ranges are given up front, you do get coalescing across 
record batches/column chunks.
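
   The three steps above can be sketched with a minimal, self-contained stand-in. Everything here is illustrative, not Arrow's actual API: the class name `RangeCache`, the `hole_size_limit` argument, and the in-memory "file" are invented for the sketch. The point is step 1's sort-and-merge, which is what produces coalescing across record batches/column chunks when all ranges are given up front:

   ```cpp
   #include <algorithm>
   #include <cstdint>
   #include <string>
   #include <vector>
   
   // A byte range in the file.
   struct ReadRange {
     int64_t offset;
     int64_t length;
   };
   
   // Simplified stand-in: Cache() coalesces nearby ranges, WaitFor() stands
   // in for blocking on the covering I/O, Read() serves a range from the
   // "cached" data.
   class RangeCache {
    public:
     RangeCache(std::string file, int64_t hole_size_limit)
         : file_(std::move(file)), hole_size_limit_(hole_size_limit) {}
   
     // Step 1: register every range we expect to read. Sorting and merging
     // is what gives coalescing *across* record batches/column chunks:
     // ranges separated by a hole no larger than hole_size_limit become a
     // single physical read.
     void Cache(std::vector<ReadRange> ranges) {
       pending_.insert(pending_.end(), ranges.begin(), ranges.end());
       std::sort(pending_.begin(), pending_.end(),
                 [](const ReadRange& a, const ReadRange& b) {
                   return a.offset < b.offset;
                 });
       coalesced_.clear();
       for (const auto& r : pending_) {
         if (!coalesced_.empty()) {
           auto& back = coalesced_.back();
           if (r.offset - (back.offset + back.length) <= hole_size_limit_) {
             // Merge into the previous coalesced read.
             back.length = std::max(back.length, r.offset + r.length - back.offset);
             continue;
           }
         }
         coalesced_.push_back(r);
       }
     }
   
     // Step 2: in the real class this waits on the futures of the coalesced
     // reads covering the range; here the data is already in memory, so it
     // is a no-op.
     void WaitFor(const ReadRange&) {}
   
     // Step 3: serve the requested range from the cached buffer.
     std::string Read(const ReadRange& r) {
       return file_.substr(static_cast<size_t>(r.offset),
                           static_cast<size_t>(r.length));
     }
   
     // Number of physical reads after coalescing (for inspection).
     std::size_t num_coalesced_reads() const { return coalesced_.size(); }
   
    private:
     std::string file_;         // stands in for the underlying file
     int64_t hole_size_limit_;  // max gap to merge across
     std::vector<ReadRange> pending_;
     std::vector<ReadRange> coalesced_;
   };
   ```

   For example, with a hole limit of 5, caching ranges at offsets 0, 12, and 50 (each 10 bytes) yields two physical reads: the first two ranges merge across the 2-byte gap, while the third stays separate.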

