[I] parquet/arrow: should sync/async readers converge on a shared physical read planner [arrow-rs]

via GitHub Sun, 19 Apr 2026 00:29:03 -0700


lyang24 opened a new issue, #9764:
URL: https://github.com/apache/arrow-rs/issues/9764


   I’ve been looking into the parquet readers, and I think there’s an 
architectural problem worth discussing.
   
   Today we effectively have three different byte-fetch/control-flow shapes:
   
     - sync ParquetRecordBatchReaderBuilder::build()
     - push decoder / DataRequestBuilder
     - async wrapping the push-decoder path
   
   That makes it hard to do any of the following cleanly:
   
     - share fixes across reader paths
     - reason about performance regressions/wins
     - evaluate backend changes like pread / batched range fetch / mmap 
     
   Do we want a shared internal artifact for “what bytes should this row group 
read next”, and then separate executors for sync / push / async?
   
   Very roughly:
   
     - planner: metadata + projection + current selection + available chunks + 
optional offset index -> planned byte ranges
     - executor: fetch those ranges
     - assembly: map fetched bytes back into InMemoryRowGroup / column chunks
     - existing decode logic stays mostly where it is
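
To make the split above concrete, here is a minimal sketch of the three pieces. It is not the existing arrow-rs API: `ChunkMeta`, `PlannedRead`, `plan_reads`, `RangeExecutor`, `SliceExecutor`, and `assemble` are all hypothetical names, and the "current selection" is simplified to a list of page indices rather than a real row selection. The point is only that the planner is a pure function with no I/O, so sync, push, and async executors could all consume the same plan.

```rust
use std::collections::HashMap;
use std::io;
use std::ops::Range;

/// Hypothetical, simplified stand-in for one column chunk's metadata.
#[derive(Debug, Clone)]
struct ChunkMeta {
    column: usize,
    /// Byte range of the whole column chunk in the file.
    byte_range: Range<u64>,
    /// Per-page byte ranges, present only when an offset index is available.
    pages: Option<Vec<Range<u64>>>,
}

/// Hypothetical planner output: one byte range to fetch for one column.
#[derive(Debug, PartialEq)]
struct PlannedRead {
    column: usize,
    range: Range<u64>,
}

/// Planner: metadata + projection + selection + available chunks -> byte ranges.
/// Performs no I/O, so every reader path can share it.
fn plan_reads(
    chunks: &[ChunkMeta],
    projection: &[usize],
    selected_pages: Option<&[usize]>, // assumption: selection already mapped to page indices
    already_fetched: &[usize],
) -> Vec<PlannedRead> {
    let mut plan = Vec::new();
    for chunk in chunks {
        // Skip columns that are not projected or whose bytes are already in memory.
        if !projection.contains(&chunk.column) || already_fetched.contains(&chunk.column) {
            continue;
        }
        match (&chunk.pages, selected_pages) {
            // Offset index and selection both present: fetch only the selected pages.
            (Some(pages), Some(selected)) => {
                for &i in selected {
                    if let Some(r) = pages.get(i) {
                        plan.push(PlannedRead { column: chunk.column, range: r.clone() });
                    }
                }
            }
            // Otherwise fall back to fetching the whole column chunk.
            _ => plan.push(PlannedRead {
                column: chunk.column,
                range: chunk.byte_range.clone(),
            }),
        }
    }
    plan
}

/// Executor: fetches planned ranges. A sync, push, or async backend would
/// each provide its own implementation (the async one with an async signature).
trait RangeExecutor {
    fn fetch(&mut self, ranges: &[Range<u64>]) -> io::Result<Vec<Vec<u8>>>;
}

/// Toy executor over an in-memory buffer, standing in for pread / mmap / object store.
struct SliceExecutor<'a>(&'a [u8]);

impl RangeExecutor for SliceExecutor<'_> {
    fn fetch(&mut self, ranges: &[Range<u64>]) -> io::Result<Vec<Vec<u8>>> {
        ranges
            .iter()
            .map(|r| {
                self.0
                    .get(r.start as usize..r.end as usize)
                    .map(|s| s.to_vec())
                    .ok_or_else(|| {
                        io::Error::new(io::ErrorKind::UnexpectedEof, "range out of bounds")
                    })
            })
            .collect()
    }
}

/// Assembly: map fetched buffers back to their columns, keyed by column index,
/// keeping each buffer's starting file offset so a decoder can locate pages.
fn assemble(plan: &[PlannedRead], buffers: Vec<Vec<u8>>) -> HashMap<usize, Vec<(u64, Vec<u8>)>> {
    let mut out: HashMap<usize, Vec<(u64, Vec<u8>)>> = HashMap::new();
    for (read, buf) in plan.iter().zip(buffers) {
        out.entry(read.column).or_default().push((read.range.start, buf));
    }
    out
}
```

With this shape, a backend change such as batched range fetch would only touch a `RangeExecutor` implementation, and a planning fix would apply to every reader path at once. How the assembled buffers map onto InMemoryRowGroup is left open here.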
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

