lyang24 opened a new issue, #9764:
URL: https://github.com/apache/arrow-rs/issues/9764
I’ve been looking into the readers in the parquet crate, and I think there’s an
architectural problem worth discussing.
Today we effectively have three different byte-fetch/control-flow shapes:
- sync ParquetRecordBatchReaderBuilder::build()
- push decoder / DataRequestBuilder
- async wrapping the push-decoder path
That makes it hard to do any of the following cleanly:
- share fixes across reader paths
- reason about performance regressions/wins
- evaluate backend changes like pread / batched range fetch / mmap
Do we want a shared internal artifact for “what bytes should this row group
read next”, and then separate executors for sync / push / async?
Very roughly:
- planner: metadata + projection + current selection + available chunks +
optional offset index -> planned byte ranges
- executor: fetch those ranges
- assembly: map fetched bytes back into InMemoryRowGroup / column chunks
- existing decode logic stays mostly where it is
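
To make the split concrete, here is a minimal sketch of the planner / executor / assembly boundaries. Every name in it (`ChunkExtent`, `FetchPlan`, `plan_row_group`, `RangeExecutor`, `SliceExecutor`, `assemble`) is hypothetical and is not an existing arrow-rs API. It plans whole column chunks only and leaves out row selection and the offset index, which a real planner would use to narrow the ranges to individual pages.

```rust
use std::io;
use std::ops::Range;

/// Hypothetical: where one column chunk of a row group lives in the file.
#[derive(Debug, Clone)]
struct ChunkExtent {
    column: usize,
    range: Range<u64>,
}

/// Hypothetical shared artifact: the byte ranges this row group still needs.
#[derive(Debug, PartialEq)]
struct FetchPlan {
    ranges: Vec<(usize, Range<u64>)>,
}

/// Planner: a pure function with no I/O, so sync, push, and async paths
/// can all call it and get the same answer.
fn plan_row_group(
    chunks: &[ChunkExtent],
    projection: &[usize],
    already_fetched: &[usize],
) -> FetchPlan {
    let ranges = chunks
        .iter()
        // Keep projected columns whose bytes are not available yet.
        .filter(|c| projection.contains(&c.column) && !already_fetched.contains(&c.column))
        .map(|c| (c.column, c.range.clone()))
        .collect();
    FetchPlan { ranges }
}

/// Executor: the only piece that differs per backend (pread, mmap, object store).
/// This is the sync shape; an async executor would return a future instead.
trait RangeExecutor {
    fn fetch(&mut self, ranges: &[Range<u64>]) -> io::Result<Vec<Vec<u8>>>;
}

/// Toy executor over an in-memory buffer, standing in for a real backend.
struct SliceExecutor<'a> {
    data: &'a [u8],
}

impl RangeExecutor for SliceExecutor<'_> {
    fn fetch(&mut self, ranges: &[Range<u64>]) -> io::Result<Vec<Vec<u8>>> {
        ranges
            .iter()
            .map(|r| {
                self.data
                    .get(r.start as usize..r.end as usize)
                    .map(|s| s.to_vec())
                    .ok_or_else(|| {
                        io::Error::new(io::ErrorKind::UnexpectedEof, "range out of bounds")
                    })
            })
            .collect()
    }
}

/// Assembly: pair fetched buffers with the columns they belong to, in plan order.
/// In the real reader this is where bytes would be mapped into InMemoryRowGroup.
fn assemble(plan: &FetchPlan, bytes: Vec<Vec<u8>>) -> Vec<(usize, Vec<u8>)> {
    plan.ranges.iter().map(|(col, _)| *col).zip(bytes).collect()
}
```

The point of the shape is that `plan_row_group` can be unit-tested and benchmarked without any I/O, and that trying a different backend means swapping only the `RangeExecutor` implementation.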