Dandandan commented on PR #20481: URL: https://github.com/apache/datafusion/pull/20481#issuecomment-3982191931
Dandandan commented on PR #20481: URL: https://github.com/apache/datafusion/pull/20481#issuecomment-3982191931

> > > I want to test one alternative:
> > >
> > > Instead of creating a (relatively) complex queue / synchronisation mechanism we could also do this:
> > >
> > > * harness the partitions (i.e. each file / row filter = 1 partition)
> > > * add another node which merges m input partitions (~morsel) into n target partitions (number of threads)
> > > * just pull from different input partitions based on an atomic partition counter (round-robin style)
> >
> > This is basically what we do in our system. We use a parallelism of 128 and then repartition that into num_threads. But this is more for the "I have 10k files" case, not the "I have 1 file and 12 cores" case.
>
> I poked around a bit and came up with this: https://gist.github.com/adriangb/ae5c0ebc56d0e72a955cdbfe5dd7e23f
>
> TLDR: try to make a single mpmc pipeline of open -> morselize -> read morsel <- partitions pull

Cool. I think there might be some "problems" with this approach.

> Metadata fetches are
> /// I/O-bound

I think this is probably taken from the code, but it is not really true in practice (metadata handling involves decoding / summarizing / ... as well).

I think there are some benefits to the current approach in this PR:

* we minimize the number of `ParquetMetadata` objects (with pages) in flight; with the cache disabled, this means we can keep the memory overhead smaller when handling queries.
* we have exact control over read order, e.g. we now read row groups *globally* in order. For object stores we could change the strategy to e.g.
interleave reading from different files (an object store like S3 doesn't like too many requests to the same file)
* by having two queues, files *to morselize* and *morsels*, we can introduce extra (bounded) concurrency later if we want

I am coming to the conclusion that the difference between:

* morselizing a pre-split file and directly picking the first row group, pushing the rest back on the queue (faster for a few queries, up to 3.5x, but a few small regressions)
* morselizing entire files only, always pushing everything back to the queue (current approach; seems to have no regressions, but some of the speedups are lower)

is mostly due to some randomness / luck in which particular files happen to deliver selective filters first. I think we could look more in the future at how to optimize the file order / row group order by picking the most "selective" files early as morsels, so dynamic filters become selective early, with the least IO for subsequent scans.
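The round-robin alternative quoted at the top (m input partitions pulled into n output partitions via a shared atomic) could be sketched roughly as below. `RoundRobinPull` and its fields are hypothetical names for illustration, not types from this PR or from DataFusion:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Sketch: n output partitions (threads) all share this state and use it
/// to decide which of the m input partitions to poll next.
struct RoundRobinPull {
    next: AtomicUsize,
    num_inputs: usize,
}

impl RoundRobinPull {
    fn new(num_inputs: usize) -> Self {
        Self { next: AtomicUsize::new(0), num_inputs }
    }

    /// Returns the index of the input partition to poll next.
    /// `fetch_add` returns the previous value, so concurrent callers each
    /// get a distinct ticket and the inputs are visited round-robin.
    fn next_input(&self) -> usize {
        self.next.fetch_add(1, Ordering::Relaxed) % self.num_inputs
    }
}

fn main() {
    let rr = RoundRobinPull::new(4);
    // 8 pulls cycle through the 4 inputs twice.
    let order: Vec<usize> = (0..8).map(|_| rr.next_input()).collect();
    assert_eq!(order, vec![0, 1, 2, 3, 0, 1, 2, 3]);
}
```

The appeal of this shape is that the only shared state is one atomic counter, so no queue or lock is needed between output partitions; the trade-off, as noted above, is that it gives up precise control over global read order.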
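The two-queue design described above (files *to morselize* plus ready *morsels*), with the "morselize entire files only, push everything back" policy, might look something like this minimal single-process sketch. `MorselQueues` and `Morsel` are made-up names, files and row groups are just indices here, and the real implementation would be async with bounded concurrency:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
struct Morsel {
    file: usize,
    row_group: usize,
}

struct MorselQueues {
    /// Files whose metadata has not been turned into morsels yet.
    files: Mutex<VecDeque<usize>>,
    /// Row-group morsels ready to be read by any partition.
    morsels: Mutex<VecDeque<Morsel>>,
}

impl MorselQueues {
    fn new(files: impl IntoIterator<Item = usize>) -> Self {
        Self {
            files: Mutex::new(files.into_iter().collect()),
            morsels: Mutex::new(VecDeque::new()),
        }
    }

    /// A consumer first tries the morsel queue; if it is empty, it takes
    /// one file, splits it into row-group morsels, and pushes *all* of
    /// them back before pulling ("morselize entire files only").
    fn next_morsel(&self, row_groups_per_file: usize) -> Option<Morsel> {
        loop {
            if let Some(m) = self.morsels.lock().unwrap().pop_front() {
                return Some(m);
            }
            let file = self.files.lock().unwrap().pop_front()?;
            let mut morsels = self.morsels.lock().unwrap();
            for rg in 0..row_groups_per_file {
                morsels.push_back(Morsel { file, row_group: rg });
            }
        }
    }
}

fn main() {
    let queues = MorselQueues::new([0, 1]);
    let mut order = Vec::new();
    while let Some(m) = queues.next_morsel(2) {
        order.push((m.file, m.row_group));
    }
    // FIFO queues give the global read order claimed above:
    // all row groups of file 0, then all of file 1.
    assert_eq!(order, vec![(0, 0), (0, 1), (1, 0), (1, 1)]);
}
```

Because both queues are FIFO, the global row-group order falls out for free, and reordering strategies (interleaving files, picking "selective" files first) reduce to choosing a different pop order on the `files` queue.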
