Re: [PR] Introduce morsel-driven Parquet scan [datafusion]

via GitHub Sun, 01 Mar 2026 16:12:05 -0800


adriangb commented on PR #20481:
URL: https://github.com/apache/datafusion/pull/20481#issuecomment-3981366495


   > > I want to test one alternative:
   > > Instead of creating a (relatively) complex queue / synchronisation mechanism we could also do this:
   > > ```
   > > * harness the partitions (i.e. each file/rowfilter = 1 partition)
   > > 
   > > * add another node which merges m input partitions (~morsel) into n target partitions (number of threads)
   > > 
   > > * just pull from different input partitions based on an atomic partition counter (round-robin style)
   > > ```
   > 
   > This is basically what we do in our system. We use a parallelism of 128 and then repartition that into num_threads. But this is more for the "I have 10k files" case, not the "I have 1 file and 12 cores" case.
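For reference, the quoted alternative (m input partitions merged into n target partitions by pulling on a shared atomic counter) could look roughly like the sketch below. It uses only the Rust standard library, not DataFusion's APIs. `Morsel` and `merge_partitions` are hypothetical names invented for illustration, and a "morsel" is just a vector of numbers standing in for the rows of one file/row-filter partition.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for one input partition (one file / row filter).
/// Here a "morsel" is just a vector of row values.
type Morsel = Vec<u64>;

/// Drain `morsels` (m input partitions) using `n_targets` worker threads.
/// Each worker claims the next unclaimed input via a shared atomic counter
/// and returns the input indices it processed plus the sum of the rows it saw.
fn merge_partitions(morsels: Vec<Morsel>, n_targets: usize) -> Vec<(Vec<usize>, u64)> {
    let morsels = Arc::new(morsels);
    let next = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..n_targets)
        .map(|_| {
            let morsels = Arc::clone(&morsels);
            let next = Arc::clone(&next);
            thread::spawn(move || {
                let mut claimed = Vec::new();
                let mut sum = 0u64;
                loop {
                    // fetch_add hands every index to exactly one worker.
                    let i = next.fetch_add(1, Ordering::Relaxed);
                    if i >= morsels.len() {
                        break; // all input partitions have been claimed
                    }
                    claimed.push(i);
                    sum += morsels[i].iter().sum::<u64>();
                }
                (claimed, sum)
            })
        })
        .collect();

    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Calling `merge_partitions(inputs, 12)` with a single large file split into many inputs keeps all 12 workers busy until the counter runs out. Note that the counter gives "whoever is free takes the next input" rather than a strict round-robin assignment, which is an assumption about what the quoted proposal intends.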
   
   I poked around a bit and came up with this: 
https://gist.github.com/adriangb/ae5c0ebc56d0e72a955cdbfe5dd7e23f
   
   TL;DR: try to build a single MPMC pipeline: open -> morselize -> read morsel <- partitions pull
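To make the shape of that pipeline concrete, here is a minimal sketch using only the Rust standard library. It is not the code from the gist and does not use DataFusion's APIs: `Morsel`, `morselize`, and `run_pipeline` are hypothetical names, a morsel is assumed to be a range of row groups, and the row-group count is passed in rather than read from a Parquet footer. Since `std::sync::mpsc` is single-consumer, the receiver is shared behind a mutex to approximate a multi-consumer queue.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical morsel: a range of row groups within one file.
#[derive(Debug, Clone, PartialEq)]
struct Morsel {
    file: String,
    row_groups: std::ops::Range<usize>,
}

/// "Open" a file and split it into morsels of `size` row groups each.
/// `size` must be non-zero. The row-group count is passed in here;
/// a real scan would read it from the file metadata.
fn morselize(file: &str, n_row_groups: usize, size: usize) -> Vec<Morsel> {
    (0..n_row_groups)
        .step_by(size)
        .map(|start| Morsel {
            file: file.to_string(),
            row_groups: start..(start + size).min(n_row_groups),
        })
        .collect()
}

/// One producer opens and morselizes files; `n_partitions` consumers pull
/// morsels from the shared queue. Returns the morsels each partition read.
fn run_pipeline(
    files: Vec<(String, usize)>,
    size: usize,
    n_partitions: usize,
) -> Vec<Vec<Morsel>> {
    let (tx, rx) = mpsc::channel::<Morsel>();
    // std's channel is single-consumer, so share the receiver behind a mutex.
    let rx = Arc::new(Mutex::new(rx));

    let producer = thread::spawn(move || {
        for (file, n_row_groups) in files {
            for morsel in morselize(&file, n_row_groups, size) {
                tx.send(morsel).unwrap();
            }
        }
        // `tx` is dropped here, which closes the channel.
    });

    let consumers: Vec<_> = (0..n_partitions)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut read = Vec::new();
                loop {
                    // The lock is held only while receiving one morsel.
                    let next = rx.lock().unwrap().recv();
                    match next {
                        Ok(morsel) => read.push(morsel), // the actual read would go here
                        Err(_) => break, // channel closed and drained
                    }
                }
                read
            })
        })
        .collect();

    producer.join().unwrap();
    consumers.into_iter().map(|h| h.join().unwrap()).collect()
}
```

With one file of many row groups and 12 partitions, every partition pulls from the same queue, so this shape covers the "1 file and 12 cores" case as well as the "10k files" case. An async implementation would presumably replace the threads and mutex with streams and a real MPMC channel.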


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

