adriangb commented on PR #20481: URL: https://github.com/apache/datafusion/pull/20481#issuecomment-3981366495
> > I want to test one alternative: > > Instead of creating a (relatively) complex queue / synchronisation mechanism we could also do this: > > ``` > > * harness the partitions (i.e. each file/rowfilter = 1 partition) > > > > * add another node which merges m input partitions (~morsel) into n target partitions (number of threads) > > > > * just pull from different input partitions based on a atomic partition (round-robin style) > > ``` > > This is basically what we do in our system. We use a parallelism of 128 and then repartition that into num_threads. But this is more for the "I have 10k files" case, not the "I have 1 file and 12 cores" case. I poked around a bit and came up with this: https://gist.github.com/adriangb/ae5c0ebc56d0e72a955cdbfe5dd7e23f TLDR try to make a single mpmc pipeline of open -> moreselize -> read morsel <- partitions pull -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
