Dandandan commented on PR #20481: URL: https://github.com/apache/datafusion/pull/20481#issuecomment-3982191931
Dandandan commented on PR #20481: URL: https://github.com/apache/datafusion/pull/20481#issuecomment-3982191931

> > > I want to test one alternative:
> > >
> > > Instead of creating a (relatively) complex queue / synchronisation mechanism we could also do this:
> > >
> > > * harness the partitions (i.e. each file / row filter = 1 partition)
> > > * add another node which merges m input partitions (~morsel) into n target partitions (number of threads)
> > > * just pull from different input partitions based on an atomic partition counter (round-robin style)
> >
> > This is basically what we do in our system. We use a parallelism of 128 and then repartition that into num_threads. But this is more for the "I have 10k files" case, not the "I have 1 file and 12 cores" case.
>
> I poked around a bit and came up with this: https://gist.github.com/adriangb/ae5c0ebc56d0e72a955cdbfe5dd7e23f
>
> TLDR: try to make a single mpmc pipeline of open -> morselize -> read morsel <- partitions pull

Cool. I think there might be some "problems" with this approach.

> Metadata fetches are
> /// I/O-bound

I think this is probably taken from the code, but it is not really true in practice (metadata handling involves decoding / summarizing / ... as well).

I think there are some benefits to the current approach in this PR:

* we minimize the number of `ParquetMetadata` objects (with pages) in flight; with the cache disabled, this means we can keep the memory overhead smaller when handling queries.
* we have exact control over read order, e.g. we now read row groups *globally* in order. For object stores we could change the strategy to e.g.
interleave reading from different files (an object store like S3 doesn't like too many requests to the same file)
* by having two queues, files *to morselize* and *morsels*, we can introduce extra (bounded) concurrency later if we want

I am coming to the conclusion that the difference between:

* morselizing a pre-split file and directly picking the first row group, pushing the rest back on the queue (faster for a few queries, up to 3.5x, but a few small regressions)
* morselizing entire files only, always pushing everything back to the queue (current approach; seems to have no regressions, but some of the speedups are lower)

is mostly due to some randomness / luck in which particular files happen to deliver selective filters first. I think we could look more in the future at how to optimize the file order / row group order by picking the most "selective" files early as morsels, so dynamic filters become selective early, with the least IO for subsequent scans.
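The round-robin alternative quoted at the top (m input partitions pulled into n output partitions via a shared atomic) could be sketched roughly as below. `RoundRobinPull` and its fields are hypothetical names for illustration, not types from this PR or from DataFusion:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Sketch: n output partitions (threads) all share this state and use it
/// to decide which of the m input partitions to poll next.
struct RoundRobinPull {
    next: AtomicUsize,
    num_inputs: usize,
}

impl RoundRobinPull {
    fn new(num_inputs: usize) -> Self {
        Self { next: AtomicUsize::new(0), num_inputs }
    }

    /// Returns the index of the input partition to poll next.
    /// `fetch_add` returns the previous value, so concurrent callers each
    /// get a distinct ticket and the inputs are visited round-robin.
    fn next_input(&self) -> usize {
        self.next.fetch_add(1, Ordering::Relaxed) % self.num_inputs
    }
}

fn main() {
    let rr = RoundRobinPull::new(4);
    // 8 pulls cycle through the 4 inputs twice.
    let order: Vec<usize> = (0..8).map(|_| rr.next_input()).collect();
    assert_eq!(order, vec![0, 1, 2, 3, 0, 1, 2, 3]);
}
```

The appeal of this shape is that the only shared state is one atomic counter, so no queue or lock is needed between output partitions; the trade-off, as noted above, is that it gives up precise control over global read order.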
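The two-queue design described above (files *to morselize* plus ready *morsels*), with the "morselize entire files only, push everything back" policy, might look something like this minimal single-process sketch. `MorselQueues` and `Morsel` are made-up names, files and row groups are just indices here, and the real implementation would be async with bounded concurrency:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
struct Morsel {
    file: usize,
    row_group: usize,
}

struct MorselQueues {
    /// Files whose metadata has not been turned into morsels yet.
    files: Mutex<VecDeque<usize>>,
    /// Row-group morsels ready to be read by any partition.
    morsels: Mutex<VecDeque<Morsel>>,
}

impl MorselQueues {
    fn new(files: impl IntoIterator<Item = usize>) -> Self {
        Self {
            files: Mutex::new(files.into_iter().collect()),
            morsels: Mutex::new(VecDeque::new()),
        }
    }

    /// A consumer first tries the morsel queue; if it is empty, it takes
    /// one file, splits it into row-group morsels, and pushes *all* of
    /// them back before pulling ("morselize entire files only").
    fn next_morsel(&self, row_groups_per_file: usize) -> Option<Morsel> {
        loop {
            if let Some(m) = self.morsels.lock().unwrap().pop_front() {
                return Some(m);
            }
            let file = self.files.lock().unwrap().pop_front()?;
            let mut morsels = self.morsels.lock().unwrap();
            for rg in 0..row_groups_per_file {
                morsels.push_back(Morsel { file, row_group: rg });
            }
        }
    }
}

fn main() {
    let queues = MorselQueues::new([0, 1]);
    let mut order = Vec::new();
    while let Some(m) = queues.next_morsel(2) {
        order.push((m.file, m.row_group));
    }
    // FIFO queues give the global read order claimed above:
    // all row groups of file 0, then all of file 1.
    assert_eq!(order, vec![(0, 0), (0, 1), (1, 0), (1, 1)]);
}
```

Because both queues are FIFO, the global row-group order falls out for free, and reordering strategies (interleaving files, picking "selective" files first) reduce to choosing a different pop order on the `files` queue.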
