darmie commented on issue #20324: URL: https://github.com/apache/datafusion/issues/20324#issuecomment-3941410790
> One other direction I am exploring is to see if morsel-driven execution can help here.
>
> One hypothesis is that filter pushdown pushes more CPU work (especially in the case of dynamic queries) and serial IO (i.e. each individual RowFilter) + some additional overhead so slow / skewed partitions will become even more slow.
>
> With morsel-driven execution we might be able to mitigate this effect, as we can distribute the work better by planning the work using a queue (and so any overhead or file IO latencies will be spread out more).
>
> PoC is here [#20477](https://github.com/apache/datafusion/pull/20477) - it seems it gives quite a bit of speedups on Clickbench(!) (without filter pushdown) though I see some large slowdowns on TPCH SF10 as well, probably as it doesn't benefit much (as far as I remember data / filters are perfectly distributed and files seem to contain many row groups) and probably hurts locality as implemented.

Is the TPC-H regression purely cache locality, or is there queue contention overhead too?

Curious whether the ClickBench speedups hold with filter pushdown enabled; the interaction between morsel scheduling and filter-induced work variance could compound.
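To make the scheduling argument in the quoted text concrete, here is a rough cost-model sketch, not code from the PoC. It compares the finish time (makespan) of fixed per-worker partitions against idle workers pulling morsels from one shared queue. The function names and the per-morsel cost numbers are hypothetical, and the model deliberately ignores cache locality and queue contention, which are exactly the two effects asked about above.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Makespan when each worker owns one fixed, contiguous slice of morsels
/// (a stand-in for "one partition per worker"). Hypothetical helper.
fn static_makespan(costs: &[u64], workers: usize) -> u64 {
    assert!(workers > 0);
    // Ceiling division so that at most `workers` slices are produced.
    let chunk = ((costs.len() + workers - 1) / workers).max(1);
    costs
        .chunks(chunk)
        .map(|slice| slice.iter().sum::<u64>())
        .max()
        .unwrap_or(0)
}

/// Makespan when whichever worker becomes idle first takes the next morsel
/// from a single shared FIFO queue. Hypothetical helper.
fn queue_makespan(costs: &[u64], workers: usize) -> u64 {
    assert!(workers > 0);
    // Min-heap of the times at which each worker becomes free.
    let mut free_at: BinaryHeap<Reverse<u64>> =
        (0..workers).map(|_| Reverse(0)).collect();
    let mut makespan = 0u64;
    for &cost in costs {
        let Reverse(start) = free_at.pop().expect("at least one worker");
        let end = start + cost;
        makespan = makespan.max(end);
        free_at.push(Reverse(end));
    }
    makespan
}
```

With skewed costs such as `[9, 9, 9, 9, 1, 1, 1, 1]` on two workers, the static split leaves one worker with all the expensive morsels, while the queue balances them. With uniform costs both strategies finish at the same time, so under this model the queue can only lose once locality or contention costs are added, which would be consistent with the TPC-H observation.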
