peasee opened a new issue, #18936: URL: https://github.com/apache/datafusion/issues/18936
### Is your feature request related to a problem or challenge? I have a scenario where I'm joining some large datasets together, and the returned values for the dynamic filters from the hash join are very disparate - resulting in more data being scanned than is necessary from the right side of the join. For example, imagine a query involving TPCH tables `orders` and `lineitem`. `orders` is the left side of the join and returns 2 rows - one row with `o_orderkey=1` and the second with `o_orderkey=5900000`. These are simplified to a range filter like `l_orderkey >= 1 AND l_orderkey <= 5900000` - resulting in nearly the entirety of the `lineitem` table being scanned when only 2 rows would've sufficed. ### Describe the solution you'd like I would like `CollectLeftAccumulator` to be a trait, so that I can override it using a physical plan optimizer by replacing the `HashJoinExec` with a new `HashJoinExec` containing my custom left side accumulator. The default accumulator would remain the min-max style accumulator that currently exists. Context: my custom left-side accumulator at the moment simply constructs a `IN` expression of the identified keys (e.g. `IN (1, 5900000)`). I am aware the pruning predicate builder does not support `IN` expressions with more than 20 elements, I am using my own custom pruner which supports larger `IN` expressions so this is not a problem for me. ### Describe alternatives you've considered There could be an argument to be made that we should consider alternative bounds accumulation techniques, like clustering, etc. However, it is my opinion that making the left side accumulator a trait would make it simpler to test and implement future changes like these to add varying styles of accumulators. ### Additional context I have already implemented this in a fork: https://github.com/spiceai/datafusion/pull/116 My proposal would basically be upstreaming this PR into DataFusion `trunk` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
