peasee opened a new issue, #18936:
URL: https://github.com/apache/datafusion/issues/18936

   ### Is your feature request related to a problem or challenge?
   
   I have a scenario where I'm joining some large datasets together, and the 
values that the hash join collects for its dynamic filters are very far apart. 
As a result, more data is scanned from the right side of the join than is 
necessary.
   
   For example, imagine a query involving TPCH tables `orders` and `lineitem`. 
`orders` is the left side of the join and returns 2 rows: one row with 
`o_orderkey=1` and the second with `o_orderkey=5900000`. These are simplified 
to a range filter like `l_orderkey >= 1 AND l_orderkey <= 5900000`, so nearly 
the entirety of the `lineitem` table is scanned when reading only the rows for 
those 2 order keys would've sufficed.
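
   To make the difference concrete, here is a minimal sketch in plain Rust. It 
uses no DataFusion types, all function names are hypothetical, and the probe 
side is a toy stand-in for `lineitem` with one row per order key (an assumption 
made purely for illustration):

```rust
use std::collections::HashSet;

/// Min-max style bounds: keeps only the smallest and largest build-side key.
fn min_max_bounds(keys: &[i64]) -> Option<(i64, i64)> {
    let min = *keys.iter().min()?;
    let max = *keys.iter().max()?;
    Some((min, max))
}

/// Counts probe-side keys that pass a `>= min AND <= max` range filter.
fn rows_passing_range(probe: &[i64], bounds: (i64, i64)) -> usize {
    probe
        .iter()
        .filter(|k| **k >= bounds.0 && **k <= bounds.1)
        .count()
}

/// Counts probe-side keys that pass an exact membership (`IN`) filter.
fn rows_passing_in_list(probe: &[i64], keys: &[i64]) -> usize {
    let set: HashSet<i64> = keys.iter().copied().collect();
    probe.iter().filter(|k| set.contains(*k)).count()
}

fn main() {
    // The two keys returned by the left (build) side in the example above.
    let build_keys = [1_i64, 5_900_000];
    // Toy probe side: one row per key from 1 to 6,000,000 (assumed data).
    let probe: Vec<i64> = (1..=6_000_000).collect();

    let bounds = min_max_bounds(&build_keys).expect("build side is non-empty");
    println!("rows passing range filter: {}", rows_passing_range(&probe, bounds));
    println!("rows passing IN filter:    {}", rows_passing_in_list(&probe, &build_keys));
}
```

   With this toy data the range filter lets through every key between the two 
extremes, while the membership filter lets through only the two keys that can 
actually join.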
   
   ### Describe the solution you'd like
   
   I would like `CollectLeftAccumulator` to be a trait, so that I can override 
it using a physical plan optimizer by replacing the `HashJoinExec` with a new 
`HashJoinExec` containing my custom left side accumulator.
   
   The default accumulator would remain the min-max style accumulator that 
currently exists.
   
   Context: my custom left-side accumulator at the moment simply constructs an 
`IN` expression of the identified keys (e.g. `IN (1, 5900000)`). I am aware the 
pruning predicate builder does not support `IN` expressions with more than 20 
elements. I am using my own custom pruner, which supports larger `IN` 
expressions, so this is not a problem for me.
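
   As a rough illustration of the shape I have in mind, here is a 
self-contained sketch in plain Rust. The trait, the filter enum, and both 
accumulators are hypothetical stand-ins written for this example; they are not 
the existing DataFusion types and not necessarily the API used in my fork:

```rust
use std::collections::BTreeSet;

/// Hypothetical filter that the build side hands to the probe side.
#[derive(Debug, PartialEq)]
enum DynamicFilter {
    /// `key >= min AND key <= max`
    Range { min: i64, max: i64 },
    /// `key IN (...)`
    InList(Vec<i64>),
    /// The build side produced no keys.
    Empty,
}

/// Hypothetical trait: one accumulator per join, fed build-side key batches.
trait LeftAccumulator {
    fn update(&mut self, keys: &[i64]);
    fn finish(&self) -> DynamicFilter;
}

/// Default behaviour: track only the minimum and maximum key.
#[derive(Default)]
struct MinMaxAccumulator {
    bounds: Option<(i64, i64)>,
}

impl LeftAccumulator for MinMaxAccumulator {
    fn update(&mut self, keys: &[i64]) {
        for &k in keys {
            self.bounds = Some(match self.bounds {
                Some((lo, hi)) => (lo.min(k), hi.max(k)),
                None => (k, k),
            });
        }
    }

    fn finish(&self) -> DynamicFilter {
        match self.bounds {
            Some((min, max)) => DynamicFilter::Range { min, max },
            None => DynamicFilter::Empty,
        }
    }
}

/// Custom behaviour: remember every distinct key and emit an `IN` list.
#[derive(Default)]
struct InListAccumulator {
    keys: BTreeSet<i64>,
}

impl LeftAccumulator for InListAccumulator {
    fn update(&mut self, keys: &[i64]) {
        self.keys.extend(keys.iter().copied());
    }

    fn finish(&self) -> DynamicFilter {
        if self.keys.is_empty() {
            DynamicFilter::Empty
        } else {
            DynamicFilter::InList(self.keys.iter().copied().collect())
        }
    }
}

/// The join only sees the trait, so either accumulator can be plugged in.
fn build_filter(mut acc: Box<dyn LeftAccumulator>, batches: &[Vec<i64>]) -> DynamicFilter {
    for batch in batches {
        acc.update(batch);
    }
    acc.finish()
}

fn main() {
    let batches = vec![vec![1_i64], vec![5_900_000]];
    println!("{:?}", build_filter(Box::new(MinMaxAccumulator::default()), &batches));
    println!("{:?}", build_filter(Box::new(InListAccumulator::default()), &batches));
}
```

   The point of the sketch is only that the join depends on the trait rather 
than on one concrete accumulator, so a physical optimizer rule could swap the 
implementation without touching the join logic.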
   
   ### Describe alternatives you've considered
   
   One could argue that we should instead consider alternative bounds 
accumulation techniques, like clustering. However, in my opinion, making the 
left side accumulator a trait would make it simpler to implement and test 
future changes like these, since each would just be another style of 
accumulator.
   
   ### Additional context
   
   I have already implemented this in a fork: 
https://github.com/spiceai/datafusion/pull/116
   
   My proposal is essentially to upstream this PR into DataFusion `trunk`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]