Re: [PR] Add group join physical optimizer [datafusion]

via GitHub Thu, 07 May 2026 15:51:10 -0700


nathanb9 commented on PR #21995:
URL: https://github.com/apache/datafusion/pull/21995#issuecomment-4401744429


   I did some more testing and groupjoin has more opportunities for 
optimization BUT they require things like more accurate filter cardinality 
estimates, and cost based join ordering data to apply them safely without 
regressions. So I don't think the additional optimizations 
(https://github.com/apache/datafusion/pull/22058) can be introduced with any 
guarantees rn.
   
   BUT still really think like the safe groupjoin optimization opportunity 
because the common pattern which is the simple fact + dimension equi join 
followed by a groupby gets nice performance benefits by avoiding the extra work.
   @Dandandan @2010YOUY01? any thoughts you guys have about this?
   Im considering taking PR out of draft but wondering if this has other 
implications in datafusion I should test out? 
   
   Also, I listed out when its safe to introduce it into plan in DF with 
guarantees that performance cant regress.
   <details>
   <summary>Required to introduce basic GroupJoin into plan which will just 
memoize (NO ADDITIONAL optimization like eager join....):</summary>
   
     1. AggregateExec directly above HashJoinExec
     2. Aggregate mode is Single, SinglePartitioned, or Partial
     3. At least one aggregate function
     4. No GROUPING SETS
     5. Join type is Inner or Left
     6. Equi-join only
     7. GROUP BY keys include the join equi-keys
     8. Extra GROUP BY keys come from the build (left) side only
   
   Required to avoid regression (basic memoizing GroupJoin):
   
     1. All aggregates support GroupsAccumulator
     2. No ORDER BY on aggregates
     3. Not a trivial COUNT(*) on Inner join
     4. High build-side utilization bc most build-side hash table entries must 
get matched by probe rows. Otherwise the hash table is larger than the separate 
plan's aggregate table and you lose.
   
   Item 4 is the only real performance guard for basic memoizing. Since we 
can't easily figure out how selective the join. We can only allow memoizing for 
Left joins (guaranteed 100% utilization) or Inner joins where NDV suggests high 
match rate possibly</details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add group join physical optimizer [datafusion]

Reply via email to