nathanb9 commented on PR #21995: URL: https://github.com/apache/datafusion/pull/21995#issuecomment-4401744429
I did some more testing and groupjoin has more opportunities for optimization BUT they require things like more accurate filter cardinality estimates, and cost based join ordering data to apply them safely without regressions. So I don't think the additional optimizations (https://github.com/apache/datafusion/pull/22058) can be introduced with any guarantees rn. BUT still really think like the safe groupjoin optimization opportunity because the common pattern which is the simple fact + dimension equi join followed by a groupby gets nice performance benefits by avoiding the extra work. @Dandandan @2010YOUY01? any thoughts you guys have about this? Im considering taking PR out of draft but wondering if this has other implications in datafusion I should test out? Also, I listed out when its safe to introduce it into plan in DF with guarantees that performance cant regress. <details> <summary>Required to introduce basic GroupJoin into plan which will just memoize (NO ADDITIONAL optimization like eager join....):</summary> 1. AggregateExec directly above HashJoinExec 2. Aggregate mode is Single, SinglePartitioned, or Partial 3. At least one aggregate function 4. No GROUPING SETS 5. Join type is Inner or Left 6. Equi-join only 7. GROUP BY keys include the join equi-keys 8. Extra GROUP BY keys come from the build (left) side only Required to avoid regression (basic memoizing GroupJoin): 1. All aggregates support GroupsAccumulator 2. No ORDER BY on aggregates 3. Not a trivial COUNT(*) on Inner join 4. High build-side utilization bc most build-side hash table entries must get matched by probe rows. Otherwise the hash table is larger than the separate plan's aggregate table and you lose. Item 4 is the only real performance guard for basic memoizing. Since we can't easily figure out how selective the join. We can only allow memoizing for Left joins (guaranteed 100% utilization) or Inner joins where NDV suggests high match rate possibly</details> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
