[GitHub] [spark] c21 commented on pull request #35552: [SPARK-38237][SQL][SS] Introduce a new config to require all cluster keys on Aggregate

GitBox Fri, 18 Feb 2022 10:16:05 -0800


c21 commented on pull request #35552:
URL: https://github.com/apache/spark/pull/35552#issuecomment-1044953786



   > for (1) and (2), undesirable situation can happen beyond the two. E.g.,
   > the skew raised in a join output for a many-to-many join;
   
   @sigmod - I agree the join output can have data skew. If we talk about an
aggregate that follows a join on a subset of its keys (`join(t1.x = t2.x)`
followed by `aggregate(t1.x, t1.y)`), the partial aggregate would again be the
major cost, same as in the example in
https://github.com/apache/spark/pull/35552#issuecomment-1044101219 . I am
not sure whether the feature introduced here actually fixes that problem.
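
   To make the concern concrete, here is a toy, pure-Python sketch (not Spark
code). It assumes the join output stays partitioned by `x` and that the partial
aggregate runs on each partition before any extra shuffle. The table contents,
`NUM_PARTITIONS`, and the helper names are made up for illustration.

```python
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_of(x):
    # Stand-in for hash partitioning on the join key x.
    return x % NUM_PARTITIONS


def join_output_by_partition(t1, t2):
    """Inner-join t1 rows (x, y) with t2 rows (x,) on x, bucketed by partition of x."""
    t2_counts = Counter(x for (x,) in t2)
    parts = {p: [] for p in range(NUM_PARTITIONS)}
    for x, y in t1:
        # Each t1 row is emitted once per matching t2 row (many-to-many).
        parts[partition_of(x)].extend([(x, y)] * t2_counts[x])
    return parts


def partial_aggregate(rows):
    """Per-partition count grouped by (x, y), done before any further shuffle."""
    return Counter(rows)


# Key x = 0 is hot on both sides; keys 1..3 appear once on each side.
t1 = [(0, y) for y in range(100)] + [(x, 0) for x in range(1, 4)]
t2 = [(0,)] * 100 + [(x,) for x in range(1, 4)]

parts = join_output_by_partition(t1, t2)
rows_in = {p: len(rows) for p, rows in parts.items()}
rows_out = {p: len(partial_aggregate(rows)) for p, rows in parts.items()}

print(rows_in)   # {0: 10000, 1: 1, 2: 1, 3: 1}
print(rows_out)  # {0: 100, 1: 1, 2: 1, 3: 1}
```

   In this sketch the task for partition 0 still has to consume all 10000
join-output rows in the partial aggregate. Redistributing on `(x, y)` afterwards
only spreads the 100 partially aggregated rows, which is why the partial
aggregate looks like the dominant cost here.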


