c21 commented on pull request #35552: URL: https://github.com/apache/spark/pull/35552#issuecomment-1044953786
> for (1) and (2), undesirable situation can happen beyond the two. E.g., the skew raised in a join output for a many-to-many join;

@sigmod - I agree that the join output can have data skew. If we are talking about an aggregate that follows a join on a subset of the keys (`join(t1.x = t2.x)` followed by `aggregate(t1.x, t1.y)`), the partial aggregate would again be the major cost, the same as in the example in https://github.com/apache/spark/pull/35552#issuecomment-1044101219. I am worried about whether the feature introduced here actually fixes the problem.
