[GitHub] [spark] imback82 commented on pull request #29655: [SPARK-32806][SQL] SortMergeJoin with partial hash distribution can be optimized to remove shuffle

GitBox Fri, 11 Sep 2020 16:45:19 -0700


imback82 commented on pull request #29655:
URL: https://github.com/apache/spark/pull/29655#issuecomment-691356366



   > My first thought is like the concerns as same as @hvanhovell in the 
previous discussion.
   
   Is the concern about data skew, or are there other concerns? I 
couldn't find anything more in the discussion.
   
   The main scenario this PR targets is allowing bucketed tables to be 
used by more workloads. Since bucketed tables are created by users, we have 
rarely observed cases where users pick bucket columns with low cardinality - 
similar to how users pick partition columns.
   
   I could make the rule more restrictive by checking that the sources are 
bucketed tables. (Btw, if this approach is fine, I could extend the rule to 
support `PartitioningCollection`, still making sure that the sources are 
bucketed tables, which would help remove shuffles in nested joins.) WDYT?
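
To make the trade-off concrete, here is a minimal sketch in plain Python (not Spark code). It assumes both sides are bucketed on a subset of the join keys with the same bucket count. `bucketize` and `bucket_local_join` are hypothetical helpers, and Python's built-in `hash` stands in for Spark's actual bucketing hash. Rows that agree on all join keys also agree on the bucket column, so they land in the same bucket and can be joined without a shuffle. A low-cardinality bucket column, however, concentrates rows in a few buckets, which is the skew concern.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count; must be identical on both sides


def bucketize(rows, bucket_cols, num_buckets=NUM_BUCKETS):
    """Assign each row (a dict) to a bucket by hashing only bucket_cols."""
    buckets = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in bucket_cols)
        # Built-in hash is an illustrative stand-in, not Spark's hash function.
        buckets[hash(key) % num_buckets].append(row)
    return buckets


def bucket_local_join(left, right, join_cols):
    """Join bucket i of the left side only against bucket i of the right side."""
    out = []
    for bucket_id, left_rows in left.items():
        for l_row in left_rows:
            for r_row in right.get(bucket_id, []):
                if all(l_row[c] == r_row[c] for c in join_cols):
                    out.append((l_row, r_row))
    return out


# Both sides are bucketed on "a" only, but the join is on ("a", "b").
left_rows = [{"a": a, "b": b} for a in range(6) for b in range(3)]
right_rows = [{"a": a, "b": b} for a in range(3, 9) for b in range(2)]
joined = bucket_local_join(
    bucketize(left_rows, ["a"]), bucketize(right_rows, ["a"]), ["a", "b"]
)

# Skew case: a single distinct bucket-column value puts every row in one bucket.
skewed = bucketize([{"a": 1, "b": b} for b in range(100)], ["a"])
```

The bucket-local join returns the same pairs as a full join over all rows, while `skewed` shows how a single distinct value in the bucket column leaves every other bucket empty.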
   
   cc: @c21 


