c21 commented on pull request #32875:
URL: https://github.com/apache/spark/pull/32875#issuecomment-1023726415
I think this PR introduced two things:
* Avoid shuffle when joining DSv2 sources bucketed in a way different from
Spark's bucketing.
  * NOTE: I think this PR itself does not complete the feature; more changes
are needed around `DataSourcePartitioning`.
* Avoid shuffle when joining data sources bucketed on a subset of the join key(s).
  * This feature is disabled by default by a follow-up:
https://github.com/apache/spark/pull/35138 .
So, speaking of the current master branch, I feel this should not break
anything for streaming use cases.
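To make the second bullet concrete, here is a small sketch (my own toy example, not Spark code; the modulo "hash" and the sample rows are assumptions for illustration) of why bucketing on a subset of the join keys can still avoid a shuffle: any two rows that match on the full join key `(id, ts)` necessarily agree on `id`, so if both sides bucket on `id` with the same function, matching rows are already co-located.

```python
NUM_BUCKETS = 4

def bucket(id_: int) -> int:
    # Toy bucketing function shared by both sides; Spark itself would use
    # Murmur3. Only the key subset (id) feeds the bucket, not the full join key.
    return id_ % NUM_BUCKETS

left  = [(1, 10, "a"), (2, 20, "b"), (5, 10, "c")]   # (id, ts, payload)
right = [(1, 10, "x"), (5, 10, "y"), (2, 99, "z")]

# Every pair of rows that joins on the full key (id, ts) lands in the same
# bucket on both sides, so no shuffle is needed to pair them up.
for lid, lts, _ in left:
    for rid, rts, _ in right:
        if (lid, lts) == (rid, rts):
            assert bucket(lid) == bucket(rid)  # co-located
```

The catch, of course, is that fewer distinct bucketing keys can mean coarser parallelism, which is part of why the follow-up gated this behind a flag.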
> when the distribution (bucket) keys exactly match join keys, so it seems
the potential issue you mentioned above could occur before this PR too?
> Would using HashClusteredDistribution "force" using Spark's internal hash
function on distribution?
`HashClusteredDistribution` forces the use of Spark's internal hash function,
`Murmur3Hash`. Before this PR, the only way to bucket a table was to use
`HashPartitioning`, which has a hardcoded hash function in
`HashPartitioning.partitionIdExpression`. So I think if a data source is
bucketed the same way as the join keys, a stream-stream join should be able to
avoid the shuffle before this PR. After this PR, the stream-stream join
behavior is unchanged; it should still be able to avoid the shuffle.
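As a rough illustration of why the hash function matters here (a toy sketch, not Spark's actual Murmur3 or any real source's bucketing; both hash functions below are stand-ins I made up): if the two join sides bucket the same key with *different* hash functions, the same key can land in different partition ids, so the sides are not truly co-partitioned and a shuffle is required for correctness.

```python
def partition_spark_style(key: str, num_partitions: int) -> int:
    # Stand-in for pmod(hash(key), numPartitions), the shape of
    # HashPartitioning.partitionIdExpression. Toy hash: Java-style 31x + c.
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h % num_partitions

def partition_source_style(key: str, num_partitions: int) -> int:
    # A data source with its own bucketing scheme. Toy hash: sum of code points.
    return sum(ord(ch) for ch in key) % num_partitions

# The same key routed by the two schemes can disagree on the partition id,
# which is exactly the situation where a shuffle-free join would be wrong.
k = "user-1"
a = partition_spark_style(k, 8)
b = partition_source_style(k, 8)
print(k, a, b, a != b)
```

This is why reporting the source's partitioning to Spark (rather than assuming Murmur3 everywhere) is needed for the first bullet above to be safe.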
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]