c21 commented on pull request #32875:
URL: https://github.com/apache/spark/pull/32875#issuecomment-1023726415
I think this PR introduced two things:
* Avoid shuffle when joining DSv2 sources bucketed in a way different from
Spark's bucketing.
  * NOTE: I think this PR itself does not complete the feature; more changes
are needed around `DataSourcePartitioning`.
* Avoid shuffle when joining data sources bucketed on a subset of the join key(s).
  * This feature is disabled by default by a follow-up:
https://github.com/apache/spark/pull/35138 .
So, speaking of the current master branch, I feel this should not break
anything for streaming use cases.
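To make the second bullet concrete, here is a small sketch (my own toy example, not Spark code; the modulo "hash" and the sample rows are assumptions for illustration) of why bucketing on a subset of the join keys can still avoid a shuffle: any two rows that match on the full join key `(id, ts)` necessarily agree on `id`, so if both sides bucket on `id` with the same function, matching rows are already co-located.

```python
NUM_BUCKETS = 4

def bucket(id_: int) -> int:
    # Toy bucketing function shared by both sides; Spark itself would use
    # Murmur3. Only the key subset (id) feeds the bucket, not the full join key.
    return id_ % NUM_BUCKETS

left  = [(1, 10, "a"), (2, 20, "b"), (5, 10, "c")]   # (id, ts, payload)
right = [(1, 10, "x"), (5, 10, "y"), (2, 99, "z")]

# Every pair of rows that joins on the full key (id, ts) lands in the same
# bucket on both sides, so no shuffle is needed to pair them up.
for lid, lts, _ in left:
    for rid, rts, _ in right:
        if (lid, lts) == (rid, rts):
            assert bucket(lid) == bucket(rid)  # co-located
```

The catch, of course, is that fewer distinct bucketing keys can mean coarser parallelism, which is part of why the follow-up gated this behind a flag.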
> when the distribution (bucket) keys exactly match join keys, so it seems
the potential issue you mentioned above could occur before this PR too?
> Would using HashClusteredDistribution "force" using Spark's internal hash
function on distribution?
`HashClusteredDistribution` forces the use of Spark's internal hash function,
`Murmur3Hash`. Before this PR, the only way to bucket a table was to use
`HashPartitioning`, which has a hardcoded hash function in
`HashPartitioning.partitionIdExpression`. So I think if a data source is
bucketed the same way as the join keys, a stream-stream join should be able to
avoid the shuffle before this PR. After this PR, the stream-stream join
behavior is unchanged; it should still be able to avoid the shuffle.
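As a rough illustration of why the hash function matters here (a toy sketch, not Spark's actual Murmur3 or any real source's bucketing; both hash functions below are stand-ins I made up): if the two join sides bucket the same key with *different* hash functions, the same key can land in different partition ids, so the sides are not truly co-partitioned and a shuffle is required for correctness.

```python
def partition_spark_style(key: str, num_partitions: int) -> int:
    # Stand-in for pmod(hash(key), numPartitions), the shape of
    # HashPartitioning.partitionIdExpression. Toy hash: Java-style 31x + c.
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h % num_partitions

def partition_source_style(key: str, num_partitions: int) -> int:
    # A data source with its own bucketing scheme. Toy hash: sum of code points.
    return sum(ord(ch) for ch in key) % num_partitions

# The same key routed by the two schemes can disagree on the partition id,
# which is exactly the situation where a shuffle-free join would be wrong.
k = "user-1"
a = partition_spark_style(k, 8)
b = partition_source_style(k, 8)
print(k, a, b, a != b)
```

This is why reporting the source's partitioning to Spark (rather than assuming Murmur3 everywhere) is needed for the first bullet above to be safe.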
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]