HeartSaVioR edited a comment on pull request #32875:
URL: https://github.com/apache/spark/pull/32875#issuecomment-1023813343
> Avoid shuffle when joining DSv2(s) bucketed in the way different than Spark bucketing.

This already assumes the DSv2 bucketing algorithm can differ from Spark's, and that Spark avoids the shuffle in this case. It is of course a great improvement in general, but in the streaming context, state partitioning is not considered here. State partitioning is not flexible, and we have yet to record the partitioning information in the state metadata, so our major assumption is that state is partitioned by Spark's internal hash function, with the number of shuffle partitions as the number of partitions. If there is any case where that assumption can be broken, we do not yet allow that case for streaming queries. That said, there is room for improvement.

> So I think if data source bucketed as same as join keys, stream-stream join should be able to avoid shuffle before this PR.
> After this PR, the stream-stream join behavior is not changed, it should still be able to avoid shuffle.

I meant that a shuffle must be performed for a stream-stream join if the source doesn't follow Spark's internal hash function, in order to retain the major assumption above. The same holds for the other stateful operators. That said, we may have already broken these cases, since we didn't change those operators in this PR. I assume it was due to the interchangeability between `ClusteredDistribution` and `HashClusteredDistribution`: in older versions of Spark I found no difference between the two.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
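The "major assumption" about state partitioning can be sketched as follows. This is a toy illustration, not Spark code: Spark's actual `HashPartitioning.partitionIdExpression` applies Spark's own Murmur3 hash to the grouping keys and takes `pmod` by the partition count; here the stdlib `MurmurHash3` stands in for that hash, and `statePartition` is a hypothetical helper name.

```scala
import scala.util.hashing.MurmurHash3

// Sketch: how a stateful operator's grouping key maps to a state
// partition under the assumption discussed above. NOTE: Spark's real
// hash is its own Murmur3 over the key row (HashPartitioning), not
// the stdlib MurmurHash3 used here; this only models the shape
// pmod(hash(keys), numShufflePartitions).
def statePartition(key: String, numShufflePartitions: Int): Int = {
  val h = MurmurHash3.stringHash(key)
  // non-negative modulo, mirroring Spark's `pmod` semantics
  ((h % numShufflePartitions) + numShufflePartitions) % numShufflePartitions
}
```

Because the state store for partition *i* only ever holds keys with `statePartition(key, n) == i`, a source bucketed by any other hash function would route rows to the wrong state partitions unless a shuffle re-partitions them first.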