HeartSaVioR edited a comment on pull request #32875:
URL: https://github.com/apache/spark/pull/32875#issuecomment-1023813343
> Avoid shuffle when joining DSv2(s) bucketed in the way different than Spark bucketing.

This already assumes the DSv2 bucketing algorithm can differ from Spark's, and that Spark avoids the shuffle in this case. It is of course a great improvement in general, but in the streaming context, state partitioning is not considered here. State partitioning is not flexible, and we have yet to record the partitioning information in the state metadata, so our major assumption is that state is partitioned by Spark's internal hash function, with the number of shuffle partitions as the number of partitions. If there is any case where that assumption can be broken, we do not yet allow that case for streaming queries. That said, there is room for improvement.

> So I think if data source bucketed as same as join keys, stream-stream join should be able to avoid shuffle before this PR.
> After this PR, the stream-stream join behavior is not changed, it should still be able to avoid shuffle.

I meant that a shuffle must be performed for a stream-stream join if the source doesn't follow Spark's internal hash function, in order to retain the major assumption above. The same holds for the other stateful operators. That said, we may have already broken these cases, since we didn't change those operators in this PR. I assume it was due to the interchangeability between `ClusteredDistribution` and `HashClusteredDistribution`: in older versions of Spark I found no difference between the two.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
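The "major assumption" about state partitioning can be sketched as follows. This is a toy illustration, not Spark code: Spark's actual `HashPartitioning.partitionIdExpression` applies Spark's own Murmur3 hash to the grouping keys and takes `pmod` by the partition count; here the stdlib `MurmurHash3` stands in for that hash, and `statePartition` is a hypothetical helper name.

```scala
import scala.util.hashing.MurmurHash3

// Sketch: how a stateful operator's grouping key maps to a state
// partition under the assumption discussed above. NOTE: Spark's real
// hash is its own Murmur3 over the key row (HashPartitioning), not
// the stdlib MurmurHash3 used here; this only models the shape
// pmod(hash(keys), numShufflePartitions).
def statePartition(key: String, numShufflePartitions: Int): Int = {
  val h = MurmurHash3.stringHash(key)
  // non-negative modulo, mirroring Spark's `pmod` semantics
  ((h % numShufflePartitions) + numShufflePartitions) % numShufflePartitions
}
```

Because the state store for partition *i* only ever holds keys with `statePartition(key, n) == i`, a source bucketed by any other hash function would route rows to the wrong state partitions unless a shuffle re-partitions them first.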