sunchao edited a comment on pull request #32875: URL: https://github.com/apache/spark/pull/32875#issuecomment-1024493862
> Then we restart the streaming query with the flag on, and the 2 tables report hash partitioning (not the same as Spark's murmur3).

One question @cloud-fan: was this already a correctness issue previously? Say one side of a join reports `HashPartitioning` with a non-murmur3 hash while the other side reports `HashPartitioning` with the murmur3 hash (for instance, because there is a Spark shuffle operator between the data source scan and the join). I wonder if the issue can happen even if data sources report `HashPartitioning` with Spark's murmur3 hash.

Thanks @HeartSaVioR for your comments, duly noted. Let me bring back `HashClusteredDistribution` then. I'll also add more comments to make it more future-proof and to make clear that no partitioning other than `HashPartitioning` can satisfy it. Would you please provide a test suite for this potential issue?

> Seems like DataSourcePartitioning doesn't allow the partitioning from data source to satisfy HashClusteredDistribution - it only checks with ClusteredDistribution.

That's correct.
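
To make the concern about mismatched hash functions concrete, here is a minimal toy sketch. It is not Spark code: `hash_a`, `hash_b`, and `colocated_join` are hypothetical stand-ins (neither function is murmur3 or any real data source's hash), and the keys are plain integers. It only illustrates why two sides that both claim to be "hash partitioned" cannot be joined partition-by-partition unless they agree on the hash function.

```python
NUM_PARTITIONS = 4


def hash_a(key: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Hypothetical stand-in for one side's partition function."""
    return key % num_partitions


def hash_b(key: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Hypothetical stand-in for a different, incompatible partition function."""
    return (key * 7 + 3) % num_partitions


def colocated_join(left_keys, right_keys, left_fn, right_fn, n=NUM_PARTITIONS):
    """Join partition i of the left side only with partition i of the right side.

    This mimics a join that skips the shuffle because both sides claim
    to be hash partitioned on the join key.
    """
    left_parts = [set() for _ in range(n)]
    right_parts = [set() for _ in range(n)]
    for k in left_keys:
        left_parts[left_fn(k, n)].add(k)
    for k in right_keys:
        right_parts[right_fn(k, n)].add(k)

    matched = set()
    for left, right in zip(left_parts, right_parts):
        matched |= left & right  # keys found in the same partition index on both sides
    return matched


if __name__ == "__main__":
    keys = list(range(8))
    # Same function on both sides: every key meets its match.
    print(sorted(colocated_join(keys, keys, hash_a, hash_a)))  # [0, 1, 2, 3, 4, 5, 6, 7]
    # Different functions: matching keys land in different partitions and are lost.
    print(sorted(colocated_join(keys, keys, hash_a, hash_b)))  # []
```

In this toy setup the two functions never agree on a partition index, so the second join silently returns nothing. With real hash functions the loss would typically be partial rather than total, which makes it harder to notice.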