[GitHub] [spark] cloud-fan commented on pull request #32875: [SPARK-35703][SQL] Relax constraint for bucket join and remove HashClusteredDistribution

GitBox Thu, 27 Jan 2022 18:33:51 -0800


cloud-fan commented on pull request #32875:
URL: https://github.com/apache/spark/pull/32875#issuecomment-1023829161



   I think this is a potential bug. Say we have 2 tables that can optionally 
report hash partitioning (e.g. controlled by a flag). Assume a streaming query 
is first run with the flag off, so the tables do not report hash partitioning. 
Spark then adds shuffles before the stream-stream join, and the join state 
(streaming checkpoint) is partitioned by Spark's murmur3 hash function. Then we 
restart the streaming query with the flag on, and the 2 tables report hash 
partitioning (not the same as Spark's murmur3). This time Spark will not add 
shuffles before the stream-stream join, which leads to wrong results, because 
the left/right join children are not co-partitioned with the join state from 
the previous run.
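
   To make the failure mode concrete, here is a toy sketch in plain Python 
(not Spark code). Both partitioners are made-up stand-ins: 
spark_like_partition is NOT murmur3, and source_partition is just an assumed 
example of a different hash scheme reported by the tables. The per-partition 
"state" is a simplification of the join state store.

```python
NUM_PARTITIONS = 4

def spark_like_partition(key: int) -> int:
    # Hypothetical stand-in for Spark's murmur3-based partitioning.
    return (3 * key) % NUM_PARTITIONS

def source_partition(key: int) -> int:
    # Hypothetical stand-in for the tables' own hash partitioning.
    return key % NUM_PARTITIONS

def build_state(keys, partitioner):
    # First run: each key's join state lands in the partition chosen
    # by the partitioner that was in effect at that time.
    state = {p: set() for p in range(NUM_PARTITIONS)}
    for k in keys:
        state[partitioner(k)].add(k)
    return state

def join_against_state(keys, partitioner, state):
    # Later run: a row is routed to a partition and can only see the
    # state stored in that same partition.
    return [k for k in keys if k in state[partitioner(k)]]

keys = list(range(8))

# Run 1 (flag off): shuffle added, state partitioned the Spark-like way.
state = build_state(keys, spark_like_partition)

# Run 2 with a shuffle: input is co-partitioned with the state.
matched_with_shuffle = join_against_state(keys, spark_like_partition, state)

# Run 2 without a shuffle (flag on): input follows the source partitioning.
matched_without_shuffle = join_against_state(keys, source_partition, state)

print(matched_with_shuffle)     # every key finds its state
print(matched_without_shuffle)  # only keys where both schemes agree
```

   In this sketch the restarted run without the shuffle silently misses every 
key whose two partition ids differ, which is the wrong-result case described 
above.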


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


