bersprockets opened a new pull request, #52135: URL: https://github.com/apache/spark/pull/52135
### What changes were proposed in this pull request? This is a back-port of both #52067 and #52128. After https://github.com/apache/spark/commit/e861b0d93722f76cc103c05c7992c22c7fa23ad6, shuffle hash join for left semi/anti/existence will ignore duplicate keys if the join condition is empty or refers to the same parent attributes as the join keys. This PR proposes that duplicate keys should be ignored only when the join condition has these properties: 1. a subtree that is a semantic match to a build-side key, and/or 1. all attributes, outside of any subtree that is a semantic match to a build-side join key, should be from the stream-side. ### Why are the changes needed? https://github.com/apache/spark/commit/e861b0d93722f76cc103c05c7992c22c7fa23ad6 causes a correctness issue when a column is transformed in the build-side join keys and also transformed, but differently, in a join condition. As an example: ``` create or replace temp view data(a) as values ("xxxx1111"), ("yyyy2222"); create or replace temp view lookup(k) as values ("xxxx22"), ("xxxx33"), ("xxxx11"); -- this returns one row select * from data left semi join lookup on substring(a, 1, 4) = substring(k, 1, 4) and substring(a, 1, 6) >= k; -- this is the same query as above, but with a shuffle hash join hint, and returns no rows select /*+ SHUFFLE_HASH(lookup) */ * from data left semi join lookup on substring(a, 1, 4) = substring(k, 1, 4) and substring(a, 1, 6) >= k; ``` When the join uses broadcast hash join, the hashrelation of lookup has the following key -> values: ``` Key xxxx: xxxx11 xxxx33 xxxx22 ``` The join condition matches on the build side row with the value `xxxx11`. When the join uses shuffle hash join, on the other hand, the hash relation of lookup has the following key -> values: ``` Key xxxx: xxxx22 ``` Because the keys must be unique, an arbitrary row is chosen to represent the key, and that row does not match the join condition. After https://github.com/apache/spark/commit/1f35577a3ead9c6268b5ba47c2e3aec60484e3cc, a similar issue happens with integer keys: ``` create or replace temp view data(a) as values (10000), (30000); create or replace temp view lookup(k) as values (1000), (1001), (1002), (1003), (1004); -- this query returns one row select * from data left semi join lookup on a/10000 = cast(k/1000 as int) and k >= a/10 + 3; -- this is the same query as above, but with a shuffle hash join hint, and returns no rows select /*+ SHUFFLE_HASH(lookup) */ * from data left semi join lookup on a/10000 = cast(k/1000 as int) and k >= a/10 + 3; ``` ### Does this PR introduce _any_ user-facing change? No, except for fixing the correctness issue. ### How was this patch tested? Modified an existing unit test. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org