bersprockets opened a new pull request, #52135:
URL: https://github.com/apache/spark/pull/52135

   ### What changes were proposed in this pull request?
   
   This is a back-port of both #52067 and #52128.
   
   After 
https://github.com/apache/spark/commit/e861b0d93722f76cc103c05c7992c22c7fa23ad6,
 shuffle hash join for left semi/anti/existence will ignore duplicate keys if 
the join condition is empty or refers to the same parent attributes as the join 
keys. This PR proposes that duplicate keys should be ignored only when the join 
condition has these properties:
   
   1. a subtree that is a semantic match to a build-side key, and/or
   1. all attributes, outside of any subtree that is a semantic match to a 
build-side join key, should be from the stream-side.
   
   ### Why are the changes needed?
   
   
https://github.com/apache/spark/commit/e861b0d93722f76cc103c05c7992c22c7fa23ad6 
causes a correctness issue when a column is transformed in the build-side join 
keys and also transformed, but differently, in a join condition. As an example:
   ```
   create or replace temp view data(a) as values
   ("xxxx1111"),
   ("yyyy2222");
   
   create or replace temp view lookup(k) as values
   ("xxxx22"),
   ("xxxx33"),
   ("xxxx11");
   
   -- this returns one row
   select *
   from data
   left semi join lookup
   on substring(a, 1, 4) = substring(k, 1, 4)
   and substring(a, 1, 6) >= k;
   
   -- this is the same query as above, but with a shuffle hash join hint, and 
returns no rows
   select /*+ SHUFFLE_HASH(lookup) */ *
   from data
   left semi join lookup
   on substring(a, 1, 4) = substring(k, 1, 4)
   and substring(a, 1, 6) >= k;
   ```
   When the join uses broadcast hash join, the hashrelation of lookup has the 
following key -> values:
   ```
   Key xxxx:
     xxxx11
     xxxx33
     xxxx22
   ```
   The join condition matches on the build side row with the value `xxxx11`.
   
   When the join uses shuffle hash join, on the other hand, the hash relation 
of lookup has the following key -> values:
   ```
   Key xxxx:
     xxxx22
   ```
   Because the keys must be unique, an arbitrary row is chosen to represent the 
key, and that row does not match the join condition.
   
   After 
https://github.com/apache/spark/commit/1f35577a3ead9c6268b5ba47c2e3aec60484e3cc,
 a similar issue happens with integer keys:
   ```
   create or replace temp view data(a) as values
   (10000),
   (30000);
   
   create or replace temp view lookup(k) as values
   (1000),
   (1001),
   (1002),
   (1003),
   (1004);
   
   -- this query returns one row
   select * from data left semi join lookup on a/10000 = cast(k/1000 as int) 
and k >=  a/10 + 3;
   
   -- this is the same query as above, but with a shuffle hash join hint, and 
returns no rows
   select /*+ SHUFFLE_HASH(lookup) */ * from data left semi join lookup on 
a/10000 = cast(k/1000 as int) and k >=  a/10 + 3;
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, except for fixing the correctness issue.
   
   ### How was this patch tested?
   
   Modified an existing unit test.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to