msirek commented on PR #8271:
URL: 
https://github.com/apache/arrow-datafusion/pull/8271#issuecomment-1924646196

   > I took the liberty of merging the code in the PR up from main.
   
   Thank you!
   
   > Some of the plans look like they have regressed and are no longer using 
HashJoin with equality predicates.
   > Perhaps we can use 
https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.HashJoinExec.html#method.null_equals_null
 or similar to mark the joins correctly and still use eqi-joins but with 
different null matching semantics
   
   The semantics of `null_equals_null` can't really be re-used here because it 
only works for comparing 2 nulls, as I understand it.
   
https://github.com/apache/arrow-datafusion/blob/d594e6257b34a5ad47112e26d41516aaeb19e6dd/datafusion/physical-plan/src/joins/hash_join.rs#L1081-L1084
   https://arrow.apache.org/rust/arrow/compute/kernels/cmp/fn.not_distinct.html
   
   For example, in this test case:
   ```sql
   SELECT 'Found' WHERE 1 NOT IN (NULL, 2, 3);
   0 rows in set.
   ```
   The result of evaluating "1 IS NOT DISTINCT FROM NULL" is FALSE, the same as 
when comparing 1 with 2 and 3, so the NOT IN result would be TRUE (but should 
be NULL).
   
   There is a good description of the needed semantics in [null-aware 
antijoin](https://facebookincubator.github.io/velox/develop/anti-join.html), 
referenced in #8241.
   
   For the general case:
   ```sql
   SELECT * FROM t1 WHERE nullable_column1 NOT IN (SELECT nullable_column2 FROM 
t2);
   ```
   The presence of any NULL in the subquery means the predicate should always 
evaluate to NULL for every outer row, so the antijoin would need to be taught 
to detect the presence of a NULL in the subquery and return no rows.
   Similarly, the presence of NULL in an outer row could short circuit and 
avoid joining that row with the subquery (always evaluates to NULL).
   For multicolumn NOT IN it would be more complicated.
   
   If we don't want any plan regressions, a null-aware flavor of hash antijoin 
seems to be needed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to