msirek commented on PR #8271: URL: https://github.com/apache/arrow-datafusion/pull/8271#issuecomment-1924646196
> I took the liberty of merging the code in the PR up from main. Thank you! > Some of the plans look like they have regressed and are no longer using HashJoin with equality predicates. > Perhaps we can use https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.HashJoinExec.html#method.null_equals_null or similar to mark the joins correctly and still use eqi-joins but with different null matching semantics The semantics of `null_equals_null` can't really be re-used here because it only works for comparing 2 nulls, as I understand it. https://github.com/apache/arrow-datafusion/blob/d594e6257b34a5ad47112e26d41516aaeb19e6dd/datafusion/physical-plan/src/joins/hash_join.rs#L1081-L1084 https://arrow.apache.org/rust/arrow/compute/kernels/cmp/fn.not_distinct.html For example, in this test case: ```sql SELECT 'Found' WHERE 1 NOT IN (NULL, 2, 3); 0 rows in set. ``` The result of evaluating "1 IS NOT DISTINCT FROM NULL" is FALSE, the same as when comparing 1 with 2 and 3, so the NOT IN result would be TRUE (but should be NULL). There is a good description of the needed semantics in [null-aware antijoin](https://facebookincubator.github.io/velox/develop/anti-join.html), referenced in #8241. For the general case: ```sql SELECT * FROM t1 WHERE nullable_column1 NOT IN (SELECT nullable_column2 FROM t2); ``` The presence of any NULL in the subquery means the predicate should always evaluate to NULL for every outer row, so the antijoin would need to be taught to detect the presence of a NULL in the subquery and return no rows. Similarly, the presence of NULL in an outer row could short circuit and avoid joining that row with the subquery (always evaluates to NULL). For multicolumn NOT IN it would be more complicated. If we don't want any plan regressions, a null-aware flavor of hash antijoin seems to be needed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
