ThomasAWalker opened a new issue, #12299:
URL: https://github.com/apache/datafusion/issues/12299
### Is your feature request related to a problem or challenge?
It would be nice if the optimiser could use hash joins when queries contain
IS NOT DISTINCT FROM - with null_equals_null behaviour
e.g. as it is at the moment
```
SELECT t1.label, t1.v1*t2.v2 FROM t1 JOIN t2 ON t1.label=t2.label;
```
will use a hash join while
```
SELECT t1.label, t1.v1*t2.v2 FROM t1 JOIN t2 ON t1.label IS NOT DISTINCT
FROM t2.label;
```
will use a nested loop
below
```
> CREATE table t1(label text, v1 int);
0 row(s) fetched.
Elapsed 0.004 seconds.
> CREATE table t2(label text, v2 int);
0 row(s) fetched.
Elapsed 0.002 seconds.
> EXPLAIN SELECT t1.label, t1.v1*t2.v2 FROM t1 JOIN t2 ON t1.label=t2.label;
+---------------+----------------------------------------------------------------------------------------------------------------+
| plan_type | plan
|
+---------------+----------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: t1.label, t1.v1 * t2.v2
|
| | Inner Join: t1.label = t2.label
|
| | TableScan: t1 projection=[label, v1]
|
| | TableScan: t2 projection=[label, v2]
|
| physical_plan | ProjectionExec: expr=[label@0 as label, v1@1 * v2@2 as
t1.v1 * t2.v2] |
| | CoalesceBatchesExec: target_batch_size=8192
|
| | HashJoinExec: mode=Partitioned, join_type=Inner,
on=[(label@0, label@0)], projection=[label@0, v1@1, v2@3] |
| | CoalesceBatchesExec: target_batch_size=8192
|
| | RepartitionExec: partitioning=Hash([label@0], 16),
input_partitions=1 |
| | MemoryExec: partitions=1, partition_sizes=[0]
|
| | CoalesceBatchesExec: target_batch_size=8192
|
| | RepartitionExec: partitioning=Hash([label@0], 16),
input_partitions=1 |
| | MemoryExec: partitions=1, partition_sizes=[0]
|
| |
|
+---------------+----------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.004 seconds.
> EXPLAIN SELECT t1.label, t1.v1*t2.v2 FROM t1 JOIN t2 ON t1.label IS NOT
DISTINCT FROM t2.label;
+---------------+--------------------------------------------------------------------------------------+
| plan_type | plan
|
+---------------+--------------------------------------------------------------------------------------+
| logical_plan | Projection: t1.label, t1.v1 * t2.v2
|
| | Inner Join: Filter: t1.label IS NOT DISTINCT FROM
t2.label |
| | TableScan: t1 projection=[label, v1]
|
| | TableScan: t2 projection=[label, v2]
|
| physical_plan | ProjectionExec: expr=[label@0 as label, v1@1 * v2@3 as
t1.v1 * t2.v2] |
| | RepartitionExec: partitioning=RoundRobinBatch(16),
input_partitions=1 |
| | NestedLoopJoinExec: join_type=Inner, filter=label@0 IS
NOT DISTINCT FROM label@1 |
| | MemoryExec: partitions=1, partition_sizes=[0]
|
| | MemoryExec: partitions=1, partition_sizes=[0]
|
| |
|
+---------------+--------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.003 seconds.
```
### Describe the solution you'd like
When optimising the logical plan if ExtractEquijoinPredicate encounters a
join with no existing conditions in the "on" vector and no results for
split_eq_and_noneq_join_predicate then do a similar check for "IS NOT DISTINCT
FROM". If this returns some conditions then push these into the "on" vector and
set null_equals_null to true.
We've got a change like this on our fork of DF. We also had to add a few
checks to some optimiser rules to stop them rewriting joins without checking
null_equals_null.
### Describe alternatives you've considered
I don't know if this could potentially be left to a later optimisation pass.
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]