ThomasAWalker opened a new issue, #12299:
URL: https://github.com/apache/datafusion/issues/12299

   ### Is your feature request related to a problem or challenge?
   
   It would be nice if the optimiser could use hash joins when queries contain 
IS NOT DISTINCT FROM - with null_equals_null behaviour
   
   e.g. as it is at the moment 
   
   ```
   SELECT t1.label, t1.v1*t2.v2 FROM t1 JOIN t2 ON t1.label=t2.label;
   ```
   will use a hash join while
   
   ```
   SELECT t1.label, t1.v1*t2.v2 FROM t1 JOIN t2 ON t1.label IS NOT DISTINCT 
FROM t2.label;
   ```
   will use a nested loop
   
   
   below
   
   ```
   > CREATE table t1(label text, v1 int);
   0 row(s) fetched. 
   Elapsed 0.004 seconds.
   
   > CREATE table t2(label text, v2 int);
   0 row(s) fetched. 
   Elapsed 0.002 seconds.
   
   > EXPLAIN SELECT t1.label, t1.v1*t2.v2 FROM t1 JOIN t2 ON t1.label=t2.label; 
   
+---------------+----------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                                                    |
   
+---------------+----------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: t1.label, t1.v1 * t2.v2                        
                                                    |
   |               |   Inner Join: t1.label = t2.label                          
                                                    |
   |               |     TableScan: t1 projection=[label, v1]                   
                                                    |
   |               |     TableScan: t2 projection=[label, v2]                   
                                                    |
   | physical_plan | ProjectionExec: expr=[label@0 as label, v1@1 * v2@2 as 
t1.v1 * t2.v2]                                          |
   |               |   CoalesceBatchesExec: target_batch_size=8192              
                                                    |
   |               |     HashJoinExec: mode=Partitioned, join_type=Inner, 
on=[(label@0, label@0)], projection=[label@0, v1@1, v2@3] |
   |               |       CoalesceBatchesExec: target_batch_size=8192          
                                                    |
   |               |         RepartitionExec: partitioning=Hash([label@0], 16), 
input_partitions=1                                  |
   |               |           MemoryExec: partitions=1, partition_sizes=[0]    
                                                    |
   |               |       CoalesceBatchesExec: target_batch_size=8192          
                                                    |
   |               |         RepartitionExec: partitioning=Hash([label@0], 16), 
input_partitions=1                                  |
   |               |           MemoryExec: partitions=1, partition_sizes=[0]    
                                                    |
   |               |                                                            
                                                    |
   
+---------------+----------------------------------------------------------------------------------------------------------------+
   2 row(s) fetched. 
   Elapsed 0.004 seconds.
   
   > EXPLAIN SELECT t1.label, t1.v1*t2.v2 FROM t1 JOIN t2 ON t1.label IS NOT 
DISTINCT FROM t2.label; 
   
+---------------+--------------------------------------------------------------------------------------+
   | plan_type     | plan                                                       
                          |
   
+---------------+--------------------------------------------------------------------------------------+
   | logical_plan  | Projection: t1.label, t1.v1 * t2.v2                        
                          |
   |               |   Inner Join:  Filter: t1.label IS NOT DISTINCT FROM 
t2.label                        |
   |               |     TableScan: t1 projection=[label, v1]                   
                          |
   |               |     TableScan: t2 projection=[label, v2]                   
                          |
   | physical_plan | ProjectionExec: expr=[label@0 as label, v1@1 * v2@3 as 
t1.v1 * t2.v2]                |
   |               |   RepartitionExec: partitioning=RoundRobinBatch(16), 
input_partitions=1              |
   |               |     NestedLoopJoinExec: join_type=Inner, filter=label@0 IS 
NOT DISTINCT FROM label@1 |
   |               |       MemoryExec: partitions=1, partition_sizes=[0]        
                          |
   |               |       MemoryExec: partitions=1, partition_sizes=[0]        
                          |
   |               |                                                            
                          |
   
+---------------+--------------------------------------------------------------------------------------+
   2 row(s) fetched. 
   Elapsed 0.003 seconds.
   ```
   
   ### Describe the solution you'd like
   
   When optimising the logical plan if ExtractEquijoinPredicate encounters a 
join with no existing conditions in the "on" vector and no results for 
split_eq_and_noneq_join_predicate then do a similar check for "IS NOT DISTINCT 
FROM". If this returns some conditions then push these into the "on" vector and 
set null_equals_null to true.
   
   We've got a change like this on our fork of DF. We also had to add a few 
checks to some optimiser rules to stop them rewriting joins without checking 
null_equals_null.
   
   ### Describe alternatives you've considered
   
   I don't know if this could potentially be left to a later optimisation pass.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to