[jira] [Updated] (CALCITE-6927) Join condition remove IS NOT DISTINCT FROM

Zhen Chen (Jira) Sat, 05 Apr 2025 10:42:24 -0700


     [ https://issues.apache.org/jira/browse/CALCITE-6927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Zhen Chen updated CALCITE-6927:
-------------------------------
    Description: 
Following Spark's conversion approach, `x IS NOT DISTINCT FROM y` can be 
rewritten as `(coalesce(x, '') = coalesce(y, '')) and (isnull( x ) = isnull( y 
))`, so that a join with an IS NOT DISTINCT FROM condition can use HashJoin 
instead of NestedLoopJoin when the logical plan is converted to the physical 
plan.  
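
The equivalence behind this rewrite can be checked on a handful of nullable string values. The following is a minimal Python sketch, not Calcite or Spark code: `is_not_distinct_from`, `coalesce`, and `rewritten` are hypothetical helper names, and `None` stands in for SQL NULL. It assumes a string column, which is why `''` works as the coalesce default; other column types would presumably need a default of the matching type.

```python
from itertools import product
from typing import Optional


def is_not_distinct_from(x: Optional[str], y: Optional[str]) -> bool:
    """Null-safe equality: NULL matches NULL, NULL never matches a value."""
    if x is None or y is None:
        return x is None and y is None
    return x == y


def coalesce(value: Optional[str], default: str) -> str:
    """Return the default when the value is NULL (None)."""
    return default if value is None else value


def rewritten(x: Optional[str], y: Optional[str]) -> bool:
    """(coalesce(x, '') = coalesce(y, '')) and (isnull(x) = isnull(y))."""
    return coalesce(x, "") == coalesce(y, "") and (x is None) == (y is None)


# Include NULL and the empty string, the two values the isnull() term
# has to keep apart.
samples = [None, "", "a", "b"]
mismatches = [
    (x, y)
    for x, y in product(samples, repeat=2)
    if is_not_distinct_from(x, y) != rewritten(x, y)
]
print(mismatches)
```

The `isnull` comparison is what keeps NULL and `''` from matching once both have been coalesced to the same value.
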
The SQL is as follows:
{code:sql}
explain 
select t1.age from user_profiles as t1 
join user_profiles t2 
on t1.user_id <=> t2.user_id;
{code}
The Spark plan is as follows:
{code}
AdaptiveSparkPlan isFinalPlan=false
+- Project [age#6]
   +- BroadcastHashJoin [coalesce(user_id#5, ), isnull(user_id#5)], [coalesce(user_id#29, ), isnull(user_id#29)], Inner, BuildRight, false
      :- FileScan orc default.user_profiles[user_id#5,age#6] Batched: true, Bucketed: false (disabled by query planner), DataFilters: [], Format: ORC, Location: InMemoryFileIndex(1 paths)[file:..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<user_id:string,age:int>
      +- BroadcastExchange HashedRelationBroadcastMode(List(coalesce(input[0, string, true], ), isnull(input[0, string, true])),false), [plan_id=72]
         +- FileScan orc default.user_profiles[user_id#29] Batched: true, Bucketed: false (disabled by query planner), DataFilters: [], Format: ORC, Location: InMemoryFileIndex(1 paths)[file:..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<user_id:string>
{code}
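
The plan above joins on the key pair `(coalesce(user_id, ''), isnull(user_id))`, which is what makes a hash lookup possible. The following is a rough Python sketch of that idea under stated assumptions: the rows in `user_profiles` are made-up sample data, and `join_key`, `hash_join`, and `nested_loop_join` are hypothetical names that do not correspond to Calcite or Spark APIs.

```python
from collections import defaultdict


def join_key(user_id):
    """Composite key mirroring coalesce(user_id, '') and isnull(user_id)."""
    return ("" if user_id is None else user_id, user_id is None)


def null_safe_eq(x, y):
    """IS NOT DISTINCT FROM, with None standing in for SQL NULL."""
    if x is None or y is None:
        return x is None and y is None
    return x == y


def hash_join(left, right):
    """Inner self-join on user_id via a hash table; returns the left ages."""
    build = defaultdict(list)
    for row in right:
        build[join_key(row[0])].append(row)
    return [l[1] for l in left for _ in build.get(join_key(l[0]), [])]


def nested_loop_join(left, right):
    """Reference result: compare every pair of rows with null-safe equality."""
    return [l[1] for l in left for r in right if null_safe_eq(l[0], r[0])]


# Hypothetical (user_id, age) rows, including a NULL and an empty user_id.
user_profiles = [("u1", 30), (None, 41), ("", 25), ("u1", 52)]

hashed = hash_join(user_profiles, user_profiles)
looped = nested_loop_join(user_profiles, user_profiles)
print(hashed)
print(looped)
```

Both joins return the same rows, but the hash variant probes one bucket per left row instead of comparing against every right row.
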

  was:
By referring to the conversion method of spark, IS NOT DISTINCT FROM can be 
converted to `(coalesce(x, '') = coalesce(y, '') ) and (isnull(x) = isnull(y))` 
so that the join with IS NOT DISTINCT FROM condition can be used HashJoin 
instead of NestedLoopJoin when converting the logical plan to the physical 
plan.  
The sql is as follows:
{code:java}
explain 
select t1.age from user_profiles as t1 
join user_profiles t2 
on t1.user_id <=> t2.user_id;  {code}
The spark plan is as follows:
{code:java}
AdaptiveSparkPlan isFinalPlan=false
+- Project [age#6]
   +- BroadcastHashJoin [coalesce(user_id#5, ), isnull(user_id#5)], 
[coalesce(user_id#29, ), isnull(user_id#29)], Inner, BuildRight, false
      :- FileScan orc default.user_profiles[user_id#5,age#6] Batched: true, 
Bucketed: false (disabled by query planner), DataFilters: [], Format: ORC, 
Location: InMemoryFileIndex(1 paths)[file:..., PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct<user_id:string,age:int>
      +- BroadcastExchange HashedRelationBroadcastMode(List(coalesce(input[0, 
string, true], ), isnull(input[0, string, true])),false), [plan_id=72]
         +- FileScan orc default.user_profiles[user_id#29] Batched: true, 
Bucketed: false (disabled by query planner), DataFilters: [], Format: ORC, 
Location: InMemoryFileIndex(1 paths)[file:..., PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct<user_id:string>{code}


> Join condition remove IS NOT DISTINCT FROM
> ------------------------------------------
>
>                 Key: CALCITE-6927
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6927
>             Project: Calcite
>          Issue Type: Improvement
>            Reporter: Zhen Chen
>            Assignee: Zhen Chen
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
