[ 
https://issues.apache.org/jira/browse/CALCITE-6927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17939826#comment-17939826
 ] 

Zhen Chen commented on CALCITE-6927:
------------------------------------

I agree with you. If the hash join could natively support IS NOT DISTINCT FROM 
(that is, treat it as an equality condition with special handling for NULL), 
it should be very efficient. But Spark seems unable to do so, so it rewrites 
the condition as shown in the issue description below.
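
To illustrate why the rewrite helps, here is a minimal sketch in Python (not 
Calcite or Spark code). It assumes string keys and uses '' as the coalesce 
default, as in the issue description; the function names are hypothetical. 
Hashing on the pair (coalesce(x, ''), isnull(x)) lets a plain hash join 
return the same rows as a null-safe nested-loop join.

```python
from collections import defaultdict


def null_safe_key(value, default=""):
    # Mirrors the pair (coalesce(x, ''), isnull(x)) from the issue description.
    # The second component keeps NULL apart from a real empty string.
    return (default if value is None else value, value is None)


def hash_join_null_safe(left, right):
    # Build phase: hash the right side on the composite key.
    table = defaultdict(list)
    for r in right:
        table[null_safe_key(r)].append(r)
    # Probe phase: ordinary equality lookup, no special NULL handling needed.
    return [(l, r) for l in left for r in table.get(null_safe_key(l), [])]


def nested_loop_join_null_safe(left, right):
    # Reference implementation of IS NOT DISTINCT FROM, one pair at a time.
    def not_distinct(a, b):
        if a is None or b is None:
            return a is None and b is None
        return a == b

    return [(l, r) for l in left for r in right if not_distinct(l, r)]


left_keys = ["a", None, ""]
right_keys = [None, "a", "", "b"]
hash_result = hash_join_null_safe(left_keys, right_keys)
loop_result = nested_loop_join_null_safe(left_keys, right_keys)
```

In this sketch NULL matches NULL, and NULL does not match the empty string, 
because the isnull component of the key differs.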

> Add rule for join condition remove IS NOT DISTINCT FROM
> -------------------------------------------------------
>
>                 Key: CALCITE-6927
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6927
>             Project: Calcite
>          Issue Type: Improvement
>            Reporter: Zhen Chen
>            Assignee: Zhen Chen
>            Priority: Major
>              Labels: pull-request-available
>
> Following Spark's approach, IS NOT DISTINCT FROM can be rewritten as 
> `(coalesce(x, '') = coalesce(y, '')) and (isnull(x) = isnull(y))`, so that a 
> join with an IS NOT DISTINCT FROM condition can be planned as a HashJoin 
> instead of a NestedLoopJoin when the logical plan is converted to the 
> physical plan.  
> The SQL is as follows:
> {code:java}
> explain 
> select t1.age from user_profiles as t1 
> join user_profiles t2 
> on t1.user_id <=> t2.user_id;  {code}
> The Spark plan is as follows:
> {code:java}
> AdaptiveSparkPlan isFinalPlan=false
> +- Project [age#6]
>    +- BroadcastHashJoin [coalesce(user_id#5, ), isnull(user_id#5)], 
> [coalesce(user_id#29, ), isnull(user_id#29)], Inner, BuildRight, false
>       :- FileScan orc default.user_profiles[user_id#5,age#6] Batched: true, 
> Bucketed: false (disabled by query planner), DataFilters: [], Format: ORC, 
> Location: InMemoryFileIndex(1 paths)[file:..., PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<user_id:string,age:int>
>       +- BroadcastExchange HashedRelationBroadcastMode(List(coalesce(input[0, 
> string, true], ), isnull(input[0, string, true])),false), [plan_id=72]
>          +- FileScan orc default.user_profiles[user_id#29] Batched: true, 
> Bucketed: false (disabled by query planner), DataFilters: [], Format: ORC, 
> Location: InMemoryFileIndex(1 paths)[file:..., PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<user_id:string>{code}
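
As a quick sanity check of the rewrite quoted above, the following sketch 
compares the two predicates over a few string values. It uses Python's 
built-in sqlite3 module purely as a convenient SQL evaluator (an assumption 
of this example, unrelated to Calcite or Spark). SQLite's `IS` operator 
serves as the null-safe comparison, and `x IS NULL` stands in for 
`isnull(x)`. The helper name `compare_predicates` is hypothetical.

```python
import sqlite3
from itertools import product


def compare_predicates(values):
    """Evaluate both predicates for every (x, y) pair drawn from values."""
    conn = sqlite3.connect(":memory:")
    query = (
        "SELECT ? IS ?, "
        "(coalesce(?, '') = coalesce(?, '')) AND ((? IS NULL) = (? IS NULL))"
    )
    rows = []
    for x, y in product(values, repeat=2):
        # Bind x and y once per placeholder pair in the query above.
        null_safe, rewritten = conn.execute(query, (x, y, x, y, x, y)).fetchone()
        rows.append((x, y, bool(null_safe), bool(rewritten)))
    conn.close()
    return rows


results = compare_predicates(["a", "", None])
mismatches = [row for row in results if row[2] != row[3]]
```

For these inputs the two predicates agree on every pair. The pair ('', NULL) 
is the interesting case: the coalesce comparison alone would match, and the 
isnull comparison rules it out.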



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
