[ 
https://issues.apache.org/jira/browse/HIVE-20867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675678#comment-16675678
 ] 

Gopal V edited comment on HIVE-20867 at 11/5/18 7:41 PM:
---------------------------------------------------------

bq. Comparing with left-semi join one, we need to do the join one by one.

That is true at the logical level, but the physical runtime uses the bloom 
filter semi-join before data-gets shuffled & the join operations actually do 
not force a full sort (while the group-by does).

The current implementation's achilles heel is when the two branches are unequal 
in size - if you run something like

{code}
select ss_item_sk as k from store_sales
intersects 
select i_item_sk  as k from item where i_category = 'Sports'
{code}

vs

{code}
select distinct ss_item_sk as k from store_sales where ss_item_sk IN (select 
i_item_sk  as k from item where i_category = 'Sports').
{code}

you can see the vast difference between the two approaches.

However, I think this might need a null-safe semi-join, because within 
intersects null == null.


was (Author: gopalv):
bq. Comparing with left-semi join one, we need to do the join one by one.

That is true at the logical level, but the physical runtime uses the bloom 
filter semi-join before data-gets shuffled & the join operations actually do 
not force a full sort (while the group-by does).

The current implementation's achilles heel is when the two branches are unequal 
in size - if you run something like

{code}
select ss_item_sk as k from store_sales
intersects 
select i_item_sk  as k from item where i_category = 'Sports'
{code}

vs

{code}
select ss_item_sk as k from store_sales where ss_item_sk IN (select i_item_sk  
as k from item where i_category = 'Sports').
{code}

you can see the vast difference between the two approaches.

However, I think this might need a null-safe semi-join, because within 
intersects null == null.

> Rewrite INTERSECT into LEFT SEMI JOIN instead of UNION + Group by
> -----------------------------------------------------------------
>
>                 Key: HIVE-20867
>                 URL: https://issues.apache.org/jira/browse/HIVE-20867
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Planning
>    Affects Versions: 4.0.0
>            Reporter: Vineet Garg
>            Assignee: Vineet Garg
>            Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to