[
https://issues.apache.org/jira/browse/HIVE-20867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675678#comment-16675678
]
Gopal V edited comment on HIVE-20867 at 11/5/18 7:41 PM:
---------------------------------------------------------
bq. Comparing with left-semi join one, we need to do the join one by one.
That is true at the logical level, but the physical runtime uses the bloom
filter semi-join before data-gets shuffled & the join operations actually do
not force a full sort (while the group-by does).
The current implementation's achilles heel is when the two branches are unequal
in size - if you run something like
{code}
select ss_item_sk as k from store_sales
intersects
select i_item_sk as k from item where i_category = 'Sports'
{code}
vs
{code}
select distinct ss_item_sk as k from store_sales where ss_item_sk IN (select
i_item_sk as k from item where i_category = 'Sports').
{code}
you can see the vast difference between the two approaches.
However, I think this might need a null-safe semi-join, because within
intersects null == null.
was (Author: gopalv):
bq. Comparing with left-semi join one, we need to do the join one by one.
That is true at the logical level, but the physical runtime uses the bloom
filter semi-join before data-gets shuffled & the join operations actually do
not force a full sort (while the group-by does).
The current implementation's achilles heel is when the two branches are unequal
in size - if you run something like
{code}
select ss_item_sk as k from store_sales
intersects
select i_item_sk as k from item where i_category = 'Sports'
{code}
vs
{code}
select ss_item_sk as k from store_sales where ss_item_sk IN (select i_item_sk
as k from item where i_category = 'Sports').
{code}
you can see the vast difference between the two approaches.
However, I think this might need a null-safe semi-join, because within
intersects null == null.
> Rewrite INTERSECT into LEFT SEMI JOIN instead of UNION + Group by
> -----------------------------------------------------------------
>
> Key: HIVE-20867
> URL: https://issues.apache.org/jira/browse/HIVE-20867
> Project: Hive
> Issue Type: Improvement
> Components: Query Planning
> Affects Versions: 4.0.0
> Reporter: Vineet Garg
> Assignee: Vineet Garg
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)