[ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465000#comment-16465000
 ] 

Liang-Chi Hsieh commented on SPARK-21274:
-----------------------------------------

I read the design doc. It looks correct to me. I found a rewrite rule in Presto 
for INTERSECT that seems more simple to me at 
[https://github.com/prestodb/presto/issues/4918#issuecomment-207106688.]

That rule can be used to do INTERSECT ALL and EXCEPT ALL, if I don't miss 
anything.

For example, to do INTERSECT ALL like:

{{SELECT a FROM foo INTERSECT ALL SELECT x FROM bar}}

{{We can rewrite it as:}}
{code:java}
SELECT a FROM (
  SELECT replicate_row(min_count, a) AS (min_count, a) FROM (
    SELECT a, COUNT(foo_marker) AS foo_cnt, COUNT(bar_marker) AS bar_cnt, 
IF(COUNT(foo_marker) > COUNT(bar_marker), COUNT(bar_marker), COUNT(foo_marker)) 
AS min_count
    FROM (
      SELECT a, true as foo_marker, null as bar_marker FROM foo
      UNION ALL
      SELECT x, null as foo_marker, true as bar_marker FROM bar
    ) T1
    GROUP BY a) T2
  WHERE foo_cnt >= 1 AND bar_cnt >= 1
)
{code}
 

One advantage of that rewrite rule is the rules of INTERSECT ALL and EXCEPT ALL 
are more similar to each other.

Another one is for INTERSECT ALL, it only needs one GROUP BY instead of three 
GROUP BY in current design.

WDYT?

 

 

 

> Implement EXCEPT ALL and INTERSECT ALL
> --------------------------------------
>
>                 Key: SPARK-21274
>                 URL: https://issues.apache.org/jira/browse/SPARK-21274
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0
>            Reporter: Ruslan Dautkhanov
>            Priority: Major
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following outer join:
> {code}
> SELECT a,b,c
> FROM    tab1 t1
>      LEFT OUTER JOIN 
>         tab2 t2
>      ON (
>         (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>      )
> WHERE
>     COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register as a temp.view this second query under "*t1_except_t2_df*" name 
> that can be also used to find INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following anti-join using t1_except_t2_df we defined 
> above:
> {code}
> SELECT a,b,c
> FROM    tab1 t1
> WHERE 
>    NOT EXISTS
>    (    SELECT 1
>         FROM    t1_except_t2_df e
>         WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>    )
> {code}
> So the suggestion is just to use above query rewrites to implement both 
> EXCEPT ALL and INTERSECT ALL sql set operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to