[jira] [Comment Edited] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

Takeshi Yamamuro (JIRA) Tue, 16 Jan 2018 16:06:32 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16328028#comment-16328028
 ]


Takeshi Yamamuro edited comment on SPARK-21274 at 1/17/18 12:05 AM:
--------------------------------------------------------------------

yea, I tried though, I couldn't find a rewriting rule into agg and join.
 Also, the proposed one in the description is wrong;
{code:java}
scala> Seq((0, 0), (1, 1), (2, 2), (2, 2)).toDF("a", 
"b").write.saveAsTable("b1")
scala> Seq((1, 1), (1, 1), (2, 2)).toDF("a", "b").write.saveAsTable("b2")
scala> sql("""
     |   SELECT * FROM b1 LEFT OUTER JOIN b2 ON b1.a = b2.a AND b1.b = b2.b 
     |   WHERE COALESCE(b2.a, b2.b) IS NULL
     | """).show
+---+---+----+----+
|  a|  b|   a|   b|
+---+---+----+----+
|  0|  0|null|null|
+---+---+----+----+
// The correct one is;
postgres=# SELECT * FROM b1 EXCEPT ALL SELECT * FROM b2;
 a | b 
---+---
 2 | 2
 0 | 0
(2 rows)
{code}
I think the root cause of this difficulty is that we can't tell a difference 
between left-value and right-value duplication from outer-join results;
{code:java}
scala> sql("""
     |   SELECT * FROM b1 LEFT OUTER JOIN b2 ON b1.a = b2.a AND b1.b = b2.b 
     | """).show
+---+---+----+----+
|  a|  b|   a|   b|
+---+---+----+----+
|  0|  0|null|null|
|  1|  1|   1|   1|
|  1|  1|   1|   1|
|  2|  2|   2|   2|
|  2|  2|   2|   2|
+---+---+----+----+
{code}


was (Author: maropu):
yea, I tried though, I couldn't find a rewriting rule into agg and join.
 Also, The proposed one in the description is wrong;
{code:java}
scala> Seq((0, 0), (1, 1), (2, 2), (2, 2)).toDF("a", 
"b").write.saveAsTable("b1")
scala> Seq((1, 1), (1, 1), (2, 2)).toDF("a", "b").write.saveAsTable("b2")
scala> sql("""
     |   SELECT * FROM b1 LEFT OUTER JOIN b2 ON b1.a = b2.a AND b1.b = b2.b 
     |   WHERE COALESCE(b2.a, b2.b) IS NULL
     | """).show
+---+---+----+----+
|  a|  b|   a|   b|
+---+---+----+----+
|  0|  0|null|null|
+---+---+----+----+
// The correct one is;
postgres=# SELECT * FROM b1 EXCEPT ALL SELECT * FROM b2;
 a | b 
---+---
 2 | 2
 0 | 0
(2 rows)
{code}
I think the root cause of this difficulty is that we can't tell a difference 
between left-value and right-value duplication from outer-join results;
{code:java}
scala> sql("""
     |   SELECT * FROM b1 LEFT OUTER JOIN b2 ON b1.a = b2.a AND b1.b = b2.b 
     | """).show
+---+---+----+----+
|  a|  b|   a|   b|
+---+---+----+----+
|  0|  0|null|null|
|  1|  1|   1|   1|
|  1|  1|   1|   1|
|  2|  2|   2|   2|
|  2|  2|   2|   2|
+---+---+----+----+
{code}

> Implement EXCEPT ALL and INTERSECT ALL
> --------------------------------------
>
>                 Key: SPARK-21274
>                 URL: https://issues.apache.org/jira/browse/SPARK-21274
>             Project: Spark
>          Issue Type: New Feature
>          Components: Optimizer, SQL
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0
>            Reporter: Ruslan Dautkhanov
>            Priority: Major
>              Labels: set, sql
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following outer join:
> {code}
> SELECT a,b,c
> FROM    tab1 t1
>      LEFT OUTER JOIN 
>         tab2 t2
>      ON (
>         (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>      )
> WHERE
>     COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register as a temp.view this second query under "*t1_except_t2_df*" name 
> that can be also used to find INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following anti-join using t1_except_t2_df we defined 
> above:
> {code}
> SELECT a,b,c
> FROM    tab1 t1
> WHERE 
>    NOT EXISTS
>    (    SELECT 1
>         FROM    t1_except_t2_df e
>         WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>    )
> {code}
> So the suggestion is just to use above query rewrites to implement both 
> EXCEPT ALL and INTERSECT ALL sql set operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

Reply via email to