[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16328028#comment-16328028 ]
Takeshi Yamamuro edited comment on SPARK-21274 at 1/17/18 12:05 AM: -------------------------------------------------------------------- yea, I tried though, I couldn't find a rewriting rule into agg and join. Also, the proposed one in the description is wrong; {code:java} scala> Seq((0, 0), (1, 1), (2, 2), (2, 2)).toDF("a", "b").write.saveAsTable("b1") scala> Seq((1, 1), (1, 1), (2, 2)).toDF("a", "b").write.saveAsTable("b2") scala> sql(""" | SELECT * FROM b1 LEFT OUTER JOIN b2 ON b1.a = b2.a AND b1.b = b2.b | WHERE COALESCE(b2.a, b2.b) IS NULL | """).show +---+---+----+----+ | a| b| a| b| +---+---+----+----+ | 0| 0|null|null| +---+---+----+----+ // The correct one is; postgres=# SELECT * FROM b1 EXCEPT ALL SELECT * FROM b2; a | b ---+--- 2 | 2 0 | 0 (2 rows) {code} I think the root cause of this difficulty is that we can't tell a difference between left-value and right-value duplication from outer-join results; {code:java} scala> sql(""" | SELECT * FROM b1 LEFT OUTER JOIN b2 ON b1.a = b2.a AND b1.b = b2.b | """).show +---+---+----+----+ | a| b| a| b| +---+---+----+----+ | 0| 0|null|null| | 1| 1| 1| 1| | 1| 1| 1| 1| | 2| 2| 2| 2| | 2| 2| 2| 2| +---+---+----+----+ {code} was (Author: maropu): yea, I tried though, I couldn't find a rewriting rule into agg and join. Also, The proposed one in the description is wrong; {code:java} scala> Seq((0, 0), (1, 1), (2, 2), (2, 2)).toDF("a", "b").write.saveAsTable("b1") scala> Seq((1, 1), (1, 1), (2, 2)).toDF("a", "b").write.saveAsTable("b2") scala> sql(""" | SELECT * FROM b1 LEFT OUTER JOIN b2 ON b1.a = b2.a AND b1.b = b2.b | WHERE COALESCE(b2.a, b2.b) IS NULL | """).show +---+---+----+----+ | a| b| a| b| +---+---+----+----+ | 0| 0|null|null| +---+---+----+----+ // The correct one is; postgres=# SELECT * FROM b1 EXCEPT ALL SELECT * FROM b2; a | b ---+--- 2 | 2 0 | 0 (2 rows) {code} I think the root cause of this difficulty is that we can't tell a difference between left-value and right-value duplication from outer-join results; {code:java} scala> sql(""" | SELECT * FROM b1 LEFT OUTER JOIN b2 ON b1.a = b2.a AND b1.b = b2.b | """).show +---+---+----+----+ | a| b| a| b| +---+---+----+----+ | 0| 0|null|null| | 1| 1| 1| 1| | 1| 1| 1| 1| | 2| 2| 2| 2| | 2| 2| 2| 2| +---+---+----+----+ {code} > Implement EXCEPT ALL and INTERSECT ALL > -------------------------------------- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL > Affects Versions: 2.0.0, 2.1.0, 2.2.0 > Reporter: Ruslan Dautkhanov > Priority: Major > Labels: set, sql > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROM tab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROM tab1 t1 > WHERE > NOT EXISTS > ( SELECT 1 > FROM t1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) > ) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org