[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566399#comment-16566399 ] Apache Spark commented on SPARK-21274: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/21963 > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Assignee: Dilip Biswal >Priority: Major > Fix For: 2.4.0 > > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558687#comment-16558687 ] Apache Spark commented on SPARK-21274: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/21886 > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553924#comment-16553924 ] Apache Spark commented on SPARK-21274: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/21857 > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551968#comment-16551968 ] Dilip Biswal commented on SPARK-21274: -- [~maropu] Thanks a lot for the info. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551965#comment-16551965 ] Takeshi Yamamuro commented on SPARK-21274: -- ok, thanks! > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551954#comment-16551954 ] Dilip Biswal commented on SPARK-21274: -- [~maropu] Hi Takeshi, yeah.. So the code that does the rewrite is already in place. I am looking into a alternate way to ReplicateRows and will update in a few days. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551921#comment-16551921 ] Takeshi Yamamuro commented on SPARK-21274: -- Anybody still working on this? > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16467056#comment-16467056 ] Takeshi Yamamuro commented on SPARK-21274: -- yea, I'm interested in the performance differences between the doc one and the presto one. I like a simpler rewriting rule and a faster query. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465010#comment-16465010 ] Liang-Chi Hsieh commented on SPARK-21274: - [~dkbiswal] No problem. Current EXCEPT ALL rewrite is also one GROUP BY, I think it is fine to keep with. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465008#comment-16465008 ] Dilip Biswal commented on SPARK-21274: -- [~viirya] Thank you. I will study this more tomorrow. From a quick peek, it looks reasonable to me. Since our current EXCEPT ALL rewrite is simpler, we can stay with that (given both hive and db2 do it the same way) ? And change the INTERSECT ALL rewrite after studying it a little more. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465000#comment-16465000 ] Liang-Chi Hsieh commented on SPARK-21274: - I read the design doc. It looks correct to me. I found a rewrite rule in Presto for INTERSECT that seems more simple to me at [https://github.com/prestodb/presto/issues/4918#issuecomment-207106688.] That rule can be used to do INTERSECT ALL and EXCEPT ALL, if I don't miss anything. For example, to do INTERSECT ALL like: {{SELECT a FROM foo INTERSECT ALL SELECT x FROM bar}} {{We can rewrite it as:}} {code:java} SELECT a FROM ( SELECT replicate_row(min_count, a) AS (min_count, a) FROM ( SELECT a, COUNT(foo_marker) AS foo_cnt, COUNT(bar_marker) AS bar_cnt, IF(COUNT(foo_marker) > COUNT(bar_marker), COUNT(bar_marker), COUNT(foo_marker)) AS min_count FROM ( SELECT a, true as foo_marker, null as bar_marker FROM foo UNION ALL SELECT x, null as foo_marker, true as bar_marker FROM bar ) T1 GROUP BY a) T2 WHERE foo_cnt >= 1 AND bar_cnt >= 1 ) {code} One advantage of that rewrite rule is the rules of INTERSECT ALL and EXCEPT ALL are more similar to each other. Another one is for INTERSECT ALL, it only needs one GROUP BY instead of three GROUP BY in current design. WDYT? > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464448#comment-16464448 ] Apache Spark commented on SPARK-21274: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/21240 > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455734#comment-16455734 ] Dilip Biswal commented on SPARK-21274: -- [~maropu] Thanks. Yeah.. i will have two separate PRs. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455716#comment-16455716 ] Takeshi Yamamuro commented on SPARK-21274: -- ok, thanks! IMO it'd be better to make separate two prs to implement `EXCEPT ALL` and `INTERSECT ALL`. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455697#comment-16455697 ] Dilip Biswal commented on SPARK-21274: -- [~maropu] I am currently testing the code. I will open the PRs as soon as i finish. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455470#comment-16455470 ] Takeshi Yamamuro commented on SPARK-21274: -- Looks great to me. I checked the queries above worked well. Any plan to make a pr? > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16427689#comment-16427689 ] Dilip Biswal commented on SPARK-21274: -- Thank you [~smilegator] Here is the link. https://drive.google.com/open?id=1nyW0T0b_ajUduQoPgZLAsyHK8s3_dko3ulQuxaLpUXE > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16427603#comment-16427603 ] Xiao Li commented on SPARK-21274: - [~dkbiswal] [~ioana-delaney] Could you post your design doc here? > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16328028#comment-16328028 ] Takeshi Yamamuro commented on SPARK-21274: -- yea, I tried though, I couldn't find a rewriting rule into agg and join. Also, The proposed one in the description is wrong; {code:java} scala> Seq((0, 0), (1, 1), (2, 2), (2, 2)).toDF("a", "b").write.saveAsTable("b1") scala> Seq((1, 1), (1, 1), (2, 2)).toDF("a", "b").write.saveAsTable("b2") scala> sql(""" | SELECT * FROM b1 LEFT OUTER JOIN b2 ON b1.a = b2.a AND b1.b = b2.b | WHERE COALESCE(b2.a, b2.b) IS NULL | """).show +---+---+++ | a| b| a| b| +---+---+++ | 0| 0|null|null| +---+---+++ // The correct one is; postgres=# SELECT * FROM b1 EXCEPT ALL SELECT * FROM b2; a | b ---+--- 2 | 2 0 | 0 (2 rows) {code} I think the root cause of this difficulty is that we can't tell a difference between left-value and right-value duplication from outer-join results; {code:java} scala> sql(""" | SELECT * FROM b1 LEFT OUTER JOIN b2 ON b1.a = b2.a AND b1.b = b2.b | """).show +---+---+++ | a| b| a| b| +---+---+++ | 0| 0|null|null| | 1| 1| 1| 1| | 1| 1| 1| 1| | 2| 2| 2| 2| | 2| 2| 2| 2| +---+---+++ {code} > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > Labels: set, sql > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16327453#comment-16327453 ] Reynold Xin commented on SPARK-21274: - Can't we rewrite this as two aggregates and a join? > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > Labels: set, sql > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326119#comment-16326119 ] Takeshi Yamamuro commented on SPARK-21274: -- Nobody takes this? If so, I'll do (Probably, this target might be v2.4 and I think we need not be in a hurry though). I looked around the related code and I think it's some difficult to rewrite EXCEPT/INTERSECT ALL queries into existing physical plans. So, IMHO we might need to add new physical plans for EXCEPT/INTERSECT ALL (This is my quick prototype for this ticket: https://github.com/apache/spark/compare/master...maropu:SetAllSpike). > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > Labels: set, sql > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16107749#comment-16107749 ] Ruslan Dautkhanov commented on SPARK-21274: --- [~viirya], you're right. I've checked now on both PostgreSQL and Oracle. Sorry for the confusion. So the more generic rule is, [from Oracle documentation|https://docs.oracle.com/database/121/SQLRF/operators006.htm#sthref913]: {quote} For example, if a particular value occurs *m* times in nested_table1 and *n* times in nested_table2, then the result would contain the element *min(m,n)* times. ALL is the default. {quote} It's interesting that Oracle doesn't support "intersect all" on simple table sets, but only on nested table sets (through "multisets"): {code} CREATE TYPE sets_test_typ AS object ( num number ); CREATE TYPE sets_test_tab_typ AS TABLE OF sets_test_typ; with tab1 as (select 1 as z from dual union all select 2 from dual union all select 2 from dual union all select 2 from dual union all select 2 from dual ) , tab2 as ( select 1 as z from dual union all select 2 from dual union all select 2 from dual union all select 2 from dual ) SELECT * FROM table( cast(multiset(select z from tab1) as sets_test_tab_typ) multiset intersect ALL cast(multiset(select z from tab2) as sets_test_tab_typ) ) ; {code} So Oracle has returned "2" three times = min(3,4). Same test case in PostgreSQL: {code} scm=> with tab1 as (select 1 as z scm(> union all select 2 scm(> union all select 2 scm(> union all select 2 scm(> union all select 2 scm(> ) scm->, tab2 as ( scm(> select 1 as z scm(> union all select 2 scm(> union all select 2 scm(> union all select 2 scm(> ) scm-> SELECT z FROM tab1 scm-> INTERSECT all scm-> SELECT z FROM tab2 scm-> ; z --- 1 2 2 2 (4 rows) {code} The bottom line is that you're right, the above approach wouldn't work as you noticed. I still believe though it might be easier to implement except all / intersect all through query rewrite. For example, run group by on both sets, on target list of columns, do full outer join, find min between aggregate counts, and inject rows to final result set according that that min(n,m) count. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov > Labels: set, sql > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105919#comment-16105919 ] Liang-Chi Hsieh commented on SPARK-21274: - [~Tagar] I've tried the query on PostgreSQL, the answer of [1, 2, 2] intersect_all [1, 2] is [1, 2]. So I think it's correct? How do we know we need to change the tables when rewriting the intersect query? > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov > Labels: set, sql > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105127#comment-16105127 ] Ruslan Dautkhanov commented on SPARK-21274: --- [~viirya], yes it returns {noformat}[1, 2, 2]{noformat} for both of the queries. I don't think [1, 2] is the correct behavior for the first query. EXCEPT ALL which returns all records from the *first* table which are not present in the second table, leaving the duplicates as is. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov > Labels: set, sql > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16102969#comment-16102969 ] Liang-Chi Hsieh commented on SPARK-21274: - [~Tagar] Is the rewrite of INTERSECT ALL correct? Take the example at https://github.com/apache/spark/pull/11106#issuecomment-182603275: {code} [1, 2, 2] intersect_all [1, 2] == [1, 2] [1, 2, 2] intersect_all [1, 2, 2] == [1, 2, 2] {code} Looks like the rewrite returns [1, 2, 2] for two queries. Isn't? Or I misread something? > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov > Labels: set, sql > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070982#comment-16070982 ] Ruslan Dautkhanov commented on SPARK-21274: --- [~rxin], I wish I could. We only use PySpark and SQL API to Spark. Hope this will be useful to someone versatile with Scala. Thank you for considering this. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov > Labels: set, sql > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070890#comment-16070890 ] Reynold Xin commented on SPARK-21274: - Do you want to submit a pull request? > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov > Labels: set, sql > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070884#comment-16070884 ] Ruslan Dautkhanov commented on SPARK-21274: --- For INTERSECT ALL I was also experimenting with OUTER JOINs and NOT IN, but set on NOT EXISTS anti-joins as it was giving best performance results (tested both on Spark 2.1). > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov > Labels: set, sql > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org