Repository: spark
Updated Branches:
  refs/heads/master b3f2911ee -> 73dd6cf9b
[SPARK-24966][SQL] Implement precedence rules for set operations.

## What changes were proposed in this pull request?

Currently the set operations INTERSECT, UNION and EXCEPT are all assigned the same precedence. This PR fixes that by giving INTERSECT higher precedence than UNION and EXCEPT; UNION and EXCEPT operators are then evaluated from left to right in the order in which they appear in the query. This is a behavior change, since the evaluation order of set operators in a query changes. The old behavior is still available under a newly added config parameter.

Query:
```
SELECT * FROM t1 UNION SELECT * FROM t2 EXCEPT SELECT * FROM t3 INTERSECT SELECT * FROM t4
```

Parsed plan before the change:
```
== Parsed Logical Plan ==
'Intersect false
:- 'Except false
:  :- 'Distinct
:  :  +- 'Union
:  :     :- 'Project [*]
:  :     :  +- 'UnresolvedRelation `t1`
:  :     +- 'Project [*]
:  :        +- 'UnresolvedRelation `t2`
:  +- 'Project [*]
:     +- 'UnresolvedRelation `t3`
+- 'Project [*]
   +- 'UnresolvedRelation `t4`
```

Parsed plan after the change:
```
== Parsed Logical Plan ==
'Except false
:- 'Distinct
:  +- 'Union
:     :- 'Project [*]
:     :  +- 'UnresolvedRelation `t1`
:     +- 'Project [*]
:        +- 'UnresolvedRelation `t2`
+- 'Intersect false
   :- 'Project [*]
   :  +- 'UnresolvedRelation `t3`
   +- 'Project [*]
      +- 'UnresolvedRelation `t4`
```

## How was this patch tested?

Added tests in PlanParserSuite and SQLQueryTestSuite.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Dilip Biswal <dbis...@us.ibm.com>

Closes #21941 from dilipbiswal/SPARK-24966.
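The behavioral difference described above can be sketched with plain Python sets. The row values for `t1`–`t4` below are made up for illustration (the actual tables are not part of the PR); only the shape of the computation mirrors the precedence change:

```python
# Hypothetical distinct-row contents for t1..t4 (illustrative only).
t1 = {1, 2, 3}
t2 = {3, 4}
t3 = {2, 4}
t4 = {4, 5}

# Old behavior (equal precedence, folded left to right):
#   ((t1 UNION t2) EXCEPT t3) INTERSECT t4
old_result = ((t1 | t2) - t3) & t4

# New behavior (INTERSECT binds tighter, per the SQL standard):
#   (t1 UNION t2) EXCEPT (t3 INTERSECT t4)
new_result = (t1 | t2) - (t3 & t4)

print(sorted(old_result))  # []
print(sorted(new_result))  # [1, 2, 3]
```

With these inputs the two readings of the same query produce disjoint answers, which is exactly why the legacy behavior is kept behind a config flag.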
Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/73dd6cf9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/73dd6cf9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/73dd6cf9

Branch: refs/heads/master
Commit: 73dd6cf9b558f9d752e1f3c13584344257ad7863
Parents: b3f2911
Author: Dilip Biswal <dbis...@us.ibm.com>
Authored: Thu Aug 2 22:04:17 2018 -0700
Committer: Xiao Li <gatorsm...@gmail.com>
Committed: Thu Aug 2 22:04:17 2018 -0700

----------------------------------------------------------------------
 docs/sql-programming-guide.md                   |   1 +
 .../apache/spark/sql/catalyst/parser/SqlBase.g4 |  15 ++-
 .../apache/spark/sql/catalyst/dsl/package.scala |   6 +-
 .../spark/sql/catalyst/parser/ParseDriver.scala |   2 +
 .../plans/logical/basicLogicalOperators.scala   |   6 +-
 .../org/apache/spark/sql/internal/SQLConf.scala |  12 +++
 .../sql/catalyst/parser/PlanParserSuite.scala   |  45 ++++++++
 .../spark/sql/execution/SparkStrategies.scala   |   4 +-
 .../sql-tests/inputs/intersect-all.sql          |  51 +++++++-
 .../sql-tests/results/intersect-all.sql.out     | 104 +++++++++++++++----
 10 files changed, 211 insertions(+), 35 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/73dd6cf9/docs/sql-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 0900f83..a1e019c 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1876,6 +1876,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.3 to 2.4

+  - Since Spark 2.4, Spark will evaluate the set operations referenced in a query by following a precedence rule as per the SQL standard. If the order is not specified by parentheses, set operations are performed from left to right, with the exception that all INTERSECT operations are performed before any UNION, EXCEPT or MINUS operations. The old behaviour of giving equal precedence to all the set operations is preserved under a newly added configuration `spark.sql.legacy.setopsPrecedence.enabled` with a default value of `false`. When this property is set to `true`, Spark evaluates the set operators from left to right as they appear in the query, given no explicit ordering is enforced by the use of parentheses.
   - Since Spark 2.4, Spark will display the table description column Last Access value as UNKNOWN when the value was Jan 01 1970.
   - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively.
   - In PySpark, when Arrow optimization is enabled, previously `toPandas` just failed when Arrow optimization is unable to be used whereas `createDataFrame` from Pandas DataFrame allowed the fallback to non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by default, which can be switched off by `spark.sql.execution.arrow.fallback.enabled`.
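For reference, the legacy behaviour described in the migration note above would be toggled like any other SQL configuration property — a sketch using the flag this patch adds:

```sql
-- Revert to the pre-2.4 left-to-right evaluation of set operators
SET spark.sql.legacy.setopsPrecedence.enabled = true;

-- Restore the SQL-standard precedence (the 2.4 default)
SET spark.sql.legacy.setopsPrecedence.enabled = false;
```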
http://git-wip-us.apache.org/repos/asf/spark/blob/73dd6cf9/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
----------------------------------------------------------------------
diff --git a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
index 9ad6f30..94283f5 100644
--- a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
+++ b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
@@ -18,6 +18,12 @@ grammar SqlBase;

 @members {
   /**
+   * When false, INTERSECT is given the greater precedence over the other set
+   * operations (UNION, EXCEPT and MINUS) as per the SQL standard.
+   */
+  public boolean legacy_setops_precedence_enbled = false;
+
+  /**
    * Verify whether current token is a valid decimal token (which contains dot).
    * Returns true if the character that follows the token is not a digit or letter or underscore.
    *
@@ -352,8 +358,13 @@ multiInsertQueryBody
     ;

 queryTerm
-    : queryPrimary                                                                         #queryTermDefault
-    | left=queryTerm operator=(INTERSECT | UNION | EXCEPT | SETMINUS) setQuantifier? right=queryTerm  #setOperation
+    : queryPrimary                                                                         #queryTermDefault
+    | left=queryTerm {legacy_setops_precedence_enbled}?
+        operator=(INTERSECT | UNION | EXCEPT | SETMINUS) setQuantifier? right=queryTerm    #setOperation
+    | left=queryTerm {!legacy_setops_precedence_enbled}?
+        operator=INTERSECT setQuantifier? right=queryTerm                                  #setOperation
+    | left=queryTerm {!legacy_setops_precedence_enbled}?
+        operator=(UNION | EXCEPT | SETMINUS) setQuantifier? right=queryTerm                #setOperation
     ;

 queryPrimary

http://git-wip-us.apache.org/repos/asf/spark/blob/73dd6cf9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
----------------------------------------------------------------------
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
index 9870854..7997e79 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
@@ -356,9 +356,11 @@ package object dsl {
       def subquery(alias: Symbol): LogicalPlan = SubqueryAlias(alias.name, logicalPlan)

-      def except(otherPlan: LogicalPlan): LogicalPlan = Except(logicalPlan, otherPlan)
+      def except(otherPlan: LogicalPlan, isAll: Boolean = false): LogicalPlan =
+        Except(logicalPlan, otherPlan, isAll)

-      def intersect(otherPlan: LogicalPlan): LogicalPlan = Intersect(logicalPlan, otherPlan)
+      def intersect(otherPlan: LogicalPlan, isAll: Boolean = false): LogicalPlan =
+        Intersect(logicalPlan, otherPlan, isAll)

       def union(otherPlan: LogicalPlan): LogicalPlan = Union(logicalPlan, otherPlan)

http://git-wip-us.apache.org/repos/asf/spark/blob/73dd6cf9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala
----------------------------------------------------------------------
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala
index 4c20f23..7d8cb1f 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParseDriver.scala
@@ -84,12 +84,14 @@ abstract class AbstractSqlParser extends ParserInterface with Logging {
     val lexer = new SqlBaseLexer(new UpperCaseCharStream(CharStreams.fromString(command)))
     lexer.removeErrorListeners()
     lexer.addErrorListener(ParseErrorListener)
+    lexer.legacy_setops_precedence_enbled = SQLConf.get.setOpsPrecedenceEnforced

     val tokenStream = new CommonTokenStream(lexer)
     val parser = new SqlBaseParser(tokenStream)
     parser.addParseListener(PostProcessor)
     parser.removeErrorListeners()
     parser.addErrorListener(ParseErrorListener)
+    parser.legacy_setops_precedence_enbled = SQLConf.get.setOpsPrecedenceEnforced

     try {
       try {

http://git-wip-us.apache.org/repos/asf/spark/blob/73dd6cf9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
----------------------------------------------------------------------
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
index 68413d7..d7dbdb3 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
@@ -165,9 +165,9 @@ object SetOperation {
 }

 case class Intersect(
-  left: LogicalPlan,
-  right: LogicalPlan,
-  isAll: Boolean = false) extends SetOperation(left, right) {
+    left: LogicalPlan,
+    right: LogicalPlan,
+    isAll: Boolean = false) extends SetOperation(left, right) {

   override def nodeName: String = getClass.getSimpleName + ( if ( isAll ) "All" else "" )

http://git-wip-us.apache.org/repos/asf/spark/blob/73dd6cf9/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
----------------------------------------------------------------------
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 2aba2f7..67c3abb 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -1466,6 +1466,16 @@ object SQLConf {
     .intConf
     .checkValues((1 to 9).toSet + Deflater.DEFAULT_COMPRESSION)
     .createWithDefault(Deflater.DEFAULT_COMPRESSION)
+
+  val LEGACY_SETOPS_PRECEDENCE_ENABLED =
+    buildConf("spark.sql.legacy.setopsPrecedence.enabled")
+      .internal()
+      .doc("When set to true and the order of evaluation is not specified by parentheses, the " +
+        "set operations are performed from left to right as they appear in the query. When set " +
+        "to false and order of evaluation is not specified by parentheses, INTERSECT operations " +
+        "are performed before any UNION, EXCEPT and MINUS operations.")
+      .booleanConf
+      .createWithDefault(false)
 }

 /**
@@ -1861,6 +1871,8 @@ class SQLConf extends Serializable with Logging {

   def avroDeflateLevel: Int = getConf(SQLConf.AVRO_DEFLATE_LEVEL)

+  def setOpsPrecedenceEnforced: Boolean = getConf(SQLConf.LEGACY_SETOPS_PRECEDENCE_ENABLED)
+
   /** ********************** SQLConf functionality methods ************ */

   /** Set Spark SQL configuration properties. */

http://git-wip-us.apache.org/repos/asf/spark/blob/73dd6cf9/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala
----------------------------------------------------------------------
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala
index 9be0ec5..38efd89 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala
@@ -22,6 +22,7 @@ import org.apache.spark.sql.catalyst.analysis.{AnalysisTest, UnresolvedAttribute
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types.IntegerType

 /**
@@ -676,4 +677,48 @@ class PlanParserSuite extends AnalysisTest {
       OneRowRelation().select('rtrim.function("c&^,.", "bc...,,,&&&ccc"))
     )
   }
+
+  test("precedence of set operations") {
+    val a = table("a").select(star())
+    val b = table("b").select(star())
+    val c = table("c").select(star())
+    val d = table("d").select(star())
+
+    val query1 =
+      """
+        |SELECT * FROM a
+        |UNION
+        |SELECT * FROM b
+        |EXCEPT
+        |SELECT * FROM c
+        |INTERSECT
+        |SELECT * FROM d
+      """.stripMargin
+
+    val query2 =
+      """
+        |SELECT * FROM a
+        |UNION
+        |SELECT * FROM b
+        |EXCEPT ALL
+        |SELECT * FROM c
+        |INTERSECT ALL
+        |SELECT * FROM d
+      """.stripMargin
+
+    assertEqual(query1, Distinct(a.union(b)).except(c.intersect(d)))
+    assertEqual(query2, Distinct(a.union(b)).except(c.intersect(d, isAll = true), isAll = true))
+
+    // Now disable precedence enforcement to verify the old behaviour.
+    withSQLConf(SQLConf.LEGACY_SETOPS_PRECEDENCE_ENABLED.key -> "true") {
+      assertEqual(query1, Distinct(a.union(b)).except(c).intersect(d))
+      assertEqual(query2, Distinct(a.union(b)).except(c, isAll = true).intersect(d, isAll = true))
+    }
+
+    // Explicitly enable the precedence enforcement
+    withSQLConf(SQLConf.LEGACY_SETOPS_PRECEDENCE_ENABLED.key -> "false") {
+      assertEqual(query1, Distinct(a.union(b)).except(c.intersect(d)))
+      assertEqual(query2, Distinct(a.union(b)).except(c.intersect(d, isAll = true), isAll = true))
+    }
+  }
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/73dd6cf9/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
index 75eff8a..b4179f4 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
@@ -535,14 +535,14 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
       case logical.Intersect(left, right, true) =>
         throw new IllegalStateException(
           "logical intersect operator should have been replaced by union, aggregate" +
-          "and generate operators in the optimizer")
+          " and generate operators in the optimizer")
       case logical.Except(left, right, false) =>
         throw new IllegalStateException(
           "logical except operator should have been replaced by anti-join in the optimizer")
       case logical.Except(left, right, true) =>
         throw new IllegalStateException(
           "logical except (all) operator should have been replaced by union, aggregate" +
-          "and generate operators in the optimizer")
+          " and generate operators in the optimizer")

       case logical.DeserializeToObject(deserializer, objAttr, child) =>
         execution.DeserializeToObjectExec(deserializer, objAttr, planLater(child)) :: Nil
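The two parsing modes exercised in the PlanParserSuite test above can be modeled as a small fold over Python sets. This is a sketch of the intended semantics only (distinct set operations, with made-up table contents), not of the ANTLR grammar or the Catalyst plans themselves:

```python
def eval_chain(operands, ops, intersect_first=True):
    """Evaluate t0 op0 t1 op1 t2 ... over Python sets.

    With intersect_first=True, every INTERSECT is applied before any
    UNION/EXCEPT, mirroring the post-2.4 precedence rule; with False,
    the chain is folded strictly left to right (the legacy behaviour).
    """
    apply = {
        "UNION": lambda a, b: a | b,
        "EXCEPT": lambda a, b: a - b,
        "INTERSECT": lambda a, b: a & b,
    }
    operands = [set(s) for s in operands]
    ops = list(ops)
    if intersect_first:
        # Collapse each INTERSECT into its left operand before folding.
        i = 0
        while i < len(ops):
            if ops[i] == "INTERSECT":
                operands[i] = operands[i] & operands[i + 1]
                del operands[i + 1], ops[i]
            else:
                i += 1
    result = operands[0]
    for op, rhs in zip(ops, operands[1:]):
        result = apply[op](result, rhs)
    return result

# Illustrative tables (not from the patch) for:
#   t1 UNION t2 EXCEPT t3 INTERSECT t4
tables = [{1, 2, 3}, {3, 4}, {2, 4}, {4, 5}]
ops = ["UNION", "EXCEPT", "INTERSECT"]

print(eval_chain(tables, ops, intersect_first=True))   # {1, 2, 3}
print(eval_chain(tables, ops, intersect_first=False))  # set()
```

The `intersect_first` flag plays the role of `spark.sql.legacy.setopsPrecedence.enabled` being `false`; flipping it reproduces the left-to-right fold that the legacy config preserves.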
http://git-wip-us.apache.org/repos/asf/spark/blob/73dd6cf9/sql/core/src/test/resources/sql-tests/inputs/intersect-all.sql
----------------------------------------------------------------------
diff --git a/sql/core/src/test/resources/sql-tests/inputs/intersect-all.sql b/sql/core/src/test/resources/sql-tests/inputs/intersect-all.sql
index ff4395c..b0b2244 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/intersect-all.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/intersect-all.sql
@@ -59,29 +59,40 @@ INTERSECT ALL
 SELECT * FROM tab2;

 -- Chain of different `set operations
--- We need to parenthesize the following two queries to enforce
--- certain order of evaluation of operators. After fix to
--- SPARK-24966 this can be removed.
 SELECT * FROM tab1
 EXCEPT
 SELECT * FROM tab2
 UNION ALL
-(
 SELECT * FROM tab1
 INTERSECT ALL
 SELECT * FROM tab2
-);
+;

 -- Chain of different `set operations
 SELECT * FROM tab1
 EXCEPT
 SELECT * FROM tab2
 EXCEPT
-(
 SELECT * FROM tab1
 INTERSECT ALL
 SELECT * FROM tab2
-);
+;
+
+-- Test use of parentheses to control the order of evaluation
+(
+  (
+    (
+      SELECT * FROM tab1
+      EXCEPT
+      SELECT * FROM tab2
+    )
+    EXCEPT
+    SELECT * FROM tab1
+  )
+  INTERSECT ALL
+  SELECT * FROM tab2
+)
+;

 -- Join under intersect all
 SELECT *
@@ -118,6 +129,32 @@ SELECT v FROM tab1 GROUP BY v
 INTERSECT ALL
 SELECT k FROM tab2 GROUP BY k;

+-- Test pre-Spark 2.4 behaviour of set operation precedence:
+-- all the set operators are given equal precedence and are evaluated
+-- from left to right as they appear in the query.
+
+-- Set the property
+SET spark.sql.legacy.setopsPrecedence.enabled = true;
+
+SELECT * FROM tab1
+EXCEPT
+SELECT * FROM tab2
+UNION ALL
+SELECT * FROM tab1
+INTERSECT ALL
+SELECT * FROM tab2;
+
+SELECT * FROM tab1
+EXCEPT
+SELECT * FROM tab2
+UNION ALL
+SELECT * FROM tab1
+INTERSECT
+SELECT * FROM tab2;
+
+-- Restore the property
+SET spark.sql.legacy.setopsPrecedence.enabled = false;
+
 -- Clean-up
 DROP VIEW IF EXISTS tab1;
 DROP VIEW IF EXISTS tab2;

http://git-wip-us.apache.org/repos/asf/spark/blob/73dd6cf9/sql/core/src/test/resources/sql-tests/results/intersect-all.sql.out
----------------------------------------------------------------------
diff --git a/sql/core/src/test/resources/sql-tests/results/intersect-all.sql.out b/sql/core/src/test/resources/sql-tests/results/intersect-all.sql.out
index 792791b..63dd56c 100644
--- a/sql/core/src/test/resources/sql-tests/results/intersect-all.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/intersect-all.sql.out
@@ -1,5 +1,5 @@
 -- Automatically generated by SQLQueryTestSuite
--- Number of queries: 17
+-- Number of queries: 22


 -- !query 0
@@ -133,11 +133,9 @@ SELECT * FROM tab1
 EXCEPT
 SELECT * FROM tab2
 UNION ALL
-(
 SELECT * FROM tab1
 INTERSECT ALL
 SELECT * FROM tab2
-)
 -- !query 10 schema
 struct<k:int,v:int>
 -- !query 10 output
@@ -154,11 +152,9 @@ SELECT * FROM tab1
 EXCEPT
 SELECT * FROM tab2
 EXCEPT
-(
 SELECT * FROM tab1
 INTERSECT ALL
 SELECT * FROM tab2
-)
 -- !query 11 schema
 struct<k:int,v:int>
 -- !query 11 output
@@ -166,6 +162,26 @@ struct<k:int,v:int>


 -- !query 12
+(
+  (
+    (
+      SELECT * FROM tab1
+      EXCEPT
+      SELECT * FROM tab2
+    )
+    EXCEPT
+    SELECT * FROM tab1
+  )
+  INTERSECT ALL
+  SELECT * FROM tab2
+)
+-- !query 12 schema
+struct<k:int,v:int>
+-- !query 12 output
+
+
+
+-- !query 13
 SELECT *
 FROM   (SELECT tab1.k,
               tab2.v
@@ -179,9 +195,9 @@ FROM   (SELECT tab1.k,
        FROM tab1
             JOIN tab2
               ON tab1.k = tab2.k)
--- !query 12 schema
+-- !query 13 schema
 struct<k:int,v:int>
--- !query 12 output
+-- !query 13 output
 1	2
 1	2
 1	2
@@ -193,7 +209,7 @@ struct<k:int,v:int>
 2	3


--- !query 13
+-- !query 14
 SELECT *
 FROM   (SELECT tab1.k,
               tab2.v
@@ -207,35 +223,85 @@ FROM   (SELECT tab2.v AS k,
        FROM tab1
             JOIN tab2
               ON tab1.k = tab2.k)
--- !query 13 schema
+-- !query 14 schema
 struct<k:int,v:int>
--- !query 13 output
+-- !query 14 output


--- !query 14
+-- !query 15
 SELECT v FROM tab1 GROUP BY v
 INTERSECT ALL
 SELECT k FROM tab2 GROUP BY k
--- !query 14 schema
+-- !query 15 schema
 struct<v:int>
--- !query 14 output
+-- !query 15 output
 2
 3
 NULL


--- !query 15
+-- !query 16
+SET spark.sql.legacy.setopsPrecedence.enabled = true
+-- !query 16 schema
+struct<key:string,value:string>
+-- !query 16 output
+spark.sql.legacy.setopsPrecedence.enabled	true
+
+
+-- !query 17
+SELECT * FROM tab1
+EXCEPT
+SELECT * FROM tab2
+UNION ALL
+SELECT * FROM tab1
+INTERSECT ALL
+SELECT * FROM tab2
+-- !query 17 schema
+struct<k:int,v:int>
+-- !query 17 output
+1	2
+1	2
+2	3
+NULL	NULL
+NULL	NULL
+
+
+-- !query 18
+SELECT * FROM tab1
+EXCEPT
+SELECT * FROM tab2
+UNION ALL
+SELECT * FROM tab1
+INTERSECT
+SELECT * FROM tab2
+-- !query 18 schema
+struct<k:int,v:int>
+-- !query 18 output
+1	2
+2	3
+NULL	NULL
+
+
+-- !query 19
+SET spark.sql.legacy.setopsPrecedence.enabled = false
+-- !query 19 schema
+struct<key:string,value:string>
+-- !query 19 output
+spark.sql.legacy.setopsPrecedence.enabled	false
+
+
+-- !query 20
 DROP VIEW IF EXISTS tab1
--- !query 15 schema
+-- !query 20 schema
 struct<>
--- !query 15 output
+-- !query 20 output


--- !query 16
+-- !query 21
 DROP VIEW IF EXISTS tab2
--- !query 16 schema
+-- !query 21 schema
 struct<>
--- !query 16 output
+-- !query 21 output

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org