[spark] branch branch-3.0 updated: [SPARK-28228][SQL] Change the default behavior for name conflict in nested WITH clause

dongjoon Sat, 08 Feb 2020 14:14:55 -0800

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 73eba31  [SPARK-28228][SQL] Change the default behavior for name 
conflict in nested WITH clause
73eba31 is described below

commit 73eba319f2ec548fd655a780ee3d240b24b6276d
Author: Yuanjian Li <[email protected]>
AuthorDate: Sat Feb 8 14:10:28 2020 -0800

    [SPARK-28228][SQL] Change the default behavior for name conflict in nested 
WITH clause
    
    ### What changes were proposed in this pull request?
    This is a follow-up for #25029, in this PR we throw an AnalysisException 
when name conflict is detected in nested WITH clause. In this way, the config 
`spark.sql.legacy.ctePrecedence.enabled` should be set explicitly for the 
expected behavior.
    
    ### Why are the changes needed?
    The original change might risky to end-users, it changes behavior silently.
    
    ### Does this PR introduce any user-facing change?
    Yes, change the config `spark.sql.legacy.ctePrecedence.enabled` as optional.
    
    ### How was this patch tested?
    New UT.
    
    Closes #27454 from xuanyuanking/SPARK-28228-follow.
    
    Authored-by: Yuanjian Li <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    (cherry picked from commit 3db3e39f1122350f55f305bee049363621c5894d)
    Signed-off-by: Dongjoon Hyun <[email protected]>
---
 docs/sql-migration-guide.md                        |  2 +-
 .../sql/catalyst/analysis/CTESubstitution.scala    | 49 +++++++++++++++++++++-
 .../org/apache/spark/sql/internal/SQLConf.scala    |  6 ++-
 .../resources/sql-tests/inputs/cte-nonlegacy.sql   |  2 +
 .../results/{cte.sql.out => cte-nonlegacy.sql.out} |  0
 .../test/resources/sql-tests/results/cte.sql.out   | 30 +++++++------
 6 files changed, 72 insertions(+), 17 deletions(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 5a5e802..be0fe32 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -101,7 +101,7 @@ license: |
 
   - Since Spark 3.0, if files or subdirectories disappear during recursive 
directory listing (i.e. they appear in an intermediate listing but then cannot 
be read or listed during later phases of the recursive directory listing, due 
to either concurrent file deletions or object store consistency issues) then 
the listing will fail with an exception unless 
`spark.sql.files.ignoreMissingFiles` is `true` (default `false`). In previous 
versions, these missing files or subdirectories would be i [...]
 
-  - Since Spark 3.0, substitution order of nested WITH clauses is changed and 
an inner CTE definition takes precedence over an outer. In version 2.4 and 
earlier, `WITH t AS (SELECT 1), t2 AS (WITH t AS (SELECT 2) SELECT * FROM t) 
SELECT * FROM t2` returns `1` while in version 3.0 it returns `2`. The previous 
behaviour can be restored by setting `spark.sql.legacy.ctePrecedence.enabled` 
to `true`.
+  - Since Spark 3.0, Spark throws an AnalysisException if name conflict is 
detected in the nested WITH clause by default. It forces the users to choose 
the specific substitution order they wanted, which is controlled by 
`spark.sql.legacy.ctePrecedence.enabled`. If set to false (which is 
recommended), inner CTE definitions take precedence over outer definitions. For 
example, set the config to `false`, `WITH t AS (SELECT 1), t2 AS (WITH t AS 
(SELECT 2) SELECT * FROM t) SELECT * FROM t2` re [...]
 
   - Since Spark 3.0, the `add_months` function does not adjust the resulting 
date to a last day of month if the original date is a last day of months. For 
example, `select add_months(DATE'2019-02-28', 1)` results `2019-03-28`. In 
Spark version 2.4 and earlier, the resulting date is adjusted when the original 
date is a last day of months. For example, adding a month to `2019-02-28` 
results in `2019-03-31`.
 
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala
index 60e6bf8..d2be15d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala
@@ -17,6 +17,7 @@
 
 package org.apache.spark.sql.catalyst.analysis
 
+import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.catalyst.expressions.SubqueryExpression
 import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, With}
 import org.apache.spark.sql.catalyst.rules.Rule
@@ -28,10 +29,54 @@ import 
org.apache.spark.sql.internal.SQLConf.LEGACY_CTE_PRECEDENCE_ENABLED
  */
 object CTESubstitution extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = {
-    if (SQLConf.get.getConf(LEGACY_CTE_PRECEDENCE_ENABLED)) {
+    val isLegacy = SQLConf.get.getConf(LEGACY_CTE_PRECEDENCE_ENABLED)
+    if (isLegacy.isEmpty) {
+      assertNoNameConflictsInCTE(plan, inTraverse = false)
+      traverseAndSubstituteCTE(plan, inTraverse = false)
+    } else if (isLegacy.get) {
       legacyTraverseAndSubstituteCTE(plan)
     } else {
-      traverseAndSubstituteCTE(plan, false)
+      traverseAndSubstituteCTE(plan, inTraverse = false)
+    }
+  }
+
+  /**
+   * Check the plan to be traversed has naming conflicts in nested CTE or not, 
traverse through
+   * child, innerChildren and subquery for the current plan.
+   */
+  private def assertNoNameConflictsInCTE(
+      plan: LogicalPlan,
+      inTraverse: Boolean,
+      cteNames: Set[String] = Set.empty): Unit = {
+    plan.foreach {
+      case w @ With(child, relations) =>
+        val newNames = relations.map {
+          case (cteName, _) =>
+            if (cteNames.contains(cteName)) {
+              throw new AnalysisException(s"Name $cteName is ambiguous in 
nested CTE. " +
+                s"Please set ${LEGACY_CTE_PRECEDENCE_ENABLED.key} to false so 
that name defined " +
+                "in inner CTE takes precedence. See more details in 
SPARK-28228.")
+            } else {
+              cteName
+            }
+        }.toSet
+        child.transformExpressions {
+          case e: SubqueryExpression =>
+            assertNoNameConflictsInCTE(e.plan, inTraverse = true, cteNames ++ 
newNames)
+            e
+        }
+        w.innerChildren.foreach { p =>
+          assertNoNameConflictsInCTE(p, inTraverse = true, cteNames ++ 
newNames)
+        }
+
+      case other if inTraverse =>
+        other.transformExpressions {
+          case e: SubqueryExpression =>
+            assertNoNameConflictsInCTE(e.plan, inTraverse = true, cteNames)
+            e
+        }
+
+      case _ =>
     }
   }
 
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index bed8410..a72bd53 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -2098,9 +2098,11 @@ object SQLConf {
 
   val LEGACY_CTE_PRECEDENCE_ENABLED = 
buildConf("spark.sql.legacy.ctePrecedence.enabled")
     .internal()
-    .doc("When true, outer CTE definitions takes precedence over inner 
definitions.")
+    .doc("When true, outer CTE definitions takes precedence over inner 
definitions. If set to " +
+      "false, inner CTE definitions take precedence. The default value is 
empty, " +
+      "AnalysisException is thrown while name conflict is detected in nested 
CTE.")
     .booleanConf
-    .createWithDefault(false)
+    .createOptional
 
   val LEGACY_ARRAY_EXISTS_FOLLOWS_THREE_VALUED_LOGIC =
     buildConf("spark.sql.legacy.arrayExistsFollowsThreeValuedLogic")
diff --git a/sql/core/src/test/resources/sql-tests/inputs/cte-nonlegacy.sql 
b/sql/core/src/test/resources/sql-tests/inputs/cte-nonlegacy.sql
new file mode 100644
index 0000000..b711bf3
--- /dev/null
+++ b/sql/core/src/test/resources/sql-tests/inputs/cte-nonlegacy.sql
@@ -0,0 +1,2 @@
+--SET spark.sql.legacy.ctePrecedence.enabled = false
+--IMPORT cte.sql
diff --git a/sql/core/src/test/resources/sql-tests/results/cte.sql.out 
b/sql/core/src/test/resources/sql-tests/results/cte-nonlegacy.sql.out
similarity index 100%
copy from sql/core/src/test/resources/sql-tests/results/cte.sql.out
copy to sql/core/src/test/resources/sql-tests/results/cte-nonlegacy.sql.out
diff --git a/sql/core/src/test/resources/sql-tests/results/cte.sql.out 
b/sql/core/src/test/resources/sql-tests/results/cte.sql.out
index 2d87781..1d50aa8 100644
--- a/sql/core/src/test/resources/sql-tests/results/cte.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/cte.sql.out
@@ -204,9 +204,10 @@ WITH
   )
 SELECT * FROM t2
 -- !query schema
-struct<2:int>
+struct<>
 -- !query output
-2
+org.apache.spark.sql.AnalysisException
+Name t is ambiguous in nested CTE. Please set 
spark.sql.legacy.ctePrecedence.enabled to false so that name defined in inner 
CTE takes precedence. See more details in SPARK-28228.;
 
 
 -- !query
@@ -222,9 +223,10 @@ WITH
   )
 SELECT * FROM t2
 -- !query schema
-struct<scalarsubquery():int>
+struct<>
 -- !query output
-2
+org.apache.spark.sql.AnalysisException
+Name t is ambiguous in nested CTE. Please set 
spark.sql.legacy.ctePrecedence.enabled to false so that name defined in inner 
CTE takes precedence. See more details in SPARK-28228.;
 
 
 -- !query
@@ -240,9 +242,10 @@ WITH
   )
 SELECT * FROM t2
 -- !query schema
-struct<3:int>
+struct<>
 -- !query output
-3
+org.apache.spark.sql.AnalysisException
+Name t is ambiguous in nested CTE. Please set 
spark.sql.legacy.ctePrecedence.enabled to false so that name defined in inner 
CTE takes precedence. See more details in SPARK-28228.;
 
 
 -- !query
@@ -293,9 +296,10 @@ SELECT (
   SELECT * FROM t
 )
 -- !query schema
-struct<scalarsubquery():int>
+struct<>
 -- !query output
-2
+org.apache.spark.sql.AnalysisException
+Name t is ambiguous in nested CTE. Please set 
spark.sql.legacy.ctePrecedence.enabled to false so that name defined in inner 
CTE takes precedence. See more details in SPARK-28228.;
 
 
 -- !query
@@ -307,9 +311,10 @@ SELECT (
   )
 )
 -- !query schema
-struct<scalarsubquery():int>
+struct<>
 -- !query output
-2
+org.apache.spark.sql.AnalysisException
+Name t is ambiguous in nested CTE. Please set 
spark.sql.legacy.ctePrecedence.enabled to false so that name defined in inner 
CTE takes precedence. See more details in SPARK-28228.;
 
 
 -- !query
@@ -322,9 +327,10 @@ SELECT (
   )
 )
 -- !query schema
-struct<scalarsubquery():int>
+struct<>
 -- !query output
-3
+org.apache.spark.sql.AnalysisException
+Name t is ambiguous in nested CTE. Please set 
spark.sql.legacy.ctePrecedence.enabled to false so that name defined in inner 
CTE takes precedence. See more details in SPARK-28228.;
 
 
 -- !query


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[spark] branch branch-3.0 updated: [SPARK-28228][SQL] Change the default behavior for name conflict in nested WITH clause

Reply via email to