This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.0 by this push:
new 1e5766c [SPARK-29462][SQL] The data type of "array()" should be array<null>
1e5766c is described below
commit 1e5766cbdd69080a7fc3881636406945fbd85752
Author: HyukjinKwon <[email protected]>
AuthorDate: Tue Feb 11 17:22:08 2020 +0900
[SPARK-29462][SQL] The data type of "array()" should be array<null>
### What changes were proposed in this pull request?
This brings https://github.com/apache/spark/pull/26324 back. It was
reverted mainly because of Hive compatibility concerns and because the
behavior in other DBMSes and the ANSI standard had not been investigated.
- PostgreSQL seems to coerce a NULL literal to the TEXT type.
- Presto seems to coerce `array() + array(1)` to an array of int.
- Hive seems to coerce `array() + array(1)` to an array of strings.
Given that, the design choices were made differently across these systems
for their own reasons. If we have to pick between the two, coercing to an
array of int seems to make much more sense.
Another investigation was made offline internally. ANSI SQL 2011,
section 6.5 "<contextually typed value specification>", seems to state:
> If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case:
>
> a) If ES simply contains ARRAY, then ET ARRAY[0].
>
> b) If ES simply contains MULTISET, then ET MULTISET.
>
> ES is effectively replaced by CAST ( ES AS DT )
From reading other related context, this appears to map to `NullType`.
Given the investigation made, choosing `null` seems correct, and we now
have Presto as a reference. Therefore, this PR proposes to bring the
change back.
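Under that reading, the declared element type of a bare `array()` falls out as `NullType`. A minimal sketch to confirm the declared type, assuming a spark-shell session where `spark` is in scope (`typeof` is a builtin added for Spark 3.0):

```scala
// Sketch: confirm the declared type of array() with no arguments.
val declared = spark.sql("SELECT typeof(array())").first().getString(0)
println(declared)  // expected with this patch: array<null>
```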
### Why are the changes needed?
When an empty array is created, it should be declared as `array<null>`.
### Does this PR introduce any user-facing change?
Yes, `array()` now creates `array<null>`, so `array(1) + array()`
correctly produces `array(1)` instead of `array("1")`.
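For instance, with `concat` as the concrete form of the `+` notation above (a sketch; the schema shown assumes this patch is applied):

```scala
// Sketch: array() no longer forces the other operand to array<string>;
// NullType widens to the int element type, so the result stays array<int>.
import org.apache.spark.sql.functions.{array, concat, lit}

val df = spark.range(1).select(concat(array(lit(1)), array()).as("xs"))
println(df.schema.head.dataType)  // expected: ArrayType(IntegerType,false)
```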
### How was this patch tested?
Tested manually
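A quick manual check of the legacy escape hatch might look like the following sketch, assuming a spark-shell session where `spark` is in scope:

```scala
// Sketch: restore the Spark 2.4 typing where array() defaults to array<string>.
import org.apache.spark.sql.functions.array

spark.conf.set("spark.sql.legacy.arrayDefaultToStringType.enabled", "true")
println(spark.range(1).select(array()).schema.head.dataType)
// expected: ArrayType(StringType,false)
```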
Closes #27521 from HyukjinKwon/SPARK-29462.
Lead-authored-by: HyukjinKwon <[email protected]>
Co-authored-by: Aman Omer <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 0045be766b949dff23ed72bd559568f17f645ffe)
Signed-off-by: HyukjinKwon <[email protected]>
---
docs/sql-migration-guide.md | 2 ++
.../sql/catalyst/expressions/complexTypeCreator.scala | 11 ++++++++++-
.../scala/org/apache/spark/sql/internal/SQLConf.scala | 9 +++++++++
.../org/apache/spark/sql/DataFrameFunctionsSuite.scala | 17 +++++++++++++----
4 files changed, 34 insertions(+), 5 deletions(-)
diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 26eb583..f98fab5 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -215,6 +215,8 @@ license: |
For example `SELECT timestamp 'tomorrow';`.
- Since Spark 3.0, the `size` function returns `NULL` for the `NULL` input. In Spark version 2.4 and earlier, this function gives `-1` for the same input. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.sizeOfNull` to `true`.
+
+ - Since Spark 3.0, when the `array` function is called without any parameters, it returns an empty array of `NullType`. In Spark version 2.4 and earlier, it returns an empty array of string type. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.arrayDefaultToStringType.enabled` to `true`.
- Since Spark 3.0, the interval literal syntax does not allow multiple from-to units anymore. For example, `SELECT INTERVAL '1-1' YEAR TO MONTH '2-2' YEAR TO MONTH'` throws parser exception.
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
index 9ce87a4..7335e30 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
@@ -23,6 +23,7 @@ import org.apache.spark.sql.catalyst.analysis.FunctionRegistry.FunctionBuilder
import org.apache.spark.sql.catalyst.expressions.codegen._
import org.apache.spark.sql.catalyst.expressions.codegen.Block._
import org.apache.spark.sql.catalyst.util._
+import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String
@@ -44,10 +45,18 @@ case class CreateArray(children: Seq[Expression]) extends Expression {
TypeUtils.checkForSameTypeInputExpr(children.map(_.dataType), s"function $prettyName")
}
+ private val defaultElementType: DataType = {
+ if (SQLConf.get.getConf(SQLConf.LEGACY_ARRAY_DEFAULT_TO_STRING)) {
+ StringType
+ } else {
+ NullType
+ }
+ }
+
override def dataType: ArrayType = {
ArrayType(
TypeCoercion.findCommonTypeDifferentOnlyInNullFlags(children.map(_.dataType))
- .getOrElse(StringType),
+ .getOrElse(defaultElementType),
containsNull = children.exists(_.nullable))
}
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index e38fe76..b79b767 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -2007,6 +2007,15 @@ object SQLConf {
.booleanConf
.createWithDefault(false)
+ val LEGACY_ARRAY_DEFAULT_TO_STRING =
+ buildConf("spark.sql.legacy.arrayDefaultToStringType.enabled")
+ .internal()
+ .doc("When set to true, it returns an empty array of string type when
the `array` " +
+ "function is called without any parameters. Otherwise, it returns an
empty " +
+ "array of `NullType`")
+ .booleanConf
+ .createWithDefault(false)
+
val TRUNCATE_TABLE_IGNORE_PERMISSION_ACL =
buildConf("spark.sql.truncateTable.ignorePermissionAcl.enabled")
.internal()
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
index 7fce036..9e9d8c3 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
@@ -3499,12 +3499,9 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
).foreach(assertValuesDoNotChangeAfterCoalesceOrUnion(_))
}
- test("SPARK-21281 use string types by default if array and map have no
argument") {
+ test("SPARK-21281 use string types by default if map have no argument") {
val ds = spark.range(1)
var expectedSchema = new StructType()
- .add("x", ArrayType(StringType, containsNull = false), nullable = false)
- assert(ds.select(array().as("x")).schema == expectedSchema)
- expectedSchema = new StructType()
.add("x", MapType(StringType, StringType, valueContainsNull = false),
nullable = false)
assert(ds.select(map().as("x")).schema == expectedSchema)
}
@@ -3577,6 +3574,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
}.getMessage
assert(nonFoldableError.contains("The 'escape' parameter must be a string literal"))
}
+
+ test("SPARK-29462: Empty array of NullType for array function with no
arguments") {
+ Seq((true, StringType), (false, NullType)).foreach {
+ case (arrayDefaultToString, expectedType) =>
+ withSQLConf(SQLConf.LEGACY_ARRAY_DEFAULT_TO_STRING.key -> arrayDefaultToString.toString) {
+ val schema = spark.range(1).select(array()).schema
+ assert(schema.nonEmpty &&
schema.head.dataType.isInstanceOf[ArrayType])
+ val actualType = schema.head.dataType.asInstanceOf[ArrayType].elementType
+ assert(actualType === expectedType)
+ }
+ }
+ }
}
object DataFrameFunctionsSuite {
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]