This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.0 by this push:
new 1e5766c [SPARK-29462][SQL] The data type of "array()" should be array<null>
1e5766c is described below
commit 1e5766cbdd69080a7fc3881636406945fbd85752
Author: HyukjinKwon <[email protected]>
AuthorDate: Tue Feb 11 17:22:08 2020 +0900
[SPARK-29462][SQL] The data type of "array()" should be array<null>
### What changes were proposed in this pull request?
This brings https://github.com/apache/spark/pull/26324 back. It was
reverted mainly because of Hive compatibility concerns and because the
behavior in other DBMSes and the ANSI standard had not been investigated.
- PostgreSQL seems to coerce a NULL literal to the TEXT type.
- Presto seems to coerce `array() + array(1)` to an array of int.
- Hive seems to coerce `array() + array(1)` to an array of strings.
Given that, the design choices were made differently across these systems
for their own reasons. If we have to pick between the two, coercing to an
array of int seems to make much more sense.
Another investigation was made offline internally. ANSI SQL 2011,
section 6.5 "<contextually typed value specification>", seems to state:
> If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case:
>
> a) If ES simply contains ARRAY, then ET ARRAY[0].
>
> b) If ES simply contains MULTISET, then ET MULTISET.
>
> ES is effectively replaced by CAST ( ES AS DT )
From reading other related context, this appears to map to `NullType`.
Given the investigation made, choosing `null` seems correct, and we now
have Presto as a reference. Therefore, this PR proposes to bring the
change back.
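Under that reading, the declared element type of a bare `array()` falls out as `NullType`. A minimal sketch to confirm the declared type, assuming a spark-shell session where `spark` is in scope (`typeof` is a builtin added for Spark 3.0):

```scala
// Sketch: confirm the declared type of array() with no arguments.
val declared = spark.sql("SELECT typeof(array())").first().getString(0)
println(declared)  // expected with this patch: array<null>
```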
### Why are the changes needed?
When an empty array is created, it should be declared as `array<null>`.
### Does this PR introduce any user-facing change?
Yes, `array()` now creates `array<null>`, so `array(1) + array()`
correctly produces `array(1)` instead of `array("1")`.
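For instance, with `concat` as the concrete form of the `+` notation above (a sketch; the schema shown assumes this patch is applied):

```scala
// Sketch: array() no longer forces the other operand to array<string>;
// NullType widens to the int element type, so the result stays array<int>.
import org.apache.spark.sql.functions.{array, concat, lit}

val df = spark.range(1).select(concat(array(lit(1)), array()).as("xs"))
println(df.schema.head.dataType)  // expected: ArrayType(IntegerType,false)
```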
### How was this patch tested?
Tested manually
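A quick manual check of the legacy escape hatch might look like the following sketch, assuming a spark-shell session where `spark` is in scope:

```scala
// Sketch: restore the Spark 2.4 typing where array() defaults to array<string>.
import org.apache.spark.sql.functions.array

spark.conf.set("spark.sql.legacy.arrayDefaultToStringType.enabled", "true")
println(spark.range(1).select(array()).schema.head.dataType)
// expected: ArrayType(StringType,false)
```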
Closes #27521 from HyukjinKwon/SPARK-29462.
Lead-authored-by: HyukjinKwon <[email protected]>
Co-authored-by: Aman Omer <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 0045be766b949dff23ed72bd559568f17f645ffe)
Signed-off-by: HyukjinKwon <[email protected]>
---
docs/sql-migration-guide.md | 2 ++
.../sql/catalyst/expressions/complexTypeCreator.scala | 11 ++++++++++-
.../scala/org/apache/spark/sql/internal/SQLConf.scala | 9 +++++++++
.../org/apache/spark/sql/DataFrameFunctionsSuite.scala | 17 +++++++++++++----
4 files changed, 34 insertions(+), 5 deletions(-)
diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 26eb583..f98fab5 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -215,6 +215,8 @@ license: |
For example `SELECT timestamp 'tomorrow';`.
- Since Spark 3.0, the `size` function returns `NULL` for the `NULL` input. In Spark version 2.4 and earlier, this function gives `-1` for the same input. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.sizeOfNull` to `true`.
+
+ - Since Spark 3.0, when the `array` function is called without any parameters, it returns an empty array of `NullType`. In Spark version 2.4 and earlier, it returns an empty array of string type. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.arrayDefaultToStringType.enabled` to `true`.
- Since Spark 3.0, the interval literal syntax does not allow multiple from-to units anymore. For example, `SELECT INTERVAL '1-1' YEAR TO MONTH '2-2' YEAR TO MONTH'` throws parser exception.
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
index 9ce87a4..7335e30 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
@@ -23,6 +23,7 @@ import org.apache.spark.sql.catalyst.analysis.FunctionRegistry.FunctionBuilder
import org.apache.spark.sql.catalyst.expressions.codegen._
import org.apache.spark.sql.catalyst.expressions.codegen.Block._
import org.apache.spark.sql.catalyst.util._
+import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String
@@ -44,10 +45,18 @@ case class CreateArray(children: Seq[Expression]) extends Expression {
TypeUtils.checkForSameTypeInputExpr(children.map(_.dataType), s"function $prettyName")
}
+ private val defaultElementType: DataType = {
+ if (SQLConf.get.getConf(SQLConf.LEGACY_ARRAY_DEFAULT_TO_STRING)) {
+ StringType
+ } else {
+ NullType
+ }
+ }
+
override def dataType: ArrayType = {
ArrayType(
TypeCoercion.findCommonTypeDifferentOnlyInNullFlags(children.map(_.dataType))
- .getOrElse(StringType),
+ .getOrElse(defaultElementType),
containsNull = children.exists(_.nullable))
}
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index e38fe76..b79b767 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -2007,6 +2007,15 @@ object SQLConf {
.booleanConf
.createWithDefault(false)
+ val LEGACY_ARRAY_DEFAULT_TO_STRING =
+ buildConf("spark.sql.legacy.arrayDefaultToStringType.enabled")
+ .internal()
+ .doc("When set to true, it returns an empty array of string type when
the `array` " +
+ "function is called without any parameters. Otherwise, it returns an
empty " +
+ "array of `NullType`")
+ .booleanConf
+ .createWithDefault(false)
+
val TRUNCATE_TABLE_IGNORE_PERMISSION_ACL =
buildConf("spark.sql.truncateTable.ignorePermissionAcl.enabled")
.internal()
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
index 7fce036..9e9d8c3 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala
@@ -3499,12 +3499,9 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
).foreach(assertValuesDoNotChangeAfterCoalesceOrUnion(_))
}
- test("SPARK-21281 use string types by default if array and map have no
argument") {
+ test("SPARK-21281 use string types by default if map have no argument") {
val ds = spark.range(1)
var expectedSchema = new StructType()
- .add("x", ArrayType(StringType, containsNull = false), nullable = false)
- assert(ds.select(array().as("x")).schema == expectedSchema)
- expectedSchema = new StructType()
.add("x", MapType(StringType, StringType, valueContainsNull = false),
nullable = false)
assert(ds.select(map().as("x")).schema == expectedSchema)
}
@@ -3577,6 +3574,18 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
}.getMessage
assert(nonFoldableError.contains("The 'escape' parameter must be a string literal"))
}
+
+ test("SPARK-29462: Empty array of NullType for array function with no
arguments") {
+ Seq((true, StringType), (false, NullType)).foreach {
+ case (arrayDefaultToString, expectedType) =>
+ withSQLConf(SQLConf.LEGACY_ARRAY_DEFAULT_TO_STRING.key -> arrayDefaultToString.toString) {
+ val schema = spark.range(1).select(array()).schema
+ assert(schema.nonEmpty &&
schema.head.dataType.isInstanceOf[ArrayType])
+ val actualType = schema.head.dataType.asInstanceOf[ArrayType].elementType
+ assert(actualType === expectedType)
+ }
+ }
+ }
}
object DataFrameFunctionsSuite {
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]