[spark] branch branch-3.2 updated (de837a0 -> 71e4c56)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from de837a0  [SPARK-36331][CORE] Add standard SQLSTATEs to error guidelines
     add 71e4c56  [SPARK-36367][3.2][PYTHON] Partially backport to avoid unexpected error with pandas 1.3

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/groupby.py | 18 +++---
 python/pyspark/pandas/series.py  |  4 ++--
 2 files changed, 13 insertions(+), 9 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (63517eb -> 8cb9cf3)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 63517eb  [SPARK-36331][CORE] Add standard SQLSTATEs to error guidelines
     add 8cb9cf3  [SPARK-36345][SPARK-36367][INFRA][PYTHON] Disable tests failed by the incompatible behavior of pandas 1.3

No new revisions were added by this update.

Summary of changes:
 .github/workflows/build_and_test.yml               |  4 +-
 python/pyspark/pandas/groupby.py                   | 18 +++--
 python/pyspark/pandas/series.py                    |  4 +-
 .../tests/data_type_ops/test_categorical_ops.py    |  6 +-
 python/pyspark/pandas/tests/indexes/test_base.py   | 76 +++-
 .../pyspark/pandas/tests/indexes/test_category.py  |  5 +-
 python/pyspark/pandas/tests/test_categorical.py    | 82 ++
 python/pyspark/pandas/tests/test_expanding.py      | 51 --
 .../test_ops_on_diff_frames_groupby_expanding.py   | 13 ++--
 .../test_ops_on_diff_frames_groupby_rolling.py     | 14 ++--
 python/pyspark/pandas/tests/test_rolling.py        | 52 --
 python/pyspark/pandas/tests/test_series.py         | 16 +++--
 12 files changed, 227 insertions(+), 114 deletions(-)
[spark] branch branch-3.2 updated (9eec11b -> de837a0)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 9eec11b  [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode
     add de837a0  [SPARK-36331][CORE] Add standard SQLSTATEs to error guidelines

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/README.md | 169 +++-
 1 file changed, 165 insertions(+), 4 deletions(-)
[spark] branch master updated (c20af53 -> 63517eb)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from c20af53  [SPARK-36373][SQL] DecimalPrecision only add necessary cast
     add 63517eb  [SPARK-36331][CORE] Add standard SQLSTATEs to error guidelines

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/README.md | 169 +++-
 1 file changed, 165 insertions(+), 4 deletions(-)
[spark] branch master updated: [SPARK-36373][SQL] DecimalPrecision only add necessary cast
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new c20af53  [SPARK-36373][SQL] DecimalPrecision only add necessary cast
c20af53 is described below

commit c20af535803a7250fef047c2bf0fe30be242369d
Author: Yuming Wang
AuthorDate: Tue Aug 3 08:12:54 2021 +0800

    [SPARK-36373][SQL] DecimalPrecision only add necessary cast

    ### What changes were proposed in this pull request?

    This PR makes `DecimalPrecision` add only the casts that are actually necessary, similar to [`ImplicitTypeCasts`](https://github.com/apache/spark/blob/96c2919988ddf78d104103876d8d8221e8145baa/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L675-L678). For example:
    ```
    EqualTo(AttributeReference("d1", DecimalType(5, 2))(), AttributeReference("d2", DecimalType(2, 1))())
    ```
    It will add a useless cast to _d1_:
    ```
    (cast(d1#6 as decimal(5,2)) = cast(d2#7 as decimal(5,2)))
    ```

    ### Why are the changes needed?

    1. Avoid adding unnecessary casts, even though they would be removed by `SimplifyCasts` later.
    2. I'm trying to add an extended rule similar to `PullOutGroupingExpressions`. The current behavior introduces an additional alias, for example: `cast(d1 as decimal(5,2)) as cast_d1`.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Unit test.

    Closes #33602 from wangyum/SPARK-36373.
Authored-by: Yuming Wang
Signed-off-by: Yuming Wang
---
 .../org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala     | 4 +++-
 .../apache/spark/sql/catalyst/analysis/DecimalPrecisionSuite.scala    | 3 +++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala
index bf128cd..b112641 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala
@@ -204,7 +204,9 @@ object DecimalPrecision extends TypeCoercionRule {
     case b @ BinaryComparison(e1 @ DecimalType.Expression(p1, s1),
         e2 @ DecimalType.Expression(p2, s2)) if p1 != p2 || s1 != s2 =>
       val resultType = widerDecimalType(p1, s1, p2, s2)
-      b.makeCopy(Array(Cast(e1, resultType), Cast(e2, resultType)))
+      val newE1 = if (e1.dataType == resultType) e1 else Cast(e1, resultType)
+      val newE2 = if (e2.dataType == resultType) e2 else Cast(e2, resultType)
+      b.makeCopy(Array(newE1, newE2))
 }

 /**
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecisionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecisionSuite.scala
index 834f38a..698d2d5 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecisionSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecisionSuite.scala
@@ -59,6 +59,9 @@ class DecimalPrecisionSuite extends AnalysisTest with BeforeAndAfter {
     val comparison = analyzer.execute(plan).collect {
       case Project(Alias(e: BinaryComparison, _) :: Nil, _) => e
     }.head
+    // Only add necessary cast.
+    assert(comparison.left.children.forall(_.dataType !== expectedType))
+    assert(comparison.right.children.forall(_.dataType !== expectedType))
     assert(comparison.left.dataType === expectedType)
     assert(comparison.right.dataType === expectedType)
   }
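The idea behind the commit above can be illustrated with a minimal sketch in plain Python (toy tuples, not Spark's `DecimalType`; the names `wider_decimal` and `needs_cast` are illustrative, and precision capping is ignored): comparing `decimal(5,2)` with `decimal(2,1)` widens to `decimal(5,2)`, so only the second operand actually needs a cast.

```python
# Toy sketch of DecimalPrecision's "only add necessary cast" behavior.
# A decimal type is modeled as a (precision, scale) tuple.

def wider_decimal(p1, s1, p2, s2):
    """Wider of decimal(p1,s1) and decimal(p2,s2): keep the larger scale and
    enough integral digits for both sides (precision capping ignored here)."""
    scale = max(s1, s2)
    integral = max(p1 - s1, p2 - s2)
    return (integral + scale, scale)

def needs_cast(operand_type, result_type):
    """After the patch, a cast is inserted only when the type actually changes."""
    return operand_type != result_type

# d1: decimal(5,2), d2: decimal(2,1) -> wider type is decimal(5,2),
# so d1 keeps its type while d2 gets a cast.
wider = wider_decimal(5, 2, 2, 1)
```

Before the patch both sides were unconditionally wrapped in `Cast`; the `needs_cast` check is what avoids the `cast(d1#6 as decimal(5,2))` no-op seen in the commit message.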
[spark] branch master updated (0bbcbc6 -> 7a27f8a)
This is an automated email from the ASF dual-hosted git repository.

viirya pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 0bbcbc6  [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode
     add 7a27f8a  [SPARK-36137][SQL] HiveShim should fallback to getAllPartitionsOf even if directSQL is enabled in remote HMS

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/errors/QueryExecutionErrors.scala    |  2 +-
 .../org/apache/spark/sql/internal/SQLConf.scala    | 13 +++
 .../apache/spark/sql/hive/client/HiveShim.scala    | 24 +++
 .../hive/client/HivePartitionFilteringSuite.scala  | 27 +-
 4 files changed, 49 insertions(+), 17 deletions(-)
[GitHub] [spark-website] dongjoon-hyun commented on pull request #351: updating local k8s/minikube testing instructions
dongjoon-hyun commented on pull request #351:
URL: https://github.com/apache/spark-website/pull/351#issuecomment-891219026

   +1, LGTM. Thank you.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [spark-website] asfgit closed pull request #351: updating local k8s/minikube testing instructions
asfgit closed pull request #351:
URL: https://github.com/apache/spark-website/pull/351
[spark-website] branch asf-site updated: updating local k8s/minikube testing instructions
This is an automated email from the ASF dual-hosted git repository.

shaneknapp pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new d60611b  updating local k8s/minikube testing instructions
d60611b is described below

commit d60611b5c464db0fb99ca072a9d8b55e824ca7c2
Author: shane knapp
AuthorDate: Mon Aug 2 10:15:46 2021 -0700

    updating local k8s/minikube testing instructions

    a small update to the k8s/minikube integration test instructions

    Author: shane knapp

    Closes #351 from shaneknapp/updating-k8s-docs.
---
 developer-tools.md        | 9 ++---
 site/developer-tools.html | 9 ++---
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/developer-tools.md b/developer-tools.md
index 9551533..bf3ee6e 100644
--- a/developer-tools.md
+++ b/developer-tools.md
@@ -169,8 +169,8 @@ Please check other available options via `python/run-tests[-with-coverage] --hel
 If you have made changes to the K8S bindings in Apache Spark, it would behoove you to test locally before submitting a PR. This is relatively simple to do, but it will require a local (to you) installation of [minikube](https://kubernetes.io/docs/setup/minikube/). Due to how minikube interacts with the host system, please be sure to set things up as follows:

-- minikube version v1.7.3 (or greater)
-- You must use a VM driver! Running minikube with the `--vm-driver=none` option requires that the user launching minikube/k8s have root access. Our Jenkins workers use the [kvm2](https://minikube.sigs.k8s.io/docs/drivers/kvm2/) drivers. More details [here](https://minikube.sigs.k8s.io/docs/drivers/).
+- minikube version v1.18.1 (or greater)
+- You must use a VM driver! Running minikube with the `--vm-driver=none` option requires that the user launching minikube/k8s have root access, which could impact how the tests are run. Our Jenkins workers use the default [docker](https://minikube.sigs.k8s.io/docs/drivers/docker/) drivers. More details [here](https://minikube.sigs.k8s.io/docs/drivers/).
 - kubernetes version v1.17.3 (can be set by executing `minikube config set kubernetes-version v1.17.3`)
 - the current kubernetes context must be minikube's default context (called 'minikube'). This can be selected by `minikube kubectl -- config use-context minikube`. This is only needed when after minikube is started another kubernetes context is selected.

@@ -196,8 +196,11 @@ PVC_TMP_DIR=$(mktemp -d)
 export PVC_TESTS_HOST_PATH=$PVC_TMP_DIR
 export PVC_TESTS_VM_PATH=$PVC_TMP_DIR

-minikube --vm-driver= start --memory 6000 --cpus 8
 minikube config set kubernetes-version v1.17.3
+minikube --vm-driver= start --memory 6000 --cpus 8
+
+# for macos only (see https://github.com/apache/spark/pull/32793):
+# minikube ssh "sudo useradd spark -u 185 -g 0 -m -s /bin/bash"

 minikube mount ${PVC_TESTS_HOST_PATH}:${PVC_TESTS_VM_PATH} --9p-version=9p2000.L --gid=0 --uid=185 &; MOUNT_PID=$!

diff --git a/site/developer-tools.html b/site/developer-tools.html
index 5558945..81d64c7 100644
--- a/site/developer-tools.html
+++ b/site/developer-tools.html
@@ -348,8 +348,8 @@ Generating HTML files for PySpark coverage under /.../spark/python/test_coverage
 If you have made changes to the K8S bindings in Apache Spark, it would behoove you to test locally before submitting a PR. This is relatively simple to do, but it will require a local (to you) installation of <a href="https://kubernetes.io/docs/setup/minikube/">minikube</a>. Due to how minikube interacts with the host system, please be sure to set things up as follows:

- minikube version v1.7.3 (or greater)
- You must use a VM driver! Running minikube with the --vm-driver=none option requires that the user launching minikube/k8s have root access. Our Jenkins workers use the <a href="https://minikube.sigs.k8s.io/docs/drivers/kvm2/">kvm2</a> drivers. More details <a href="https://minikube.sigs.k8s.io/docs/drivers/">here</a>.
+ minikube version v1.18.1 (or greater)
+ You must use a VM driver! Running minikube with the --vm-driver=none option requires that the user launching minikube/k8s have root access, which could impact how the tests are run. Our Jenkins workers use the default <a href="https://minikube.sigs.k8s.io/docs/drivers/docker/">docker</a> drivers. More details <a href="https://minikube.sigs.k8s.io/docs/drivers/">here</a>.
 kubernetes version v1.17.3 (can be set by executing minikube config set kubernetes-version v1.17.3)
 the current kubernetes context must be minikube's default context (called 'minikube'). This can be selected by minikube kubectl -- config use-context minikube. This is only needed when after minikube is started another kubernetes context is selected.

@@ -374,8 +374,11 @@ export TARBALL_TO_TEST=($(pwd)/spark-*${DATE}-${REVISION}.tgz)
 export PVC_TESTS_HOST_PATH=$PVC_TMP_DIR
 export PVC_TESTS_VM_PATH=$PVC_TMP_DIR

-minikube
[GitHub] [spark-website] shaneknapp opened a new pull request #351: updating local k8s/minikube testing instructions
shaneknapp opened a new pull request #351:
URL: https://github.com/apache/spark-website/pull/351

   a small update to the k8s/minikube integration test instructions
[spark] branch branch-3.2 updated: [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 9eec11b  [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode
9eec11b is described below

commit 9eec11b9567c033b10a6d70340e63d6270b481e8
Author: Hyukjin Kwon
AuthorDate: Mon Aug 2 10:01:12 2021 -0700

    [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode

    This PR proposes to fail properly so the JSON parser can proceed and parse the input with the permissive mode. Previously, we passed `null`s through as they were; the root `InternalRow`s became `null`s, and that caused the query to fail even with permissive mode on. Now, we fail explicitly if `null` is passed when the input array contains `null`.

    Note that this is consistent with non-array JSON input:

    **Permissive mode:**
    ```scala
    spark.read.json(Seq("""{"a": "str"}""", """null""").toDS).collect()
    ```
    ```
    res0: Array[org.apache.spark.sql.Row] = Array([str], [null])
    ```

    **Failfast mode:**
    ```scala
    spark.read.option("mode", "failfast").json(Seq("""{"a": "str"}""", """null""").toDS).collect()
    ```
    ```
    org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
        at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
        at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
    ```

    This makes the permissive mode proceed and parse without throwing an exception.

    **Permissive mode:**
    ```scala
    spark.read.json(Seq("""[{"a": "str"}, null]""").toDS).collect()
    ```
    Before:
    ```
    java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    ```
    After:
    ```
    res0: Array[org.apache.spark.sql.Row] = Array([null])
    ```
    NOTE that this behaviour is consistent when the JSON object is malformed:
    ```scala
    spark.read.schema("a int").json(Seq("""[{"a": 123}, {123123}, {"a": 123}]""").toDS).collect()
    ```
    ```
    res0: Array[org.apache.spark.sql.Row] = Array([null])
    ```
    Since we're parsing _one_ JSON array, related records all fail together.

    **Failfast mode:**
    ```scala
    spark.read.option("mode", "failfast").json(Seq("""[{"a": "str"}, null]""").toDS).collect()
    ```
    Before:
    ```
    java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    ```
    After:
    ```
    org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
        at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
        at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
    ```

    Manually tested, and a unit test was added.

    Closes #33608 from HyukjinKwon/SPARK-36379.

Authored-by: Hyukjin Kwon
Signed-off-by: Dongjoon Hyun
(cherry picked from commit 0bbcbc65080cd67a9997f49906d9d48fdf21db10)
Signed-off-by: Dongjoon Hyun
---
 .../org/apache/spark/sql/catalyst/json/JacksonParser.scala  |  9 ++---
 .../spark/sql/execution/datasources/json/JsonSuite.scala    | 14 ++
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
index 8a1191c..04a0f1a 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
@@ -108,7 +108,7 @@ class JacksonParser(
 // List([str_a_2,null], [null,str_b_3])
[spark] branch master updated: [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0bbcbc6  [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode
0bbcbc6 is described below

commit 0bbcbc65080cd67a9997f49906d9d48fdf21db10
Author: Hyukjin Kwon
AuthorDate: Mon Aug 2 10:01:12 2021 -0700

    [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode

    ### What changes were proposed in this pull request?

    This PR proposes to fail properly so the JSON parser can proceed and parse the input with the permissive mode. Previously, we passed `null`s through as they were; the root `InternalRow`s became `null`s, and that caused the query to fail even with permissive mode on. Now, we fail explicitly if `null` is passed when the input array contains `null`.

    Note that this is consistent with non-array JSON input:

    **Permissive mode:**
    ```scala
    spark.read.json(Seq("""{"a": "str"}""", """null""").toDS).collect()
    ```
    ```
    res0: Array[org.apache.spark.sql.Row] = Array([str], [null])
    ```

    **Failfast mode:**
    ```scala
    spark.read.option("mode", "failfast").json(Seq("""{"a": "str"}""", """null""").toDS).collect()
    ```
    ```
    org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
        at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
        at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
    ```

    ### Why are the changes needed?

    To make the permissive mode proceed and parse without throwing an exception.

    ### Does this PR introduce _any_ user-facing change?

    **Permissive mode:**
    ```scala
    spark.read.json(Seq("""[{"a": "str"}, null]""").toDS).collect()
    ```
    Before:
    ```
    java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    ```
    After:
    ```
    res0: Array[org.apache.spark.sql.Row] = Array([null])
    ```
    NOTE that this behaviour is consistent when the JSON object is malformed:
    ```scala
    spark.read.schema("a int").json(Seq("""[{"a": 123}, {123123}, {"a": 123}]""").toDS).collect()
    ```
    ```
    res0: Array[org.apache.spark.sql.Row] = Array([null])
    ```
    Since we're parsing _one_ JSON array, related records all fail together.

    **Failfast mode:**
    ```scala
    spark.read.option("mode", "failfast").json(Seq("""[{"a": "str"}, null]""").toDS).collect()
    ```
    Before:
    ```
    java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    ```
    After:
    ```
    org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
        at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
        at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
    ```

    ### How was this patch tested?

    Manually tested, and a unit test was added.

    Closes #33608 from HyukjinKwon/SPARK-36379.

Authored-by: Hyukjin Kwon
Signed-off-by: Dongjoon Hyun
---
 .../org/apache/spark/sql/catalyst/json/JacksonParser.scala  |  9 ++---
 .../spark/sql/execution/datasources/json/JsonSuite.scala    | 14 ++
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
index 54cf251..dfa746f 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
@@
[spark] branch master updated: [SPARK-35430][K8S] Switch on "PVs with local storage" integration test on Docker driver
This is an automated email from the ASF dual-hosted git repository.

shaneknapp pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 7b90fd2  [SPARK-35430][K8S] Switch on "PVs with local storage" integration test on Docker driver
7b90fd2 is described below

commit 7b90fd2ca79b9a1fec5fca0bdcc169c7962ad880
Author: attilapiros
AuthorDate: Mon Aug 2 09:17:29 2021 -0700

    [SPARK-35430][K8S] Switch on "PVs with local storage" integration test on Docker driver

    ### What changes were proposed in this pull request?

    Switching back on the "PVs with local storage" integration test on the Docker driver.

    I have analyzed why this test was failing on my machine (I hope the root cause of the problem is OS agnostic). It failed because the host directory was mounted into the Minikube node using `--uid=185` (the Spark user's user id):
    ```
    $ minikube mount ${PVC_TESTS_HOST_PATH}:${PVC_TESTS_VM_PATH} --9p-version=9p2000.L --gid=0 --uid=185 &; MOUNT_PID=$!
    ```
    which refers to a nonexistent user. See the number of occurrences of 185 in "/etc/passwd":
    ```
    $ minikube ssh "grep -c 185 /etc/passwd"
    0
    ```
    This leads to a permission denied. Skipping `--uid=185` won't help, although the path will be listable before the test execution:
    ```
    ╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/spark ‹SPARK-35430*›
    ╰─$ Mounting host path /var/folders/t_/fr_vqcyx23vftk81ftz1k5hwgn/T/tmp.k9X4Gecv into VM as /var/folders/t_/fr_vqcyx23vftk81ftz1k5hwgn/T/tmp.k9X4Gecv ...
        ▪ Mount type:
        ▪ User ID: docker
        ▪ Group ID: 0
        ▪ Version: 9p2000.L
        ▪ Message Size: 262144
        ▪ Permissions: 755 (-rwxr-xr-x)
        ▪ Options: map[]
        ▪ Bind Address: 127.0.0.1:51740
    Userspace file server: ufs starting
    ╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/spark ‹SPARK-35430*›
    ╰─$ minikube ssh "ls /var/folders/t_/fr_vqcyx23vftk81ftz1k5hwgn/T/tmp.k9X4Gecv"
    ╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/spark ‹SPARK-35430*›
    ╰─$
    ```
    But the test will fail, and after its execution `dmesg` shows the following error:
    ```
    [13670.493359] bpfilter: Loaded bpfilter_umh pid 66153
    [13670.493363] bpfilter: write fail -32
    [13670.530737] bpfilter: Loaded bpfilter_umh pid 66155
    ...
    ```
    This `bpfilter` is a firewall module, and we are back to a permission denied when we want to list the mounted directory.

    The solution is to add a spark user with uid 185 when minikube is started. **So this must be added to the Jenkins job (and the mount should use --gid=0 --uid=185)**:
    ```
    $ minikube ssh "sudo useradd spark -u 185 -g 0 -m -s /bin/bash"
    ```

    ### Why are the changes needed?

    This integration test is needed to validate the PVs feature.

    ### Does this PR introduce _any_ user-facing change?

    No. It is just testing.

    ### How was this patch tested?

    Running the test locally:
    ```
    KubernetesSuite:
    - Run SparkPi with no resources
    - Run SparkPi with a very long application name.
    - Use SparkLauncher.NO_RESOURCE
    - Run SparkPi with a master URL without a scheme.
    - Run SparkPi with an argument.
    - Run SparkPi with custom labels, annotations, and environment variables.
    - All pods have the same service account by default
    - Run extraJVMOptions check on driver
    - Run SparkRemoteFileTest using a remote data file
    - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
    - Run SparkPi with env and mount secrets.
    - Run PySpark on simple pi.py example
    - Run PySpark to test a pyfiles example
    - Run PySpark with memory customization
    - Run in client mode.
    - Start pod creation from template
    - PVs with local storage
    ```

    The "PVs with local storage" test was successful, but during the next test, `Launcher client dependencies`, minio stops the test executions on Mac (only on Mac):
    ```
    21/06/29 04:33:32.449 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: Starting tunnel for service minio-s3.
    21/06/29 04:33:33.425 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: |--|--|-||
    21/06/29 04:33:33.426 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: |NAMESPACE | NAME | TARGET PORT | URL |
    21/06/29 04:33:33.426 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: |--|--|-||
    21/06/29 04:33:33.426 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: |
[spark] branch branch-3.2 updated: [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new ea559ad  [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name
ea559ad is described below

commit ea559adc2ec1664939fcf1a23a194d4055ad9596
Author: Angerszh
AuthorDate: Tue Aug 3 00:08:13 2021 +0800

    [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name

    ### What changes were proposed in this pull request?

    Without this patch, the added UT fails as below:
    ```
    [info] - SHOW TABLES V2: SPARK-36086: CollapseProject project replace alias should use origin column name *** FAILED *** (4 seconds, 935 milliseconds)
    [info]   java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.CollapseProject in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.
    [info]   at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1217)
    [info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229)
    [info]   at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
    [info]   at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
    [info]   at scala.collection.immutable.List.foldLeft(List.scala:91)
    [info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
    [info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
    [info]   at scala.collection.immutable.List.foreach(List.scala:431)
    [info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
    [info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
    [info]   at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
    ```
    When CollapseProject replaces an attribute with its alias, it should keep the original column name.

    ### Why are the changes needed?

    Fix bug.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Added UT.

    Closes #33576 from AngersZh/SPARK-36086.

Authored-by: Angerszh
Signed-off-by: Wenchen Fan
(cherry picked from commit f3173956cbd64c056424b743aff8d17dd7c61fd7)
Signed-off-by: Wenchen Fan
---
 .../org/apache/spark/sql/catalyst/expressions/AliasHelper.scala   | 2 +-
 .../apache/spark/sql/catalyst/expressions/namedExpressions.scala  | 8
 .../spark/sql/catalyst/optimizer/CollapseProjectSuite.scala       | 9 +
 .../approved-plans-v1_4/q5.sf100/explain.txt                      | 8
 .../approved-plans-v1_4/q5.sf100/simplified.txt                   | 6 +++---
 .../tpcds-plan-stability/approved-plans-v1_4/q5/explain.txt       | 8
 .../tpcds-plan-stability/approved-plans-v1_4/q5/simplified.txt    | 6 +++---
 7 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
index 1f3f762..0007d38 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
@@ -72,7 +72,7 @@ trait AliasHelper {
     // Use transformUp to prevent infinite recursion when the replacement expression
     // redefines the same ExprId,
     trimNonTopLevelAliases(expr.transformUp {
-      case a: Attribute => aliasMap.getOrElse(a, a)
+      case a: Attribute => aliasMap.get(a).map(_.withName(a.name)).getOrElse(a)
     }).asInstanceOf[NamedExpression]
   }

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
index ae2c66c..71f193e 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
@@ -188,6 +188,14 @@ case class Alias(child: Expression, name: String)(
     }
   }

+  def withName(newName: String): NamedExpression = {
+    Alias(child, newName)(
+      exprId = exprId,
+      qualifier = qualifier,
+      explicitMetadata = explicitMetadata,
+      nonInheritableMetadataKeys = nonInheritableMetadataKeys)
+  }
+
   def newInstance(): NamedExpression =
     Alias(child, name)(
       qualifier = qualifier,
diff --git
[spark] branch master updated: [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f317395 [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name f317395 is described below commit f3173956cbd64c056424b743aff8d17dd7c61fd7 Author: Angerszh AuthorDate: Tue Aug 3 00:08:13 2021 +0800 [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name ### What changes were proposed in this pull request? Without this patch, the added UT fails as below ``` [info] - SHOW TABLES V2: SPARK-36086: CollapseProject project replace alias should use origin column name *** FAILED *** (4 seconds, 935 milliseconds) [info] java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.CollapseProject in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken. 
[info] at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1217) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229) [info] at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) [info] at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) [info] at scala.collection.immutable.List.foldLeft(List.scala:91) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179) [info] at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88) ``` CollapseProject project replace alias should use origin column name ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #33576 from AngersZh/SPARK-36086. 
Authored-by: Angerszh Signed-off-by: Wenchen Fan --- .../org/apache/spark/sql/catalyst/expressions/AliasHelper.scala | 2 +- .../apache/spark/sql/catalyst/expressions/namedExpressions.scala | 8 .../spark/sql/catalyst/optimizer/CollapseProjectSuite.scala | 9 + .../approved-plans-v1_4/q5.sf100/explain.txt | 8 .../approved-plans-v1_4/q5.sf100/simplified.txt | 6 +++--- .../tpcds-plan-stability/approved-plans-v1_4/q5/explain.txt | 8 .../tpcds-plan-stability/approved-plans-v1_4/q5/simplified.txt | 6 +++--- 7 files changed, 32 insertions(+), 15 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala index 1f3f762..0007d38 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala @@ -72,7 +72,7 @@ trait AliasHelper { // Use transformUp to prevent infinite recursion when the replacement expression // redefines the same ExprId, trimNonTopLevelAliases(expr.transformUp { - case a: Attribute => aliasMap.getOrElse(a, a) + case a: Attribute => aliasMap.get(a).map(_.withName(a.name)).getOrElse(a) }).asInstanceOf[NamedExpression] } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala index ae2c66c..71f193e 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala @@ -188,6 +188,14 @@ case class Alias(child: Expression, name: String)( } } + def withName(newName: String): NamedExpression = { +Alias(child, newName)( + exprId = exprId, + qualifier = qualifier, + explicitMetadata = explicitMetadata, + 
nonInheritableMetadataKeys = nonInheritableMetadataKeys) + } + def newInstance(): NamedExpression = Alias(child, name)( qualifier = qualifier, diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
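The one-line `AliasHelper` change above can be illustrated with a small stand-alone sketch. This is a Python model with hypothetical names (the real code is the Scala `aliasMap.get(a).map(_.withName(a.name))` in `AliasHelper.scala`): when CollapseProject inlines an alias from the inner projection into the outer one, the replacement must keep the outer attribute's own name rather than leaking the inner alias's name.

```python
# Minimal Python model of the AliasHelper fix (hypothetical names; the real
# change is in the Scala file AliasHelper.scala). An Alias pairs a child
# expression with a display name; when an attribute is replaced by its
# aliased child, the origin column name must be preserved.
from dataclasses import dataclass


@dataclass(frozen=True)
class Alias:
    child: str  # the replacement expression, kept as a string for the sketch
    name: str   # the display name

    def with_name(self, new_name: str) -> "Alias":
        # Mirrors the new Alias.withName: same child, new display name.
        return Alias(self.child, new_name)


def replace_attribute(attr_name: str, alias_map: dict):
    """Old behavior: aliasMap.getOrElse(a, a), which could rename the output
    column; fixed behavior: re-apply the origin column name."""
    alias = alias_map.get(attr_name)
    if alias is None:
        return attr_name
    return alias.with_name(attr_name)  # keep the origin column name


alias_map = {"out_col": Alias("a + b", "inner_name")}
print(replace_attribute("out_col", alias_map).name)  # out_col
print(replace_attribute("untouched", alias_map))     # untouched
```

With the old `getOrElse` behavior the substituted expression would surface as `inner_name`, which is what broke the plan's structural integrity in the UT above.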
[spark] branch master updated (2f70077 -> 366f7fe)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 2f70077 [SPARK-36224][SQL] Use Void as the type name of NullType add 366f7fe [SPARK-36382][WEBUI] Remove noisy footer from the summary table for metrics No new revisions were added by this update. Summary of changes: core/src/main/resources/org/apache/spark/ui/static/stagepage.js | 1 + 1 file changed, 1 insertion(+) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.2 updated: [SPARK-36224][SQL] Use Void as the type name of NullType
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new e26cb96 [SPARK-36224][SQL] Use Void as the type name of NullType e26cb96 is described below commit e26cb968bdaff1cce1d5c050226eac1d01e3e947 Author: Linhong Liu AuthorDate: Mon Aug 2 23:19:54 2021 +0800 [SPARK-36224][SQL] Use Void as the type name of NullType ### What changes were proposed in this pull request? Change `NullType.simpleString` to "void" to set "void" as the formal type name of `NullType` ### Why are the changes needed? This PR is intended to address the type name discussion in PR #28833. Here are the reasons: 1. The type name of NullType is displayed everywhere, e.g. schema string, error message, document. Hence it's not possible to hide it from users; we have to choose a proper name 2. "void" is widely used as the type name of "NULL", e.g. Hive, pgSQL 3. Changing to "void" can enable the round trip of `toDDL`/`fromDDL` for NullType (i.e. make `from_json(col, schema.toDDL)` work) ### Does this PR introduce _any_ user-facing change? Yes, the type name of "NULL" is changed from "null" to "void". For example: ``` scala> sql("select null as a, 1 as b").schema.catalogString res5: String = struct<a:void,b:int> ``` ### How was this patch tested? existing test cases Closes #33437 from linhongliu-db/SPARK-36224-void-type-name. 
Authored-by: Linhong Liu Signed-off-by: Wenchen Fan (cherry picked from commit 2f700773c2e8fac26661d0aa8024253556a921ba) Signed-off-by: Wenchen Fan --- python/pyspark/sql/tests/test_types.py | 3 +-- python/pyspark/sql/types.py | 4 +++- .../scala/org/apache/spark/sql/types/DataType.scala | 2 ++ .../scala/org/apache/spark/sql/types/NullType.scala | 2 ++ .../org/apache/spark/sql/types/DataTypeSuite.scala | 6 ++ .../sql-functions/sql-expression-schema.md | 6 +++--- .../sql-tests/results/ansi/literals.sql.out | 2 +- .../sql-tests/results/ansi/string-functions.sql.out | 4 ++-- .../sql-tests/results/inline-table.sql.out | 2 +- .../resources/sql-tests/results/literals.sql.out| 2 +- .../sql-tests/results/misc-functions.sql.out| 4 ++-- .../sql-tests/results/postgreSQL/select.sql.out | 4 ++-- .../sql-tests/results/postgreSQL/text.sql.out | 6 +++--- .../results/sql-compatibility-functions.sql.out | 6 +++--- .../results/table-valued-functions.sql.out | 2 +- .../sql-tests/results/udf/udf-inline-table.sql.out | 2 +- .../apache/spark/sql/FileBasedDataSourceSuite.scala | 2 +- .../SparkExecuteStatementOperation.scala| 1 - .../spark/sql/hive/client/HiveClientImpl.scala | 21 + .../spark/sql/hive/execution/HiveDDLSuite.scala | 12 ++-- .../spark/sql/hive/orc/HiveOrcSourceSuite.scala | 2 +- 21 files changed, 43 insertions(+), 52 deletions(-) diff --git a/python/pyspark/sql/tests/test_types.py b/python/pyspark/sql/tests/test_types.py index eb4caf0..33fe785 100644 --- a/python/pyspark/sql/tests/test_types.py +++ b/python/pyspark/sql/tests/test_types.py @@ -496,8 +496,7 @@ class TypesTests(ReusedSQLTestCase): def test_parse_datatype_string(self): from pyspark.sql.types import _all_atomic_types, _parse_datatype_string for k, t in _all_atomic_types.items(): -if t != NullType: -self.assertEqual(t(), _parse_datatype_string(k)) +self.assertEqual(t(), _parse_datatype_string(k)) self.assertEqual(IntegerType(), _parse_datatype_string("int")) self.assertEqual(DecimalType(1, 1), 
_parse_datatype_string("decimal(1 ,1)")) self.assertEqual(DecimalType(10, 1), _parse_datatype_string("decimal( 10,1 )")) diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py index 4b5632b..5e3398b 100644 --- a/python/pyspark/sql/types.py +++ b/python/pyspark/sql/types.py @@ -107,7 +107,9 @@ class NullType(DataType, metaclass=DataTypeSingleton): The data type representing None, used for the types that cannot be inferred. """ -pass +@classmethod +def typeName(cls): +return 'void' class AtomicType(DataType): diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala index ff6a49a..585045d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala @@ -195,6 +195,8 @@ object DataType { case FIXED_DECIMAL(precision, scale) => DecimalType(precision.toInt, scale.toInt) case CHAR_TYPE(length) => CharType(length.toInt)
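The Python side of the change quoted in the diff above is easy to model outside Spark. A minimal sketch of the `typeName` override in `python/pyspark/sql/types.py`, using simplified stand-in classes rather than the real pyspark ones:

```python
# Stand-alone sketch of the pyspark.sql.types change: NullType overrides
# typeName() to report "void". These classes are simplified stand-ins for
# DataType and its subclasses, not the real pyspark implementations.
class DataType:
    @classmethod
    def type_name(cls) -> str:
        # Default: derive the name from the class name, as pyspark does
        # (e.g. IntegerType -> "integer").
        name = cls.__name__.lower()
        return name[:-len("type")] if name.endswith("type") else name


class IntegerType(DataType):
    pass


class NullType(DataType):
    @classmethod
    def type_name(cls) -> str:
        # The commit hard-codes "void" instead of the derived "null".
        return "void"


print(IntegerType.type_name())  # integer
print(NullType.type_name())     # void
```

Hard-coding the override (rather than renaming the class) keeps `NullType` as the public class name while making the string form round-trippable through `_parse_datatype_string("void")`.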
[spark] branch master updated (a98d919 -> 2f70077)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from a98d919 [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum add 2f70077 [SPARK-36224][SQL] Use Void as the type name of NullType No new revisions were added by this update. Summary of changes: python/pyspark/sql/tests/test_types.py | 3 +-- python/pyspark/sql/types.py | 4 +++- .../scala/org/apache/spark/sql/types/DataType.scala | 2 ++ .../scala/org/apache/spark/sql/types/NullType.scala | 2 ++ .../org/apache/spark/sql/types/DataTypeSuite.scala | 6 ++ .../sql-functions/sql-expression-schema.md | 6 +++--- .../sql-tests/results/ansi/literals.sql.out | 2 +- .../sql-tests/results/ansi/string-functions.sql.out | 4 ++-- .../sql-tests/results/inline-table.sql.out | 2 +- .../resources/sql-tests/results/literals.sql.out| 2 +- .../sql-tests/results/misc-functions.sql.out| 4 ++-- .../sql-tests/results/postgreSQL/select.sql.out | 4 ++-- .../sql-tests/results/postgreSQL/text.sql.out | 6 +++--- .../results/sql-compatibility-functions.sql.out | 6 +++--- .../results/table-valued-functions.sql.out | 2 +- .../sql-tests/results/udf/udf-inline-table.sql.out | 2 +- .../apache/spark/sql/FileBasedDataSourceSuite.scala | 2 +- .../SparkExecuteStatementOperation.scala| 1 - .../spark/sql/hive/client/HiveClientImpl.scala | 21 + .../spark/sql/hive/execution/HiveDDLSuite.scala | 12 ++-- .../spark/sql/hive/orc/HiveOrcSourceSuite.scala | 2 +- 21 files changed, 43 insertions(+), 52 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum
This is an automated email from the ASF dual-hosted git repository. mridulm80 pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new df43300 [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum df43300 is described below commit df433002277911c2e66bcf00a17b636a317b4a78 Author: yi.wu AuthorDate: Mon Aug 2 09:58:36 2021 -0500 [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum ### What changes were proposed in this pull request? This PR adds support to diagnose shuffle data corruption. Basically, the diagnosis mechanism works like this: The shuffle reader would calculate the checksum (c1) for the corrupted shuffle block and send it to the server where the block is stored. At the server, it would read back the checksum (c2) that is stored in the checksum file and recalculate the checksum (c3) for the corresponding shuffle block. Then, if c2 != c3, we suspect the corruption is caused by a disk issue. Otherwise, if c1 != c3, we suspect the corruption is caused by a network issue. Otherwise, the checksum verifies pa [...] After the shuffle reader receives the diagnosis response, it'd take action based on the type of cause. Only in the case of a network issue would we give a retry. Otherwise, we'd throw the fetch failure directly. Also note that, if the corruption happens inside BufferReleasingInputStream, the reducer will throw the fetch failure immediately no matter what the cause is, since the data has been partially consumed by downstream RDDs. If corruption happens again after retry, the reducer will [...] Please check out https://github.com/apache/spark/pull/32385 to see the completed proposal of the shuffle checksum project. ### Why are the changes needed? Shuffle data corruption is a long-standing issue in Spark. 
For example, in SPARK-18105, people have continually reported corruption issues. However, data corruption is difficult to reproduce in most cases and even harder to trace to a root cause. We don't know if it's a Spark issue or not. With the diagnosis support for shuffle corruption, Spark itself can at least distinguish whether the cause is disk or network, which is very important for users. ### Does this PR introduce _any_ user-facing change? Yes, users may know the cause of the shuffle corruption after this change. ### How was this patch tested? Added tests. Closes #33451 from Ngone51/SPARK-36206. Authored-by: yi.wu Signed-off-by: Mridul Muralidharan (cherry picked from commit a98d919da470abaf2e99060f99007a5373032fe1) Signed-off-by: Mridul Muralidharan --- .../spark/network/shuffle/BlockStoreClient.java| 45 +- .../network/shuffle/ExternalBlockHandler.java | 13 +- .../network/shuffle/ExternalBlockStoreClient.java | 22 +-- .../shuffle/ExternalShuffleBlockResolver.java | 25 .../spark/network/shuffle/checksum/Cause.java | 25 .../shuffle/checksum/ShuffleChecksumHelper.java| 158 + .../shuffle/protocol/BlockTransferMessage.java | 4 +- .../network/shuffle/protocol/CorruptionCause.java | 74 ++ .../shuffle/protocol/DiagnoseCorruption.java | 131 + .../network/shuffle/ExternalBlockHandlerSuite.java | 115 ++- .../shuffle/checksum/ShuffleChecksumHelper.java| 100 - .../shuffle/checksum/ShuffleChecksumSupport.java | 45 ++ .../shuffle/sort/BypassMergeSortShuffleWriter.java | 15 +- .../spark/shuffle/sort/ShuffleExternalSorter.java | 9 +- .../spark/shuffle/sort/UnsafeShuffleWriter.java| 2 +- .../org/apache/spark/internal/config/package.scala | 4 +- .../apache/spark/network/BlockDataManager.scala| 9 ++ .../spark/network/netty/NettyBlockRpcServer.scala | 7 + .../network/netty/NettyBlockTransferService.scala | 3 +- .../spark/shuffle/BlockStoreShuffleReader.scala| 2 + .../spark/shuffle/IndexShuffleBlockResolver.scala | 14 +- .../scala/org/apache/spark/storage/BlockId.scala | 2 +- 
.../org/apache/spark/storage/BlockManager.scala| 25 +++- .../storage/ShuffleBlockFetcherIterator.scala | 129 +++-- .../spark/util/collection/ExternalSorter.scala | 9 +- .../shuffle/sort/UnsafeShuffleWriterSuite.java | 25 ++-- .../test/scala/org/apache/spark/ShuffleSuite.scala | 37 - .../spark/shuffle/ShuffleChecksumTestHelper.scala | 11 +- .../sort/BypassMergeSortShuffleWriterSuite.scala | 13 +- .../sort/IndexShuffleBlockResolverSuite.scala | 2 +- .../shuffle/sort/SortShuffleWriterSuite.scala | 6 +- .../storage/ShuffleBlockFetcherIteratorSuite.scala | 82 ++- 32 files changed, 973 insertions(+), 190 deletions(-) diff --git
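The diagnosis protocol described in the commit message reduces to a three-checksum comparison. A minimal Python sketch (illustrative names; the real logic lives in `ShuffleChecksumHelper` and the shuffle reader): c1 is the checksum the reader computes over the corrupt block, c2 the checksum stored in the server-side checksum file, and c3 the checksum the server recomputes over the stored block.

```python
# Sketch of the three-checksum shuffle-corruption diagnosis. Names are
# illustrative, not the real Spark API.
DISK_ISSUE, NETWORK_ISSUE, CHECKSUM_VERIFY_PASS = "disk", "network", "pass"


def diagnose(c1: int, c2: int, c3: int) -> str:
    if c2 != c3:
        # Stored checksum no longer matches the stored data: disk issue.
        return DISK_ISSUE
    if c1 != c3:
        # Server-side data is intact but the reader saw different bytes,
        # so the corruption happened in transit: network issue.
        return NETWORK_ISSUE
    return CHECKSUM_VERIFY_PASS


def should_retry(cause: str) -> bool:
    # Only a network-caused corruption gets a retry; otherwise the reader
    # throws the fetch failure directly.
    return cause == NETWORK_ISSUE


print(diagnose(c1=111, c2=222, c3=333))  # disk
print(diagnose(c1=111, c2=333, c3=333))  # network
print(diagnose(c1=333, c2=333, c3=333))  # pass
```

The ordering of the checks matters: a disk mismatch (c2 != c3) is ruled out first, because a stale stored checksum would otherwise be misread as a transit error and trigger a useless retry.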
[spark] branch master updated: [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum
This is an automated email from the ASF dual-hosted git repository. mridulm80 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a98d919 [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum a98d919 is described below commit a98d919da470abaf2e99060f99007a5373032fe1 Author: yi.wu AuthorDate: Mon Aug 2 09:58:36 2021 -0500 [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum ### What changes were proposed in this pull request? This PR adds support to diagnose shuffle data corruption. Basically, the diagnosis mechanism works like this: The shuffle reader would calculate the checksum (c1) for the corrupted shuffle block and send it to the server where the block is stored. At the server, it would read back the checksum (c2) that is stored in the checksum file and recalculate the checksum (c3) for the corresponding shuffle block. Then, if c2 != c3, we suspect the corruption is caused by a disk issue. Otherwise, if c1 != c3, we suspect the corruption is caused by a network issue. Otherwise, the checksum verifies pa [...] After the shuffle reader receives the diagnosis response, it'd take action based on the type of cause. Only in the case of a network issue would we give a retry. Otherwise, we'd throw the fetch failure directly. Also note that, if the corruption happens inside BufferReleasingInputStream, the reducer will throw the fetch failure immediately no matter what the cause is, since the data has been partially consumed by downstream RDDs. If corruption happens again after retry, the reducer will [...] Please check out https://github.com/apache/spark/pull/32385 to see the completed proposal of the shuffle checksum project. ### Why are the changes needed? Shuffle data corruption is a long-standing issue in Spark. For example, in SPARK-18105, people have continually reported corruption issues. 
However, data corruption is difficult to reproduce in most cases and even harder to trace to a root cause. We don't know if it's a Spark issue or not. With the diagnosis support for shuffle corruption, Spark itself can at least distinguish whether the cause is disk or network, which is very important for users. ### Does this PR introduce _any_ user-facing change? Yes, users may know the cause of the shuffle corruption after this change. ### How was this patch tested? Added tests. Closes #33451 from Ngone51/SPARK-36206. Authored-by: yi.wu Signed-off-by: Mridul Muralidharan --- .../spark/network/shuffle/BlockStoreClient.java| 45 +- .../network/shuffle/ExternalBlockHandler.java | 13 +- .../network/shuffle/ExternalBlockStoreClient.java | 22 +-- .../shuffle/ExternalShuffleBlockResolver.java | 25 .../spark/network/shuffle/checksum/Cause.java | 25 .../shuffle/checksum/ShuffleChecksumHelper.java| 158 + .../shuffle/protocol/BlockTransferMessage.java | 4 +- .../network/shuffle/protocol/CorruptionCause.java | 74 ++ .../shuffle/protocol/DiagnoseCorruption.java | 131 + .../network/shuffle/ExternalBlockHandlerSuite.java | 115 ++- .../shuffle/checksum/ShuffleChecksumHelper.java| 100 - .../shuffle/checksum/ShuffleChecksumSupport.java | 45 ++ .../shuffle/sort/BypassMergeSortShuffleWriter.java | 15 +- .../spark/shuffle/sort/ShuffleExternalSorter.java | 9 +- .../spark/shuffle/sort/UnsafeShuffleWriter.java| 2 +- .../org/apache/spark/internal/config/package.scala | 4 +- .../apache/spark/network/BlockDataManager.scala| 9 ++ .../spark/network/netty/NettyBlockRpcServer.scala | 7 + .../network/netty/NettyBlockTransferService.scala | 3 +- .../spark/shuffle/BlockStoreShuffleReader.scala| 2 + .../spark/shuffle/IndexShuffleBlockResolver.scala | 14 +- .../scala/org/apache/spark/storage/BlockId.scala | 2 +- .../org/apache/spark/storage/BlockManager.scala| 25 +++- .../storage/ShuffleBlockFetcherIterator.scala | 129 +++-- .../spark/util/collection/ExternalSorter.scala | 9 +- 
.../shuffle/sort/UnsafeShuffleWriterSuite.java | 25 ++-- .../test/scala/org/apache/spark/ShuffleSuite.scala | 37 - .../spark/shuffle/ShuffleChecksumTestHelper.scala | 11 +- .../sort/BypassMergeSortShuffleWriterSuite.scala | 13 +- .../sort/IndexShuffleBlockResolverSuite.scala | 2 +- .../shuffle/sort/SortShuffleWriterSuite.scala | 6 +- .../storage/ShuffleBlockFetcherIteratorSuite.scala | 82 ++- 32 files changed, 973 insertions(+), 190 deletions(-) diff --git a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/BlockStoreClient.java
[spark] branch master updated (951efb8 -> be06e41)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 951efb8 [SPARK-36237][UI][SQL] Attach and start handler after application started in UI add be06e41 [SPARK-35918][AVRO] Unify schema mismatch handling for read/write and enhance error messages No new revisions were added by this update. Summary of changes: .../apache/spark/sql/avro/AvroDeserializer.scala | 54 ++-- .../org/apache/spark/sql/avro/AvroSerializer.scala | 35 +++--- .../org/apache/spark/sql/avro/AvroUtils.scala | 75 +- .../spark/sql/avro/AvroSchemaHelperSuite.scala | 59 +++-- .../org/apache/spark/sql/avro/AvroSerdeSuite.scala | 46 +++-- .../org/apache/spark/sql/avro/AvroSuite.scala | 3 +- 6 files changed, 173 insertions(+), 99 deletions(-)
[spark] branch master updated (3b713e7 -> 951efb8)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 3b713e7 [SPARK-36372][SQL] v2 ALTER TABLE ADD COLUMNS should check duplicates for the user specified columns add 951efb8 [SPARK-36237][UI][SQL] Attach and start handler after application started in UI No new revisions were added by this update. Summary of changes: .../main/scala/org/apache/spark/SparkContext.scala | 7 ++-- .../main/scala/org/apache/spark/TestUtils.scala| 13 +++ .../main/scala/org/apache/spark/ui/SparkUI.scala | 40 ++ .../src/main/scala/org/apache/spark/ui/WebUI.scala | 13 --- .../test/scala/org/apache/spark/ui/UISuite.scala | 22 5 files changed, 89 insertions(+), 6 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-36372][SQL] v2 ALTER TABLE ADD COLUMNS should check duplicates for the user specified columns
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 87ae397 [SPARK-36372][SQL] v2 ALTER TABLE ADD COLUMNS should check duplicates for the user specified columns 87ae397 is described below commit 87ae3978971c46a17f8a550f9d3e6934a74cc3a4 Author: Terry Kim AuthorDate: Mon Aug 2 17:54:50 2021 +0800 [SPARK-36372][SQL] v2 ALTER TABLE ADD COLUMNS should check duplicates for the user specified columns ### What changes were proposed in this pull request? Currently, v2 ALTER TABLE ADD COLUMNS does not check duplicates for the user specified columns. For example, ``` spark.sql(s"CREATE TABLE $t (id int) USING $v2Format") spark.sql("ALTER TABLE $t ADD COLUMNS (data string, data string)") ``` doesn't fail the analysis, and it's up to the catalog implementation to handle it. For the v1 command, the duplication is checked before invoking the catalog. ### Why are the changes needed? To check the duplicate columns during analysis and be consistent with the v1 command. ### Does this PR introduce _any_ user-facing change? Yes, now the above command will print out the following: ``` org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data` ``` ### How was this patch tested? Added new unit tests Closes #33600 from imback82/alter_add_duplicate_columns. 
Authored-by: Terry Kim Signed-off-by: Wenchen Fan (cherry picked from commit 3b713e7f6189dfe1c5bbb1a527bf1266bde69f69) Signed-off-by: Wenchen Fan --- .../sql/catalyst/analysis/CheckAnalysis.scala | 5 .../spark/sql/connector/AlterTableTests.scala | 23 .../connector/V2CommandsCaseSensitivitySuite.scala | 32 -- 3 files changed, 57 insertions(+), 3 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index 2d8ac64..77f721c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -30,6 +30,7 @@ import org.apache.spark.sql.connector.catalog.{LookupCatalog, SupportsPartitionM import org.apache.spark.sql.errors.{QueryCompilationErrors, QueryExecutionErrors} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ +import org.apache.spark.sql.util.SchemaUtils /** * Throws user facing errors when passed invalid queries that fail to analyze. 
@@ -951,6 +952,10 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog { colsToAdd.foreach { colToAdd => checkColumnNotExists("add", colToAdd.name, table.schema) } +SchemaUtils.checkColumnNameDuplication( + colsToAdd.map(_.name.quoted), + "in the user specified columns", + alter.conf.resolver) case AlterTableRenameColumn(table: ResolvedTable, col: ResolvedFieldName, newName) => checkColumnNotExists("rename", col.path :+ newName, table.schema) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala b/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala index 004a64a..1bd45f5 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala @@ -384,6 +384,29 @@ trait AlterTableTests extends SharedSparkSession { } } + test("SPARK-36372: Adding duplicate columns should not be allowed") { +val t = s"${catalogAndNamespace}table_name" +withTable(t) { + sql(s"CREATE TABLE $t (id int) USING $v2Format") + val e = intercept[AnalysisException] { +sql(s"ALTER TABLE $t ADD COLUMNS (data string, data1 string, data string)") + } + assert(e.message.contains("Found duplicate column(s) in the user specified columns: `data`")) +} + } + + test("SPARK-36372: Adding duplicate nested columns should not be allowed") { +val t = s"${catalogAndNamespace}table_name" +withTable(t) { + sql(s"CREATE TABLE $t (id int, point struct) USING $v2Format") + val e = intercept[AnalysisException] { +sql(s"ALTER TABLE $t ADD COLUMNS (point.z double, point.z double, point.xx double)") + } + assert(e.message.contains( +"Found duplicate column(s) in the user specified columns: `point.z`")) +} + } + test("AlterTable: update column type int -> long") { val t = s"${catalogAndNamespace}table_name" withTable(t) { diff --git
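The analysis-time check added above can be sketched in a few lines of Python. This is a hypothetical helper mirroring `SchemaUtils.checkColumnNameDuplication` (Spark's default resolver is case-insensitive, `spark.sql.caseSensitive=false`), not the real implementation:

```python
# Hedged sketch of the duplicate-column check that SPARK-36372 runs during
# analysis. The helper name and signature are illustrative only.
from collections import Counter


class AnalysisException(Exception):
    pass


def check_column_name_duplication(names, case_sensitive=False):
    # With a case-insensitive resolver, "a" and "A" count as duplicates.
    keyed = names if case_sensitive else [n.lower() for n in names]
    dups = sorted({k for k, c in Counter(keyed).items() if c > 1})
    if dups:
        cols = ", ".join(f"`{d}`" for d in dups)
        raise AnalysisException(
            f"Found duplicate column(s) in the user specified columns: {cols}")


check_column_name_duplication(["data", "data1"])  # passes silently
try:
    check_column_name_duplication(["point.z", "point.z", "point.xx"])
except AnalysisException as e:
    print(e)  # Found duplicate column(s) in the user specified columns: `point.z`
```

Running the check over the user-specified column list before the catalog is invoked is what makes the v2 command fail analysis the same way the v1 command already did.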
[spark] branch master updated: [SPARK-36372][SQL] v2 ALTER TABLE ADD COLUMNS should check duplicates for the user specified columns
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3b713e7 [SPARK-36372][SQL] v2 ALTER TABLE ADD COLUMNS should check duplicates for the user specified columns 3b713e7 is described below commit 3b713e7f6189dfe1c5bbb1a527bf1266bde69f69 Author: Terry Kim AuthorDate: Mon Aug 2 17:54:50 2021 +0800 [SPARK-36372][SQL] v2 ALTER TABLE ADD COLUMNS should check duplicates for the user specified columns ### What changes were proposed in this pull request? Currently, v2 ALTER TABLE ADD COLUMNS does not check duplicates for the user specified columns. For example, ``` spark.sql(s"CREATE TABLE $t (id int) USING $v2Format") spark.sql("ALTER TABLE $t ADD COLUMNS (data string, data string)") ``` doesn't fail the analysis, and it's up to the catalog implementation to handle it. For the v1 command, the duplication is checked before invoking the catalog. ### Why are the changes needed? To check the duplicate columns during analysis and be consistent with the v1 command. ### Does this PR introduce _any_ user-facing change? Yes, now the above command will print out the following: ``` org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data` ``` ### How was this patch tested? Added new unit tests Closes #33600 from imback82/alter_add_duplicate_columns. 
Authored-by: Terry Kim Signed-off-by: Wenchen Fan --- .../sql/catalyst/analysis/CheckAnalysis.scala | 5 .../spark/sql/connector/AlterTableTests.scala | 23 .../connector/V2CommandsCaseSensitivitySuite.scala | 32 -- 3 files changed, 57 insertions(+), 3 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index 2d8ac64..77f721c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -30,6 +30,7 @@ import org.apache.spark.sql.connector.catalog.{LookupCatalog, SupportsPartitionM import org.apache.spark.sql.errors.{QueryCompilationErrors, QueryExecutionErrors} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ +import org.apache.spark.sql.util.SchemaUtils /** * Throws user facing errors when passed invalid queries that fail to analyze. 
@@ -951,6 +952,10 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog { colsToAdd.foreach { colToAdd => checkColumnNotExists("add", colToAdd.name, table.schema) } +SchemaUtils.checkColumnNameDuplication( + colsToAdd.map(_.name.quoted), + "in the user specified columns", + alter.conf.resolver) case AlterTableRenameColumn(table: ResolvedTable, col: ResolvedFieldName, newName) => checkColumnNotExists("rename", col.path :+ newName, table.schema) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala b/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala index 004a64a..1bd45f5 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala @@ -384,6 +384,29 @@ trait AlterTableTests extends SharedSparkSession { } } + test("SPARK-36372: Adding duplicate columns should not be allowed") { +val t = s"${catalogAndNamespace}table_name" +withTable(t) { + sql(s"CREATE TABLE $t (id int) USING $v2Format") + val e = intercept[AnalysisException] { +sql(s"ALTER TABLE $t ADD COLUMNS (data string, data1 string, data string)") + } + assert(e.message.contains("Found duplicate column(s) in the user specified columns: `data`")) +} + } + + test("SPARK-36372: Adding duplicate nested columns should not be allowed") { +val t = s"${catalogAndNamespace}table_name" +withTable(t) { + sql(s"CREATE TABLE $t (id int, point struct) USING $v2Format") + val e = intercept[AnalysisException] { +sql(s"ALTER TABLE $t ADD COLUMNS (point.z double, point.z double, point.xx double)") + } + assert(e.message.contains( +"Found duplicate column(s) in the user specified columns: `point.z`")) +} + } + test("AlterTable: update column type int -> long") { val t = s"${catalogAndNamespace}table_name" withTable(t) { diff --git a/sql/core/src/test/scala/org/apache/spark/sql/connector/V2CommandsCaseSensitivitySuite.scala