[spark] branch branch-3.2 updated (de837a0 -> 71e4c56)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from de837a0  [SPARK-36331][CORE] Add standard SQLSTATEs to error guidelines
     add 71e4c56  [SPARK-36367][3.2][PYTHON] Partially backport to avoid unexpected error with pandas 1.3

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/groupby.py | 18 +++---
 python/pyspark/pandas/series.py  |  4 ++--
 2 files changed, 13 insertions(+), 9 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (63517eb -> 8cb9cf3)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 63517eb  [SPARK-36331][CORE] Add standard SQLSTATEs to error guidelines
     add 8cb9cf3  [SPARK-36345][SPARK-36367][INFRA][PYTHON] Disable tests failed by the incompatible behavior of pandas 1.3

No new revisions were added by this update.

Summary of changes:
 .github/workflows/build_and_test.yml               |  4 +-
 python/pyspark/pandas/groupby.py                   | 18 +++--
 python/pyspark/pandas/series.py                    |  4 +-
 .../tests/data_type_ops/test_categorical_ops.py    |  6 +-
 python/pyspark/pandas/tests/indexes/test_base.py   | 76 +++-
 .../pyspark/pandas/tests/indexes/test_category.py  |  5 +-
 python/pyspark/pandas/tests/test_categorical.py    | 82 ++
 python/pyspark/pandas/tests/test_expanding.py      | 51 --
 .../test_ops_on_diff_frames_groupby_expanding.py   | 13 ++--
 .../test_ops_on_diff_frames_groupby_rolling.py     | 14 ++--
 python/pyspark/pandas/tests/test_rolling.py        | 52 --
 python/pyspark/pandas/tests/test_series.py         | 16 +++--
 12 files changed, 227 insertions(+), 114 deletions(-)
[spark] branch branch-3.2 updated (9eec11b -> de837a0)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 9eec11b  [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode
     add de837a0  [SPARK-36331][CORE] Add standard SQLSTATEs to error guidelines

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/README.md | 169 +++-
 1 file changed, 165 insertions(+), 4 deletions(-)
[spark] branch master updated (c20af53 -> 63517eb)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from c20af53  [SPARK-36373][SQL] DecimalPrecision only add necessary cast
     add 63517eb  [SPARK-36331][CORE] Add standard SQLSTATEs to error guidelines

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/README.md | 169 +++-
 1 file changed, 165 insertions(+), 4 deletions(-)
[spark] branch master updated: [SPARK-36373][SQL] DecimalPrecision only add necessary cast
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new c20af53  [SPARK-36373][SQL] DecimalPrecision only add necessary cast
c20af53 is described below

commit c20af535803a7250fef047c2bf0fe30be242369d
Author: Yuming Wang
AuthorDate: Tue Aug 3 08:12:54 2021 +0800

    [SPARK-36373][SQL] DecimalPrecision only add necessary cast

    ### What changes were proposed in this pull request?

    This PR makes `DecimalPrecision` add only the casts that are actually necessary, similar to [`ImplicitTypeCasts`](https://github.com/apache/spark/blob/96c2919988ddf78d104103876d8d8221e8145baa/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L675-L678). For example:
    ```
    EqualTo(AttributeReference("d1", DecimalType(5, 2))(), AttributeReference("d2", DecimalType(2, 1))())
    ```
    It will add a useless cast to _d1_:
    ```
    (cast(d1#6 as decimal(5,2)) = cast(d2#7 as decimal(5,2)))
    ```

    ### Why are the changes needed?

    1. Avoid adding unnecessary casts, even though they would be removed by `SimplifyCasts` later.
    2. I'm trying to add an extended rule similar to `PullOutGroupingExpressions`. The current behavior introduces an additional alias, for example: `cast(d1 as decimal(5,2)) as cast_d1`.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Unit test.

    Closes #33602 from wangyum/SPARK-36373.
Authored-by: Yuming Wang
Signed-off-by: Yuming Wang
---
 .../org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala     | 4 +++-
 .../apache/spark/sql/catalyst/analysis/DecimalPrecisionSuite.scala    | 3 +++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala
index bf128cd..b112641 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala
@@ -204,7 +204,9 @@ object DecimalPrecision extends TypeCoercionRule {
     case b @ BinaryComparison(e1 @ DecimalType.Expression(p1, s1),
         e2 @ DecimalType.Expression(p2, s2)) if p1 != p2 || s1 != s2 =>
       val resultType = widerDecimalType(p1, s1, p2, s2)
-      b.makeCopy(Array(Cast(e1, resultType), Cast(e2, resultType)))
+      val newE1 = if (e1.dataType == resultType) e1 else Cast(e1, resultType)
+      val newE2 = if (e2.dataType == resultType) e2 else Cast(e2, resultType)
+      b.makeCopy(Array(newE1, newE2))
 }

 /**
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecisionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecisionSuite.scala
index 834f38a..698d2d5 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecisionSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecisionSuite.scala
@@ -59,6 +59,9 @@ class DecimalPrecisionSuite extends AnalysisTest with BeforeAndAfter {
     val comparison = analyzer.execute(plan).collect {
       case Project(Alias(e: BinaryComparison, _) :: Nil, _) => e
     }.head
+    // Only add necessary cast.
+    assert(comparison.left.children.forall(_.dataType !== expectedType))
+    assert(comparison.right.children.forall(_.dataType !== expectedType))
     assert(comparison.left.dataType === expectedType)
     assert(comparison.right.dataType === expectedType)
   }
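The idea behind the commit above can be illustrated with a minimal sketch in plain Python (toy tuples, not Spark's `DecimalType`; the names `wider_decimal` and `needs_cast` are illustrative, and precision capping is ignored): comparing `decimal(5,2)` with `decimal(2,1)` widens to `decimal(5,2)`, so only the second operand actually needs a cast.

```python
# Toy sketch of DecimalPrecision's "only add necessary cast" behavior.
# A decimal type is modeled as a (precision, scale) tuple.

def wider_decimal(p1, s1, p2, s2):
    """Wider of decimal(p1,s1) and decimal(p2,s2): keep the larger scale and
    enough integral digits for both sides (precision capping ignored here)."""
    scale = max(s1, s2)
    integral = max(p1 - s1, p2 - s2)
    return (integral + scale, scale)

def needs_cast(operand_type, result_type):
    """After the patch, a cast is inserted only when the type actually changes."""
    return operand_type != result_type

# d1: decimal(5,2), d2: decimal(2,1) -> wider type is decimal(5,2),
# so d1 keeps its type while d2 gets a cast.
wider = wider_decimal(5, 2, 2, 1)
```

Before the patch both sides were unconditionally wrapped in `Cast`; the `needs_cast` check is what avoids the `cast(d1#6 as decimal(5,2))` no-op seen in the commit message.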
[spark] branch master updated (0bbcbc6 -> 7a27f8a)
This is an automated email from the ASF dual-hosted git repository.

viirya pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 0bbcbc6  [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode
     add 7a27f8a  [SPARK-36137][SQL] HiveShim should fallback to getAllPartitionsOf even if directSQL is enabled in remote HMS

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/errors/QueryExecutionErrors.scala    |  2 +-
 .../org/apache/spark/sql/internal/SQLConf.scala    | 13 +++
 .../apache/spark/sql/hive/client/HiveShim.scala    | 24 +++
 .../hive/client/HivePartitionFilteringSuite.scala  | 27 +-
 4 files changed, 49 insertions(+), 17 deletions(-)
[GitHub] [spark-website] dongjoon-hyun commented on pull request #351: updating local k8s/minikube testing instructions
dongjoon-hyun commented on pull request #351:
URL: https://github.com/apache/spark-website/pull/351#issuecomment-891219026

   +1, LGTM. Thank you.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [spark-website] asfgit closed pull request #351: updating local k8s/minikube testing instructions
asfgit closed pull request #351:
URL: https://github.com/apache/spark-website/pull/351
[spark-website] branch asf-site updated: updating local k8s/minikube testing instructions
This is an automated email from the ASF dual-hosted git repository.

shaneknapp pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new d60611b  updating local k8s/minikube testing instructions
d60611b is described below

commit d60611b5c464db0fb99ca072a9d8b55e824ca7c2
Author: shane knapp
AuthorDate: Mon Aug 2 10:15:46 2021 -0700

    updating local k8s/minikube testing instructions

    a small update to the k8s/minikube integration test instructions

    Author: shane knapp

    Closes #351 from shaneknapp/updating-k8s-docs.
---
 developer-tools.md        | 9 ++---
 site/developer-tools.html | 9 ++---
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/developer-tools.md b/developer-tools.md
index 9551533..bf3ee6e 100644
--- a/developer-tools.md
+++ b/developer-tools.md
@@ -169,8 +169,8 @@ Please check other available options via `python/run-tests[-with-coverage] --hel
 If you have made changes to the K8S bindings in Apache Spark, it would behoove you to test locally before submitting a PR. This is relatively simple to do, but it will require a local (to you) installation of [minikube](https://kubernetes.io/docs/setup/minikube/). Due to how minikube interacts with the host system, please be sure to set things up as follows:

-- minikube version v1.7.3 (or greater)
-- You must use a VM driver! Running minikube with the `--vm-driver=none` option requires that the user launching minikube/k8s have root access. Our Jenkins workers use the [kvm2](https://minikube.sigs.k8s.io/docs/drivers/kvm2/) drivers. More details [here](https://minikube.sigs.k8s.io/docs/drivers/).
+- minikube version v1.18.1 (or greater)
+- You must use a VM driver! Running minikube with the `--vm-driver=none` option requires that the user launching minikube/k8s have root access, which could impact how the tests are run. Our Jenkins workers use the default [docker](https://minikube.sigs.k8s.io/docs/drivers/docker/) drivers. More details [here](https://minikube.sigs.k8s.io/docs/drivers/).
 - kubernetes version v1.17.3 (can be set by executing `minikube config set kubernetes-version v1.17.3`)
 - the current kubernetes context must be minikube's default context (called 'minikube'). This can be selected by `minikube kubectl -- config use-context minikube`. This is only needed when after minikube is started another kubernetes context is selected.

@@ -196,8 +196,11 @@ PVC_TMP_DIR=$(mktemp -d)
 export PVC_TESTS_HOST_PATH=$PVC_TMP_DIR
 export PVC_TESTS_VM_PATH=$PVC_TMP_DIR

-minikube --vm-driver= start --memory 6000 --cpus 8
 minikube config set kubernetes-version v1.17.3
+minikube --vm-driver= start --memory 6000 --cpus 8
+
+# for macos only (see https://github.com/apache/spark/pull/32793):
+# minikube ssh "sudo useradd spark -u 185 -g 0 -m -s /bin/bash"

 minikube mount ${PVC_TESTS_HOST_PATH}:${PVC_TESTS_VM_PATH} --9p-version=9p2000.L --gid=0 --uid=185 &; MOUNT_PID=$!

diff --git a/site/developer-tools.html b/site/developer-tools.html
index 5558945..81d64c7 100644
--- a/site/developer-tools.html
+++ b/site/developer-tools.html
@@ -348,8 +348,8 @@ Generating HTML files for PySpark coverage under /.../spark/python/test_coverage
 If you have made changes to the K8S bindings in Apache Spark, it would behoove you to test locally before submitting a PR. This is relatively simple to do, but it will require a local (to you) installation of <a href="https://kubernetes.io/docs/setup/minikube/">minikube</a>. Due to how minikube interacts with the host system, please be sure to set things up as follows:

- minikube version v1.7.3 (or greater)
- You must use a VM driver! Running minikube with the --vm-driver=none option requires that the user launching minikube/k8s have root access. Our Jenkins workers use the <a href="https://minikube.sigs.k8s.io/docs/drivers/kvm2/">kvm2</a> drivers. More details <a href="https://minikube.sigs.k8s.io/docs/drivers/">here</a>.
+ minikube version v1.18.1 (or greater)
+ You must use a VM driver! Running minikube with the --vm-driver=none option requires that the user launching minikube/k8s have root access, which could impact how the tests are run. Our Jenkins workers use the default <a href="https://minikube.sigs.k8s.io/docs/drivers/docker/">docker</a> drivers. More details <a href="https://minikube.sigs.k8s.io/docs/drivers/">here</a>.
 kubernetes version v1.17.3 (can be set by executing minikube config set kubernetes-version v1.17.3)
 the current kubernetes context must be minikube's default context (called 'minikube'). This can be selected by minikube kubectl -- config use-context minikube. This is only needed when after minikube is started another kubernetes context is selected.

@@ -374,8 +374,11 @@ export TARBALL_TO_TEST=($(pwd)/spark-*${DATE}-${REVISION}.tgz)
 export PVC_TESTS_HOST_PATH=$PVC_TMP_DIR
 export PVC_TESTS_VM_PATH=$PVC_TMP_DIR

-minikube
[GitHub] [spark-website] shaneknapp opened a new pull request #351: updating local k8s/minikube testing instructions
shaneknapp opened a new pull request #351:
URL: https://github.com/apache/spark-website/pull/351

   a small update to the k8s/minikube integration test instructions
[spark] branch branch-3.2 updated: [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 9eec11b  [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode
9eec11b is described below

commit 9eec11b9567c033b10a6d70340e63d6270b481e8
Author: Hyukjin Kwon
AuthorDate: Mon Aug 2 10:01:12 2021 -0700

    [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode

    This PR proposes to fail properly so the JSON parser can proceed and parse the input with the permissive mode. Previously, we passed `null`s through as they were; the root `InternalRow`s became `null`s, and that caused the query to fail even with permissive mode on. Now, we fail explicitly if `null` is passed when the input array contains `null`.

    Note that this is consistent with non-array JSON input:

    **Permissive mode:**
    ```scala
    spark.read.json(Seq("""{"a": "str"}""", """null""").toDS).collect()
    ```
    ```
    res0: Array[org.apache.spark.sql.Row] = Array([str], [null])
    ```

    **Failfast mode:**
    ```scala
    spark.read.option("mode", "failfast").json(Seq("""{"a": "str"}""", """null""").toDS).collect()
    ```
    ```
    org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
        at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
        at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
    ```

    This makes the permissive mode proceed and parse without throwing an exception.

    **Permissive mode:**
    ```scala
    spark.read.json(Seq("""[{"a": "str"}, null]""").toDS).collect()
    ```
    Before:
    ```
    java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    ```
    After:
    ```
    res0: Array[org.apache.spark.sql.Row] = Array([null])
    ```
    NOTE that this behaviour is consistent when the JSON object is malformed:
    ```scala
    spark.read.schema("a int").json(Seq("""[{"a": 123}, {123123}, {"a": 123}]""").toDS).collect()
    ```
    ```
    res0: Array[org.apache.spark.sql.Row] = Array([null])
    ```
    Since we're parsing _one_ JSON array, related records all fail together.

    **Failfast mode:**
    ```scala
    spark.read.option("mode", "failfast").json(Seq("""[{"a": "str"}, null]""").toDS).collect()
    ```
    Before:
    ```
    java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    ```
    After:
    ```
    org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
        at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
        at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
    ```

    Manually tested, and a unit test was added.

    Closes #33608 from HyukjinKwon/SPARK-36379.

Authored-by: Hyukjin Kwon
Signed-off-by: Dongjoon Hyun
(cherry picked from commit 0bbcbc65080cd67a9997f49906d9d48fdf21db10)
Signed-off-by: Dongjoon Hyun
---
 .../org/apache/spark/sql/catalyst/json/JacksonParser.scala  |  9 ++---
 .../spark/sql/execution/datasources/json/JsonSuite.scala    | 14 ++
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
index 8a1191c..04a0f1a 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
@@ -108,7 +108,7 @@ class JacksonParser(
 // List([str_a_2,null], [null,str_b_3])
[spark] branch master updated: [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0bbcbc6  [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode
0bbcbc6 is described below

commit 0bbcbc65080cd67a9997f49906d9d48fdf21db10
Author: Hyukjin Kwon
AuthorDate: Mon Aug 2 10:01:12 2021 -0700

    [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode

    ### What changes were proposed in this pull request?

    This PR proposes to fail properly so the JSON parser can proceed and parse the input with the permissive mode. Previously, we passed `null`s through as they were; the root `InternalRow`s became `null`s, and that caused the query to fail even with permissive mode on. Now, we fail explicitly if `null` is passed when the input array contains `null`.

    Note that this is consistent with non-array JSON input:

    **Permissive mode:**
    ```scala
    spark.read.json(Seq("""{"a": "str"}""", """null""").toDS).collect()
    ```
    ```
    res0: Array[org.apache.spark.sql.Row] = Array([str], [null])
    ```

    **Failfast mode:**
    ```scala
    spark.read.option("mode", "failfast").json(Seq("""{"a": "str"}""", """null""").toDS).collect()
    ```
    ```
    org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
        at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
        at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
    ```

    ### Why are the changes needed?

    To make the permissive mode proceed and parse without throwing an exception.

    ### Does this PR introduce _any_ user-facing change?

    **Permissive mode:**
    ```scala
    spark.read.json(Seq("""[{"a": "str"}, null]""").toDS).collect()
    ```
    Before:
    ```
    java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    ```
    After:
    ```
    res0: Array[org.apache.spark.sql.Row] = Array([null])
    ```
    NOTE that this behaviour is consistent when the JSON object is malformed:
    ```scala
    spark.read.schema("a int").json(Seq("""[{"a": 123}, {123123}, {"a": 123}]""").toDS).collect()
    ```
    ```
    res0: Array[org.apache.spark.sql.Row] = Array([null])
    ```
    Since we're parsing _one_ JSON array, related records all fail together.

    **Failfast mode:**
    ```scala
    spark.read.option("mode", "failfast").json(Seq("""[{"a": "str"}, null]""").toDS).collect()
    ```
    Before:
    ```
    java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    ```
    After:
    ```
    org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
        at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
        at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
    ```

    ### How was this patch tested?

    Manually tested, and a unit test was added.

    Closes #33608 from HyukjinKwon/SPARK-36379.

Authored-by: Hyukjin Kwon
Signed-off-by: Dongjoon Hyun
---
 .../org/apache/spark/sql/catalyst/json/JacksonParser.scala  |  9 ++---
 .../spark/sql/execution/datasources/json/JsonSuite.scala    | 14 ++
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
index 54cf251..dfa746f 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
@@
[spark] branch master updated: [SPARK-35430][K8S] Switch on "PVs with local storage" integration test on Docker driver
This is an automated email from the ASF dual-hosted git repository.

shaneknapp pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 7b90fd2  [SPARK-35430][K8S] Switch on "PVs with local storage" integration test on Docker driver
7b90fd2 is described below

commit 7b90fd2ca79b9a1fec5fca0bdcc169c7962ad880
Author: attilapiros
AuthorDate: Mon Aug 2 09:17:29 2021 -0700

    [SPARK-35430][K8S] Switch on "PVs with local storage" integration test on Docker driver

    ### What changes were proposed in this pull request?

    Switching back on the "PVs with local storage" integration test on the Docker driver.

    I have analyzed why this test was failing on my machine (I hope the root cause of the problem is OS agnostic). It failed because the host directory was mounted into the Minikube node using `--uid=185` (the Spark user's user id):
    ```
    $ minikube mount ${PVC_TESTS_HOST_PATH}:${PVC_TESTS_VM_PATH} --9p-version=9p2000.L --gid=0 --uid=185 &; MOUNT_PID=$!
    ```
    which refers to a nonexistent user. See the number of occurrences of 185 in "/etc/passwd":
    ```
    $ minikube ssh "grep -c 185 /etc/passwd"
    0
    ```
    This leads to a permission denied. Skipping `--uid=185` won't help, although the path will be listable before the test execution:
    ```
    ╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/spark ‹SPARK-35430*›
    ╰─$ Mounting host path /var/folders/t_/fr_vqcyx23vftk81ftz1k5hwgn/T/tmp.k9X4Gecv into VM as /var/folders/t_/fr_vqcyx23vftk81ftz1k5hwgn/T/tmp.k9X4Gecv ...
        ▪ Mount type:
        ▪ User ID: docker
        ▪ Group ID: 0
        ▪ Version: 9p2000.L
        ▪ Message Size: 262144
        ▪ Permissions: 755 (-rwxr-xr-x)
        ▪ Options: map[]
        ▪ Bind Address: 127.0.0.1:51740
    Userspace file server: ufs starting
    ╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/spark ‹SPARK-35430*›
    ╰─$ minikube ssh "ls /var/folders/t_/fr_vqcyx23vftk81ftz1k5hwgn/T/tmp.k9X4Gecv"
    ╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/spark ‹SPARK-35430*›
    ╰─$
    ```
    But the test will fail, and after its execution `dmesg` shows the following error:
    ```
    [13670.493359] bpfilter: Loaded bpfilter_umh pid 66153
    [13670.493363] bpfilter: write fail -32
    [13670.530737] bpfilter: Loaded bpfilter_umh pid 66155
    ...
    ```
    This `bpfilter` is a firewall module, and we are back to a permission denied when we want to list the mounted directory.

    The solution is to add a spark user with uid 185 when minikube is started. **So this must be added to the Jenkins job (and the mount should use --gid=0 --uid=185)**:
    ```
    $ minikube ssh "sudo useradd spark -u 185 -g 0 -m -s /bin/bash"
    ```

    ### Why are the changes needed?

    This integration test is needed to validate the PVs feature.

    ### Does this PR introduce _any_ user-facing change?

    No. It is just testing.

    ### How was this patch tested?

    Running the test locally:
    ```
    KubernetesSuite:
    - Run SparkPi with no resources
    - Run SparkPi with a very long application name.
    - Use SparkLauncher.NO_RESOURCE
    - Run SparkPi with a master URL without a scheme.
    - Run SparkPi with an argument.
    - Run SparkPi with custom labels, annotations, and environment variables.
    - All pods have the same service account by default
    - Run extraJVMOptions check on driver
    - Run SparkRemoteFileTest using a remote data file
    - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
    - Run SparkPi with env and mount secrets.
    - Run PySpark on simple pi.py example
    - Run PySpark to test a pyfiles example
    - Run PySpark with memory customization
    - Run in client mode.
    - Start pod creation from template
    - PVs with local storage
    ```

    The "PVs with local storage" test was successful, but during the next test, `Launcher client dependencies`, minio stops the test executions on Mac (only on Mac):
    ```
    21/06/29 04:33:32.449 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: Starting tunnel for service minio-s3.
    21/06/29 04:33:33.425 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: |--|--|-||
    21/06/29 04:33:33.426 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: |NAMESPACE | NAME | TARGET PORT | URL |
    21/06/29 04:33:33.426 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: |--|--|-||
    21/06/29 04:33:33.426 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: |
[spark] branch branch-3.2 updated: [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new ea559ad  [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name
ea559ad is described below

commit ea559adc2ec1664939fcf1a23a194d4055ad9596
Author: Angerszh
AuthorDate: Tue Aug 3 00:08:13 2021 +0800

    [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name

    ### What changes were proposed in this pull request?

    Without this patch, the added UT fails as below:
    ```
    [info] - SHOW TABLES V2: SPARK-36086: CollapseProject project replace alias should use origin column name *** FAILED *** (4 seconds, 935 milliseconds)
    [info]   java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.CollapseProject in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.
    [info]   at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1217)
    [info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229)
    [info]   at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
    [info]   at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
    [info]   at scala.collection.immutable.List.foldLeft(List.scala:91)
    [info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
    [info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
    [info]   at scala.collection.immutable.List.foreach(List.scala:431)
    [info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
    [info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
    [info]   at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
    ```
    When CollapseProject replaces an attribute with its alias, it should keep the original column name.

    ### Why are the changes needed?

    Fix bug.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Added UT.

    Closes #33576 from AngersZh/SPARK-36086.

Authored-by: Angerszh
Signed-off-by: Wenchen Fan
(cherry picked from commit f3173956cbd64c056424b743aff8d17dd7c61fd7)
Signed-off-by: Wenchen Fan
---
 .../org/apache/spark/sql/catalyst/expressions/AliasHelper.scala   | 2 +-
 .../apache/spark/sql/catalyst/expressions/namedExpressions.scala  | 8
 .../spark/sql/catalyst/optimizer/CollapseProjectSuite.scala       | 9 +
 .../approved-plans-v1_4/q5.sf100/explain.txt                      | 8
 .../approved-plans-v1_4/q5.sf100/simplified.txt                   | 6 +++---
 .../tpcds-plan-stability/approved-plans-v1_4/q5/explain.txt       | 8
 .../tpcds-plan-stability/approved-plans-v1_4/q5/simplified.txt    | 6 +++---
 7 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
index 1f3f762..0007d38 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
@@ -72,7 +72,7 @@ trait AliasHelper {
     // Use transformUp to prevent infinite recursion when the replacement expression
     // redefines the same ExprId,
     trimNonTopLevelAliases(expr.transformUp {
-      case a: Attribute => aliasMap.getOrElse(a, a)
+      case a: Attribute => aliasMap.get(a).map(_.withName(a.name)).getOrElse(a)
     }).asInstanceOf[NamedExpression]
   }

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
index ae2c66c..71f193e 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
@@ -188,6 +188,14 @@ case class Alias(child: Expression, name: String)(
     }
   }

+  def withName(newName: String): NamedExpression = {
+    Alias(child, newName)(
+      exprId = exprId,
+      qualifier = qualifier,
+      explicitMetadata = explicitMetadata,
+      nonInheritableMetadataKeys = nonInheritableMetadataKeys)
+  }
+
   def newInstance(): NamedExpression =
     Alias(child, name)(
       qualifier = qualifier,
diff --git
[spark] branch master updated: [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f317395 [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name f317395 is described below commit f3173956cbd64c056424b743aff8d17dd7c61fd7 Author: Angerszh AuthorDate: Tue Aug 3 00:08:13 2021 +0800 [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name ### What changes were proposed in this pull request? Without this patch, the added UT fails as below ``` [info] - SHOW TABLES V2: SPARK-36086: CollapseProject project replace alias should use origin column name *** FAILED *** (4 seconds, 935 milliseconds) [info] java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.CollapseProject in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken. 
[info] at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1217) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229) [info] at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) [info] at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) [info] at scala.collection.immutable.List.foldLeft(List.scala:91) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200) [info] at scala.collection.immutable.List.foreach(List.scala:431) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179) [info] at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88) ``` CollapseProject project replace alias should use origin column name ### Why are the changes needed? Fix bug ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #33576 from AngersZh/SPARK-36086. 
Authored-by: Angerszh Signed-off-by: Wenchen Fan --- .../org/apache/spark/sql/catalyst/expressions/AliasHelper.scala | 2 +- .../apache/spark/sql/catalyst/expressions/namedExpressions.scala | 8 .../spark/sql/catalyst/optimizer/CollapseProjectSuite.scala | 9 + .../approved-plans-v1_4/q5.sf100/explain.txt | 8 .../approved-plans-v1_4/q5.sf100/simplified.txt | 6 +++--- .../tpcds-plan-stability/approved-plans-v1_4/q5/explain.txt | 8 .../tpcds-plan-stability/approved-plans-v1_4/q5/simplified.txt | 6 +++--- 7 files changed, 32 insertions(+), 15 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala index 1f3f762..0007d38 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala @@ -72,7 +72,7 @@ trait AliasHelper { // Use transformUp to prevent infinite recursion when the replacement expression // redefines the same ExprId, trimNonTopLevelAliases(expr.transformUp { - case a: Attribute => aliasMap.getOrElse(a, a) + case a: Attribute => aliasMap.get(a).map(_.withName(a.name)).getOrElse(a) }).asInstanceOf[NamedExpression] } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala index ae2c66c..71f193e 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala @@ -188,6 +188,14 @@ case class Alias(child: Expression, name: String)( } } + def withName(newName: String): NamedExpression = { +Alias(child, newName)( + exprId = exprId, + qualifier = qualifier, + explicitMetadata = explicitMetadata, + 
nonInheritableMetadataKeys = nonInheritableMetadataKeys) + } + def newInstance(): NamedExpression = Alias(child, name)( qualifier = qualifier, diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala
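The one-line `AliasHelper` change above can be illustrated with a small stand-alone sketch. This is a Python model with hypothetical names (the real code is the Scala `aliasMap.get(a).map(_.withName(a.name))` in `AliasHelper.scala`): when CollapseProject inlines an alias from the inner projection into the outer one, the replacement must keep the outer attribute's own name rather than leaking the inner alias's name.

```python
# Minimal Python model of the AliasHelper fix (hypothetical names; the real
# change is in the Scala file AliasHelper.scala). An Alias pairs a child
# expression with a display name; when an attribute is replaced by its
# aliased child, the origin column name must be preserved.
from dataclasses import dataclass


@dataclass(frozen=True)
class Alias:
    child: str  # the replacement expression, kept as a string for the sketch
    name: str   # the display name

    def with_name(self, new_name: str) -> "Alias":
        # Mirrors the new Alias.withName: same child, new display name.
        return Alias(self.child, new_name)


def replace_attribute(attr_name: str, alias_map: dict):
    """Old behavior: aliasMap.getOrElse(a, a), which could rename the output
    column; fixed behavior: re-apply the origin column name."""
    alias = alias_map.get(attr_name)
    if alias is None:
        return attr_name
    return alias.with_name(attr_name)  # keep the origin column name


alias_map = {"out_col": Alias("a + b", "inner_name")}
print(replace_attribute("out_col", alias_map).name)  # out_col
print(replace_attribute("untouched", alias_map))     # untouched
```

With the old `getOrElse` behavior the substituted expression would surface as `inner_name`, which is what broke the plan's structural integrity in the UT above.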
[spark] branch master updated (2f70077 -> 366f7fe)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 2f70077 [SPARK-36224][SQL] Use Void as the type name of NullType add 366f7fe [SPARK-36382][WEBUI] Remove noisy footer from the summary table for metrics No new revisions were added by this update. Summary of changes: core/src/main/resources/org/apache/spark/ui/static/stagepage.js | 1 + 1 file changed, 1 insertion(+) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.2 updated: [SPARK-36224][SQL] Use Void as the type name of NullType
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new e26cb96 [SPARK-36224][SQL] Use Void as the type name of NullType e26cb96 is described below commit e26cb968bdaff1cce1d5c050226eac1d01e3e947 Author: Linhong Liu AuthorDate: Mon Aug 2 23:19:54 2021 +0800 [SPARK-36224][SQL] Use Void as the type name of NullType ### What changes were proposed in this pull request? Change `NullType.simpleString` to "void" to set "void" as the formal type name of `NullType` ### Why are the changes needed? This PR is intended to address the type name discussion in PR #28833. Here are the reasons: 1. The type name of NullType is displayed everywhere, e.g. schema string, error message, document. Hence it's not possible to hide it from users; we have to choose a proper name 2. "void" is widely used as the type name of "NULL", e.g. Hive, pgSQL 3. Changing to "void" can enable the round trip of `toDDL`/`fromDDL` for NullType (i.e. make `from_json(col, schema.toDDL)` work) ### Does this PR introduce _any_ user-facing change? Yes, the type name of "NULL" is changed from "null" to "void". For example: ``` scala> sql("select null as a, 1 as b").schema.catalogString res5: String = struct<a:void,b:int> ``` ### How was this patch tested? existing test cases Closes #33437 from linhongliu-db/SPARK-36224-void-type-name. 
Authored-by: Linhong Liu Signed-off-by: Wenchen Fan (cherry picked from commit 2f700773c2e8fac26661d0aa8024253556a921ba) Signed-off-by: Wenchen Fan --- python/pyspark/sql/tests/test_types.py | 3 +-- python/pyspark/sql/types.py | 4 +++- .../scala/org/apache/spark/sql/types/DataType.scala | 2 ++ .../scala/org/apache/spark/sql/types/NullType.scala | 2 ++ .../org/apache/spark/sql/types/DataTypeSuite.scala | 6 ++ .../sql-functions/sql-expression-schema.md | 6 +++--- .../sql-tests/results/ansi/literals.sql.out | 2 +- .../sql-tests/results/ansi/string-functions.sql.out | 4 ++-- .../sql-tests/results/inline-table.sql.out | 2 +- .../resources/sql-tests/results/literals.sql.out| 2 +- .../sql-tests/results/misc-functions.sql.out| 4 ++-- .../sql-tests/results/postgreSQL/select.sql.out | 4 ++-- .../sql-tests/results/postgreSQL/text.sql.out | 6 +++--- .../results/sql-compatibility-functions.sql.out | 6 +++--- .../results/table-valued-functions.sql.out | 2 +- .../sql-tests/results/udf/udf-inline-table.sql.out | 2 +- .../apache/spark/sql/FileBasedDataSourceSuite.scala | 2 +- .../SparkExecuteStatementOperation.scala| 1 - .../spark/sql/hive/client/HiveClientImpl.scala | 21 + .../spark/sql/hive/execution/HiveDDLSuite.scala | 12 ++-- .../spark/sql/hive/orc/HiveOrcSourceSuite.scala | 2 +- 21 files changed, 43 insertions(+), 52 deletions(-) diff --git a/python/pyspark/sql/tests/test_types.py b/python/pyspark/sql/tests/test_types.py index eb4caf0..33fe785 100644 --- a/python/pyspark/sql/tests/test_types.py +++ b/python/pyspark/sql/tests/test_types.py @@ -496,8 +496,7 @@ class TypesTests(ReusedSQLTestCase): def test_parse_datatype_string(self): from pyspark.sql.types import _all_atomic_types, _parse_datatype_string for k, t in _all_atomic_types.items(): -if t != NullType: -self.assertEqual(t(), _parse_datatype_string(k)) +self.assertEqual(t(), _parse_datatype_string(k)) self.assertEqual(IntegerType(), _parse_datatype_string("int")) self.assertEqual(DecimalType(1, 1), 
_parse_datatype_string("decimal(1 ,1)")) self.assertEqual(DecimalType(10, 1), _parse_datatype_string("decimal( 10,1 )")) diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py index 4b5632b..5e3398b 100644 --- a/python/pyspark/sql/types.py +++ b/python/pyspark/sql/types.py @@ -107,7 +107,9 @@ class NullType(DataType, metaclass=DataTypeSingleton): The data type representing None, used for the types that cannot be inferred. """ -pass +@classmethod +def typeName(cls): +return 'void' class AtomicType(DataType): diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala index ff6a49a..585045d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala @@ -195,6 +195,8 @@ object DataType { case FIXED_DECIMAL(precision, scale) => DecimalType(precision.toInt, scale.toInt) case CHAR_TYPE(length) => CharType(length.toInt)
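The Python side of the change quoted in the diff above is easy to model outside Spark. A minimal sketch of the `typeName` override in `python/pyspark/sql/types.py`, using simplified stand-in classes rather than the real pyspark ones:

```python
# Stand-alone sketch of the pyspark.sql.types change: NullType overrides
# typeName() to report "void". These classes are simplified stand-ins for
# DataType and its subclasses, not the real pyspark implementations.
class DataType:
    @classmethod
    def type_name(cls) -> str:
        # Default: derive the name from the class name, as pyspark does
        # (e.g. IntegerType -> "integer").
        name = cls.__name__.lower()
        return name[:-len("type")] if name.endswith("type") else name


class IntegerType(DataType):
    pass


class NullType(DataType):
    @classmethod
    def type_name(cls) -> str:
        # The commit hard-codes "void" instead of the derived "null".
        return "void"


print(IntegerType.type_name())  # integer
print(NullType.type_name())     # void
```

Hard-coding the override (rather than renaming the class) keeps `NullType` as the public class name while making the string form round-trippable through `_parse_datatype_string("void")`.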
[spark] branch master updated (a98d919 -> 2f70077)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from a98d919 [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum add 2f70077 [SPARK-36224][SQL] Use Void as the type name of NullType No new revisions were added by this update. Summary of changes: python/pyspark/sql/tests/test_types.py | 3 +-- python/pyspark/sql/types.py | 4 +++- .../scala/org/apache/spark/sql/types/DataType.scala | 2 ++ .../scala/org/apache/spark/sql/types/NullType.scala | 2 ++ .../org/apache/spark/sql/types/DataTypeSuite.scala | 6 ++ .../sql-functions/sql-expression-schema.md | 6 +++--- .../sql-tests/results/ansi/literals.sql.out | 2 +- .../sql-tests/results/ansi/string-functions.sql.out | 4 ++-- .../sql-tests/results/inline-table.sql.out | 2 +- .../resources/sql-tests/results/literals.sql.out| 2 +- .../sql-tests/results/misc-functions.sql.out| 4 ++-- .../sql-tests/results/postgreSQL/select.sql.out | 4 ++-- .../sql-tests/results/postgreSQL/text.sql.out | 6 +++--- .../results/sql-compatibility-functions.sql.out | 6 +++--- .../results/table-valued-functions.sql.out | 2 +- .../sql-tests/results/udf/udf-inline-table.sql.out | 2 +- .../apache/spark/sql/FileBasedDataSourceSuite.scala | 2 +- .../SparkExecuteStatementOperation.scala| 1 - .../spark/sql/hive/client/HiveClientImpl.scala | 21 + .../spark/sql/hive/execution/HiveDDLSuite.scala | 12 ++-- .../spark/sql/hive/orc/HiveOrcSourceSuite.scala | 2 +- 21 files changed, 43 insertions(+), 52 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum
This is an automated email from the ASF dual-hosted git repository. mridulm80 pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new df43300 [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum df43300 is described below commit df433002277911c2e66bcf00a17b636a317b4a78 Author: yi.wu AuthorDate: Mon Aug 2 09:58:36 2021 -0500 [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum ### What changes were proposed in this pull request? This PR adds support to diagnose shuffle data corruption. Basically, the diagnosis mechanism works like this: The shuffle reader would calculate the checksum (c1) for the corrupted shuffle block and send it to the server where the block is stored. At the server, it would read back the checksum (c2) that is stored in the checksum file and recalculate the checksum (c3) for the corresponding shuffle block. Then, if c2 != c3, we suspect the corruption is caused by a disk issue. Otherwise, if c1 != c3, we suspect the corruption is caused by a network issue. Otherwise, the checksum verifies pa [...] After the shuffle reader receives the diagnosis response, it'd take action based on the type of cause. Only in the case of a network issue would we give a retry. Otherwise, we'd throw the fetch failure directly. Also note that, if the corruption happens inside BufferReleasingInputStream, the reducer will throw the fetch failure immediately no matter what the cause is, since the data has been partially consumed by downstream RDDs. If corruption happens again after retry, the reducer will [...] Please check out https://github.com/apache/spark/pull/32385 to see the completed proposal of the shuffle checksum project. ### Why are the changes needed? Shuffle data corruption is a long-standing issue in Spark. 
For example, in SPARK-18105, people have continually reported corruption issues. However, data corruption is difficult to reproduce in most cases and even harder to trace to a root cause. We don't know if it's a Spark issue or not. With the diagnosis support for shuffle corruption, Spark itself can at least distinguish whether the cause is disk or network, which is very important for users. ### Does this PR introduce _any_ user-facing change? Yes, users may know the cause of the shuffle corruption after this change. ### How was this patch tested? Added tests. Closes #33451 from Ngone51/SPARK-36206. Authored-by: yi.wu Signed-off-by: Mridul Muralidharan (cherry picked from commit a98d919da470abaf2e99060f99007a5373032fe1) Signed-off-by: Mridul Muralidharan --- .../spark/network/shuffle/BlockStoreClient.java| 45 +- .../network/shuffle/ExternalBlockHandler.java | 13 +- .../network/shuffle/ExternalBlockStoreClient.java | 22 +-- .../shuffle/ExternalShuffleBlockResolver.java | 25 .../spark/network/shuffle/checksum/Cause.java | 25 .../shuffle/checksum/ShuffleChecksumHelper.java| 158 + .../shuffle/protocol/BlockTransferMessage.java | 4 +- .../network/shuffle/protocol/CorruptionCause.java | 74 ++ .../shuffle/protocol/DiagnoseCorruption.java | 131 + .../network/shuffle/ExternalBlockHandlerSuite.java | 115 ++- .../shuffle/checksum/ShuffleChecksumHelper.java| 100 - .../shuffle/checksum/ShuffleChecksumSupport.java | 45 ++ .../shuffle/sort/BypassMergeSortShuffleWriter.java | 15 +- .../spark/shuffle/sort/ShuffleExternalSorter.java | 9 +- .../spark/shuffle/sort/UnsafeShuffleWriter.java| 2 +- .../org/apache/spark/internal/config/package.scala | 4 +- .../apache/spark/network/BlockDataManager.scala| 9 ++ .../spark/network/netty/NettyBlockRpcServer.scala | 7 + .../network/netty/NettyBlockTransferService.scala | 3 +- .../spark/shuffle/BlockStoreShuffleReader.scala| 2 + .../spark/shuffle/IndexShuffleBlockResolver.scala | 14 +- .../scala/org/apache/spark/storage/BlockId.scala | 2 +- 
.../org/apache/spark/storage/BlockManager.scala| 25 +++- .../storage/ShuffleBlockFetcherIterator.scala | 129 +++-- .../spark/util/collection/ExternalSorter.scala | 9 +- .../shuffle/sort/UnsafeShuffleWriterSuite.java | 25 ++-- .../test/scala/org/apache/spark/ShuffleSuite.scala | 37 - .../spark/shuffle/ShuffleChecksumTestHelper.scala | 11 +- .../sort/BypassMergeSortShuffleWriterSuite.scala | 13 +- .../sort/IndexShuffleBlockResolverSuite.scala | 2 +- .../shuffle/sort/SortShuffleWriterSuite.scala | 6 +- .../storage/ShuffleBlockFetcherIteratorSuite.scala | 82 ++- 32 files changed, 973 insertions(+), 190 deletions(-) diff --git
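The diagnosis protocol described in the commit message reduces to a three-checksum comparison. A minimal Python sketch (illustrative names; the real logic lives in `ShuffleChecksumHelper` and the shuffle reader): c1 is the checksum the reader computes over the corrupt block, c2 the checksum stored in the server-side checksum file, and c3 the checksum the server recomputes over the stored block.

```python
# Sketch of the three-checksum shuffle-corruption diagnosis. Names are
# illustrative, not the real Spark API.
DISK_ISSUE, NETWORK_ISSUE, CHECKSUM_VERIFY_PASS = "disk", "network", "pass"


def diagnose(c1: int, c2: int, c3: int) -> str:
    if c2 != c3:
        # Stored checksum no longer matches the stored data: disk issue.
        return DISK_ISSUE
    if c1 != c3:
        # Server-side data is intact but the reader saw different bytes,
        # so the corruption happened in transit: network issue.
        return NETWORK_ISSUE
    return CHECKSUM_VERIFY_PASS


def should_retry(cause: str) -> bool:
    # Only a network-caused corruption gets a retry; otherwise the reader
    # throws the fetch failure directly.
    return cause == NETWORK_ISSUE


print(diagnose(c1=111, c2=222, c3=333))  # disk
print(diagnose(c1=111, c2=333, c3=333))  # network
print(diagnose(c1=333, c2=333, c3=333))  # pass
```

The ordering of the checks matters: a disk mismatch (c2 != c3) is ruled out first, because a stale stored checksum would otherwise be misread as a transit error and trigger a useless retry.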
[spark] branch master updated: [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum
This is an automated email from the ASF dual-hosted git repository. mridulm80 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a98d919 [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum a98d919 is described below commit a98d919da470abaf2e99060f99007a5373032fe1 Author: yi.wu AuthorDate: Mon Aug 2 09:58:36 2021 -0500 [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum ### What changes were proposed in this pull request? This PR adds support to diagnose shuffle data corruption. Basically, the diagnosis mechanism works like this: The shuffle reader would calculate the checksum (c1) for the corrupted shuffle block and send it to the server where the block is stored. At the server, it would read back the checksum (c2) that is stored in the checksum file and recalculate the checksum (c3) for the corresponding shuffle block. Then, if c2 != c3, we suspect the corruption is caused by a disk issue. Otherwise, if c1 != c3, we suspect the corruption is caused by a network issue. Otherwise, the checksum verifies pa [...] After the shuffle reader receives the diagnosis response, it'd take action based on the type of cause. Only in the case of a network issue would we give a retry. Otherwise, we'd throw the fetch failure directly. Also note that, if the corruption happens inside BufferReleasingInputStream, the reducer will throw the fetch failure immediately no matter what the cause is, since the data has been partially consumed by downstream RDDs. If corruption happens again after retry, the reducer will [...] Please check out https://github.com/apache/spark/pull/32385 to see the completed proposal of the shuffle checksum project. ### Why are the changes needed? Shuffle data corruption is a long-standing issue in Spark. For example, in SPARK-18105, people have continually reported corruption issues. 
However, data corruption is difficult to reproduce in most cases and even harder to trace to a root cause. We don't know if it's a Spark issue or not. With the diagnosis support for shuffle corruption, Spark itself can at least distinguish whether the cause is disk or network, which is very important for users. ### Does this PR introduce _any_ user-facing change? Yes, users may know the cause of the shuffle corruption after this change. ### How was this patch tested? Added tests. Closes #33451 from Ngone51/SPARK-36206. Authored-by: yi.wu Signed-off-by: Mridul Muralidharan --- .../spark/network/shuffle/BlockStoreClient.java| 45 +- .../network/shuffle/ExternalBlockHandler.java | 13 +- .../network/shuffle/ExternalBlockStoreClient.java | 22 +-- .../shuffle/ExternalShuffleBlockResolver.java | 25 .../spark/network/shuffle/checksum/Cause.java | 25 .../shuffle/checksum/ShuffleChecksumHelper.java| 158 + .../shuffle/protocol/BlockTransferMessage.java | 4 +- .../network/shuffle/protocol/CorruptionCause.java | 74 ++ .../shuffle/protocol/DiagnoseCorruption.java | 131 + .../network/shuffle/ExternalBlockHandlerSuite.java | 115 ++- .../shuffle/checksum/ShuffleChecksumHelper.java| 100 - .../shuffle/checksum/ShuffleChecksumSupport.java | 45 ++ .../shuffle/sort/BypassMergeSortShuffleWriter.java | 15 +- .../spark/shuffle/sort/ShuffleExternalSorter.java | 9 +- .../spark/shuffle/sort/UnsafeShuffleWriter.java| 2 +- .../org/apache/spark/internal/config/package.scala | 4 +- .../apache/spark/network/BlockDataManager.scala| 9 ++ .../spark/network/netty/NettyBlockRpcServer.scala | 7 + .../network/netty/NettyBlockTransferService.scala | 3 +- .../spark/shuffle/BlockStoreShuffleReader.scala| 2 + .../spark/shuffle/IndexShuffleBlockResolver.scala | 14 +- .../scala/org/apache/spark/storage/BlockId.scala | 2 +- .../org/apache/spark/storage/BlockManager.scala| 25 +++- .../storage/ShuffleBlockFetcherIterator.scala | 129 +++-- .../spark/util/collection/ExternalSorter.scala | 9 +- 
.../shuffle/sort/UnsafeShuffleWriterSuite.java | 25 ++-- .../test/scala/org/apache/spark/ShuffleSuite.scala | 37 - .../spark/shuffle/ShuffleChecksumTestHelper.scala | 11 +- .../sort/BypassMergeSortShuffleWriterSuite.scala | 13 +- .../sort/IndexShuffleBlockResolverSuite.scala | 2 +- .../shuffle/sort/SortShuffleWriterSuite.scala | 6 +- .../storage/ShuffleBlockFetcherIteratorSuite.scala | 82 ++- 32 files changed, 973 insertions(+), 190 deletions(-) diff --git a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/BlockStoreClient.java
[spark] branch master updated (951efb8 -> be06e41)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 951efb8 [SPARK-36237][UI][SQL] Attach and start handler after application started in UI add be06e41 [SPARK-35918][AVRO] Unify schema mismatch handling for read/write and enhance error messages No new revisions were added by this update. Summary of changes: .../apache/spark/sql/avro/AvroDeserializer.scala | 54 ++-- .../org/apache/spark/sql/avro/AvroSerializer.scala | 35 +++--- .../org/apache/spark/sql/avro/AvroUtils.scala | 75 +- .../spark/sql/avro/AvroSchemaHelperSuite.scala | 59 +++-- .../org/apache/spark/sql/avro/AvroSerdeSuite.scala | 46 +++-- .../org/apache/spark/sql/avro/AvroSuite.scala | 3 +- 6 files changed, 173 insertions(+), 99 deletions(-)
[spark] branch master updated (3b713e7 -> 951efb8)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 3b713e7 [SPARK-36372][SQL] v2 ALTER TABLE ADD COLUMNS should check duplicates for the user specified columns add 951efb8 [SPARK-36237][UI][SQL] Attach and start handler after application started in UI No new revisions were added by this update. Summary of changes: .../main/scala/org/apache/spark/SparkContext.scala | 7 ++-- .../main/scala/org/apache/spark/TestUtils.scala| 13 +++ .../main/scala/org/apache/spark/ui/SparkUI.scala | 40 ++ .../src/main/scala/org/apache/spark/ui/WebUI.scala | 13 --- .../test/scala/org/apache/spark/ui/UISuite.scala | 22 5 files changed, 89 insertions(+), 6 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-36372][SQL] v2 ALTER TABLE ADD COLUMNS should check duplicates for the user specified columns
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 87ae397 [SPARK-36372][SQL] v2 ALTER TABLE ADD COLUMNS should check duplicates for the user specified columns 87ae397 is described below commit 87ae3978971c46a17f8a550f9d3e6934a74cc3a4 Author: Terry Kim AuthorDate: Mon Aug 2 17:54:50 2021 +0800 [SPARK-36372][SQL] v2 ALTER TABLE ADD COLUMNS should check duplicates for the user specified columns ### What changes were proposed in this pull request? Currently, v2 ALTER TABLE ADD COLUMNS does not check duplicates for the user specified columns. For example, ``` spark.sql(s"CREATE TABLE $t (id int) USING $v2Format") spark.sql("ALTER TABLE $t ADD COLUMNS (data string, data string)") ``` doesn't fail the analysis, and it's up to the catalog implementation to handle it. For the v1 command, the duplication is checked before invoking the catalog. ### Why are the changes needed? To check the duplicate columns during analysis and be consistent with the v1 command. ### Does this PR introduce _any_ user-facing change? Yes, now the above command will print out the following: ``` org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data` ``` ### How was this patch tested? Added new unit tests Closes #33600 from imback82/alter_add_duplicate_columns. 
Authored-by: Terry Kim Signed-off-by: Wenchen Fan (cherry picked from commit 3b713e7f6189dfe1c5bbb1a527bf1266bde69f69) Signed-off-by: Wenchen Fan --- .../sql/catalyst/analysis/CheckAnalysis.scala | 5 .../spark/sql/connector/AlterTableTests.scala | 23 .../connector/V2CommandsCaseSensitivitySuite.scala | 32 -- 3 files changed, 57 insertions(+), 3 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index 2d8ac64..77f721c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -30,6 +30,7 @@ import org.apache.spark.sql.connector.catalog.{LookupCatalog, SupportsPartitionM import org.apache.spark.sql.errors.{QueryCompilationErrors, QueryExecutionErrors} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ +import org.apache.spark.sql.util.SchemaUtils /** * Throws user facing errors when passed invalid queries that fail to analyze. 
@@ -951,6 +952,10 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog { colsToAdd.foreach { colToAdd => checkColumnNotExists("add", colToAdd.name, table.schema) } +SchemaUtils.checkColumnNameDuplication( + colsToAdd.map(_.name.quoted), + "in the user specified columns", + alter.conf.resolver) case AlterTableRenameColumn(table: ResolvedTable, col: ResolvedFieldName, newName) => checkColumnNotExists("rename", col.path :+ newName, table.schema) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala b/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala index 004a64a..1bd45f5 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala @@ -384,6 +384,29 @@ trait AlterTableTests extends SharedSparkSession { } } + test("SPARK-36372: Adding duplicate columns should not be allowed") { +val t = s"${catalogAndNamespace}table_name" +withTable(t) { + sql(s"CREATE TABLE $t (id int) USING $v2Format") + val e = intercept[AnalysisException] { +sql(s"ALTER TABLE $t ADD COLUMNS (data string, data1 string, data string)") + } + assert(e.message.contains("Found duplicate column(s) in the user specified columns: `data`")) +} + } + + test("SPARK-36372: Adding duplicate nested columns should not be allowed") { +val t = s"${catalogAndNamespace}table_name" +withTable(t) { + sql(s"CREATE TABLE $t (id int, point struct) USING $v2Format") + val e = intercept[AnalysisException] { +sql(s"ALTER TABLE $t ADD COLUMNS (point.z double, point.z double, point.xx double)") + } + assert(e.message.contains( +"Found duplicate column(s) in the user specified columns: `point.z`")) +} + } + test("AlterTable: update column type int -> long") { val t = s"${catalogAndNamespace}table_name" withTable(t) { diff --git
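The analysis-time check added above can be sketched in a few lines of Python. This is a hypothetical helper mirroring `SchemaUtils.checkColumnNameDuplication` (Spark's default resolver is case-insensitive, `spark.sql.caseSensitive=false`), not the real implementation:

```python
# Hedged sketch of the duplicate-column check that SPARK-36372 runs during
# analysis. The helper name and signature are illustrative only.
from collections import Counter


class AnalysisException(Exception):
    pass


def check_column_name_duplication(names, case_sensitive=False):
    # With a case-insensitive resolver, "a" and "A" count as duplicates.
    keyed = names if case_sensitive else [n.lower() for n in names]
    dups = sorted({k for k, c in Counter(keyed).items() if c > 1})
    if dups:
        cols = ", ".join(f"`{d}`" for d in dups)
        raise AnalysisException(
            f"Found duplicate column(s) in the user specified columns: {cols}")


check_column_name_duplication(["data", "data1"])  # passes silently
try:
    check_column_name_duplication(["point.z", "point.z", "point.xx"])
except AnalysisException as e:
    print(e)  # Found duplicate column(s) in the user specified columns: `point.z`
```

Running the check over the user-specified column list before the catalog is invoked is what makes the v2 command fail analysis the same way the v1 command already did.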
[spark] branch master updated: [SPARK-36372][SQL] v2 ALTER TABLE ADD COLUMNS should check duplicates for the user specified columns
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3b713e7 [SPARK-36372][SQL] v2 ALTER TABLE ADD COLUMNS should check duplicates for the user specified columns 3b713e7 is described below commit 3b713e7f6189dfe1c5bbb1a527bf1266bde69f69 Author: Terry Kim AuthorDate: Mon Aug 2 17:54:50 2021 +0800 [SPARK-36372][SQL] v2 ALTER TABLE ADD COLUMNS should check duplicates for the user specified columns ### What changes were proposed in this pull request? Currently, v2 ALTER TABLE ADD COLUMNS does not check duplicates for the user specified columns. For example, ``` spark.sql(s"CREATE TABLE $t (id int) USING $v2Format") spark.sql("ALTER TABLE $t ADD COLUMNS (data string, data string)") ``` doesn't fail the analysis, and it's up to the catalog implementation to handle it. For the v1 command, the duplication is checked before invoking the catalog. ### Why are the changes needed? To check the duplicate columns during analysis and be consistent with the v1 command. ### Does this PR introduce _any_ user-facing change? Yes, now the above command will print out the following: ``` org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data` ``` ### How was this patch tested? Added new unit tests Closes #33600 from imback82/alter_add_duplicate_columns. 
Authored-by: Terry Kim Signed-off-by: Wenchen Fan --- .../sql/catalyst/analysis/CheckAnalysis.scala | 5 .../spark/sql/connector/AlterTableTests.scala | 23 .../connector/V2CommandsCaseSensitivitySuite.scala | 32 -- 3 files changed, 57 insertions(+), 3 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index 2d8ac64..77f721c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -30,6 +30,7 @@ import org.apache.spark.sql.connector.catalog.{LookupCatalog, SupportsPartitionM import org.apache.spark.sql.errors.{QueryCompilationErrors, QueryExecutionErrors} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types._ +import org.apache.spark.sql.util.SchemaUtils /** * Throws user facing errors when passed invalid queries that fail to analyze. 
@@ -951,6 +952,10 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog { colsToAdd.foreach { colToAdd => checkColumnNotExists("add", colToAdd.name, table.schema) } +SchemaUtils.checkColumnNameDuplication( + colsToAdd.map(_.name.quoted), + "in the user specified columns", + alter.conf.resolver) case AlterTableRenameColumn(table: ResolvedTable, col: ResolvedFieldName, newName) => checkColumnNotExists("rename", col.path :+ newName, table.schema) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala b/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala index 004a64a..1bd45f5 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala @@ -384,6 +384,29 @@ trait AlterTableTests extends SharedSparkSession { } } + test("SPARK-36372: Adding duplicate columns should not be allowed") { +val t = s"${catalogAndNamespace}table_name" +withTable(t) { + sql(s"CREATE TABLE $t (id int) USING $v2Format") + val e = intercept[AnalysisException] { +sql(s"ALTER TABLE $t ADD COLUMNS (data string, data1 string, data string)") + } + assert(e.message.contains("Found duplicate column(s) in the user specified columns: `data`")) +} + } + + test("SPARK-36372: Adding duplicate nested columns should not be allowed") { +val t = s"${catalogAndNamespace}table_name" +withTable(t) { + sql(s"CREATE TABLE $t (id int, point struct) USING $v2Format") + val e = intercept[AnalysisException] { +sql(s"ALTER TABLE $t ADD COLUMNS (point.z double, point.z double, point.xx double)") + } + assert(e.message.contains( +"Found duplicate column(s) in the user specified columns: `point.z`")) +} + } + test("AlterTable: update column type int -> long") { val t = s"${catalogAndNamespace}table_name" withTable(t) { diff --git a/sql/core/src/test/scala/org/apache/spark/sql/connector/V2CommandsCaseSensitivitySuite.scala