[spark] branch master updated (00b63c8 -> bdd8e1d)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 00b63c8  [SPARK-27991][CORE] Defer the fetch request on Netty OOM
     add bdd8e1d  [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory

No new revisions were added by this update.

Summary of changes:
 docs/sql-migration-guide.md                        |  2 ++
 .../org/apache/spark/sql/internal/SQLConf.scala    | 12
 .../sql/execution/command/DataWritingCommand.scala | 28 +-
 .../execution/command/createDataSourceTables.scala |  4 +++
 .../sql/sources/CreateTableAsSelectSuite.scala     | 17 ++-
 .../execution/CreateHiveTableAsSelectCommand.scala |  4 +++
 .../spark/sql/hive/execution/SQLQuerySuite.scala   | 33 ++
 7 files changed, 92 insertions(+), 8 deletions(-)

--
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
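The SPARK-28551 change above makes CTAS with an explicit LOCATION refuse a non-empty target directory. The commit itself is not shown here, but the guard it describes can be sketched as a small illustrative check (the function name and error message are hypothetical, not the actual Spark implementation):

```python
import os


def assert_location_empty(path):
    """Refuse to run CREATE TABLE ... AS SELECT into a LOCATION that
    already contains files, mirroring the behavior described above.
    Illustrative sketch only; names are not from the Spark codebase."""
    if os.path.isdir(path) and os.listdir(path):
        raise ValueError(
            f"CREATE TABLE ... AS SELECT: location {path} is not empty")
```

An empty or nonexistent directory passes silently; any existing file in the target raises before data is written.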
[spark] branch master updated (de59e01 -> 00b63c8)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from de59e01  [SPARK-35443][K8S] Mark K8s ConfigMaps and Secrets created by Spark as immutable
     add 00b63c8  [SPARK-27991][CORE] Defer the fetch request on Netty OOM

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/network/util/NettyUtils.java  |   4 +
 .../executor/CoarseGrainedExecutorBackend.scala    |   9 ++
 .../org/apache/spark/internal/config/package.scala |   9 ++
 .../spark/shuffle/BlockStoreShuffleReader.scala    |   1 +
 .../storage/ShuffleBlockFetcherIterator.scala      | 153 ++---
 .../storage/ShuffleBlockFetcherIteratorSuite.scala | 132 ++
 6 files changed, 291 insertions(+), 17 deletions(-)
[spark] branch master updated (a970f85 -> de59e01)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from a970f85  [SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures
     add de59e01  [SPARK-35443][K8S] Mark K8s ConfigMaps and Secrets created by Spark as immutable

No new revisions were added by this update.

Summary of changes:
 .../deploy/k8s/features/DriverKubernetesCredentialsFeatureStep.scala   | 1 +
 .../apache/spark/deploy/k8s/features/HadoopConfDriverFeatureStep.scala | 1 +
 .../spark/deploy/k8s/features/KerberosConfDriverFeatureStep.scala      | 3 +++
 .../apache/spark/deploy/k8s/features/PodTemplateConfigMapStep.scala    | 1 +
 .../org/apache/spark/deploy/k8s/submit/KubernetesClientUtils.scala     | 1 +
 .../spark/deploy/k8s/features/PodTemplateConfigMapStepSuite.scala      | 1 +
 .../test/scala/org/apache/spark/deploy/k8s/submit/ClientSuite.scala    | 1 +
 7 files changed, 9 insertions(+)
svn commit: r47830 - /release/spark/spark-2.4.7/
Author: dongjoon
Date: Thu May 20 03:14:53 2021
New Revision: 47830

Log:
Remove Apache Spark 2.4.7 after 2.4.8 release

Removed:
    release/spark/spark-2.4.7/
[spark] branch master updated (7eaabf4 -> a970f85)
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 7eaabf4  [SPARK-35408][PYTHON][FOLLOW-UP] Avoid unnecessary f-string format
     add a970f85  [SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures

No new revisions were added by this update.

Summary of changes:
 dev/sparktestsupport/modules.py                    |    6 +
 python/pyspark/pandas/base.py                      |  268 ++-
 .../tests => pandas/data_type_ops}/__init__.py     |    0
 python/pyspark/pandas/data_type_ops/base.py        |  120 +++
 python/pyspark/pandas/data_type_ops/boolean_ops.py |   28 ++
 .../pandas/data_type_ops/categorical_ops.py        |   28 ++
 python/pyspark/pandas/data_type_ops/date_ops.py    |   71
 .../pyspark/pandas/data_type_ops/datetime_ops.py   |   72
 python/pyspark/pandas/data_type_ops/num_ops.py     |  378 +
 python/pyspark/pandas/data_type_ops/string_ops.py  |  104 ++
 .../tests/data_type_ops}/__init__.py               |    0
 .../pandas/tests/data_type_ops/test_boolean_ops.py |  150
 .../tests/data_type_ops/test_categorical_ops.py    |  128 +++
 .../pandas/tests/data_type_ops/test_date_ops.py    |  158 +
 .../tests/data_type_ops/test_datetime_ops.py       |  160 +
 .../pandas/tests/data_type_ops/test_num_ops.py     |  195 +++
 .../pandas/tests/data_type_ops/test_string_ops.py  |  140
 .../pandas/tests/data_type_ops/testing_utils.py    |   75
 .../pyspark/pandas/tests/indexes/test_datetime.py  |   10 +-
 python/pyspark/pandas/tests/test_dataframe.py      |    4 +-
 .../pyspark/pandas/tests/test_series_datetime.py   |   10 +-
 python/pyspark/testing/pandasutils.py              |    2 +-
 python/setup.py                                    |    1 +
 23 files changed, 1849 insertions(+), 259 deletions(-)
 copy python/pyspark/{streaming/tests => pandas/data_type_ops}/__init__.py (100%)
 create mode 100644 python/pyspark/pandas/data_type_ops/base.py
 create mode 100644 python/pyspark/pandas/data_type_ops/boolean_ops.py
 create mode 100644 python/pyspark/pandas/data_type_ops/categorical_ops.py
 create mode 100644 python/pyspark/pandas/data_type_ops/date_ops.py
 create mode 100644 python/pyspark/pandas/data_type_ops/datetime_ops.py
 create mode 100644 python/pyspark/pandas/data_type_ops/num_ops.py
 create mode 100644 python/pyspark/pandas/data_type_ops/string_ops.py
 copy python/pyspark/{streaming/tests => pandas/tests/data_type_ops}/__init__.py (100%)
 create mode 100644 python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py
 create mode 100644 python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py
 create mode 100644 python/pyspark/pandas/tests/data_type_ops/test_date_ops.py
 create mode 100644 python/pyspark/pandas/tests/data_type_ops/test_datetime_ops.py
 create mode 100644 python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
 create mode 100644 python/pyspark/pandas/tests/data_type_ops/test_string_ops.py
 create mode 100644 python/pyspark/pandas/tests/data_type_ops/testing_utils.py
[spark] branch master updated (c064805 -> 7eaabf4)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from c064805  [SPARK-35450][INFRA] Follow checkout-merge way to use the latest commit for linter, or other workflows
     add 7eaabf4  [SPARK-35408][PYTHON][FOLLOW-UP] Avoid unnecessary f-string format

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/dataframe.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
[spark] branch master updated (d44e6c7 -> c064805)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from d44e6c7  Revert "[SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures"
     add c064805  [SPARK-35450][INFRA] Follow checkout-merge way to use the latest commit for linter, or other workflows

No new revisions were added by this update.

Summary of changes:
 .github/workflows/build_and_test.yml | 55
 1 file changed, 55 insertions(+)
[spark] branch master updated (586caae -> d44e6c7)
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 586caae  [SPARK-35438][SQL][DOCS] Minor documentation fix for window physical operator
     add d44e6c7  Revert "[SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures"

No new revisions were added by this update.

Summary of changes:
 dev/sparktestsupport/modules.py                    |    6 -
 python/pyspark/pandas/base.py                      |  268 +--
 python/pyspark/pandas/data_type_ops/__init__.py    |   16 -
 python/pyspark/pandas/data_type_ops/base.py        |  120 ---
 python/pyspark/pandas/data_type_ops/boolean_ops.py |   28 --
 .../pandas/data_type_ops/categorical_ops.py        |   28 --
 python/pyspark/pandas/data_type_ops/date_ops.py    |   71
 .../pyspark/pandas/data_type_ops/datetime_ops.py   |   72
 python/pyspark/pandas/data_type_ops/num_ops.py     |  378 -
 python/pyspark/pandas/data_type_ops/string_ops.py  |  104 --
 .../pyspark/pandas/tests/data_type_ops/__init__.py |   16 -
 .../pandas/tests/data_type_ops/test_boolean_ops.py |  150
 .../tests/data_type_ops/test_categorical_ops.py    |  128 ---
 .../pandas/tests/data_type_ops/test_date_ops.py    |  158 -
 .../tests/data_type_ops/test_datetime_ops.py       |  160 -
 .../pandas/tests/data_type_ops/test_num_ops.py     |  195 ---
 .../pandas/tests/data_type_ops/test_string_ops.py  |  140
 .../pandas/tests/data_type_ops/testing_utils.py    |   75
 .../pyspark/pandas/tests/indexes/test_datetime.py  |   10 +-
 python/pyspark/pandas/tests/test_dataframe.py      |    4 +-
 .../pyspark/pandas/tests/test_series_datetime.py   |   10 +-
 python/pyspark/testing/pandasutils.py              |    2 +-
 python/setup.py                                    |    1 -
 23 files changed, 259 insertions(+), 1881 deletions(-)
 delete mode 100644 python/pyspark/pandas/data_type_ops/__init__.py
 delete mode 100644 python/pyspark/pandas/data_type_ops/base.py
 delete mode 100644 python/pyspark/pandas/data_type_ops/boolean_ops.py
 delete mode 100644 python/pyspark/pandas/data_type_ops/categorical_ops.py
 delete mode 100644 python/pyspark/pandas/data_type_ops/date_ops.py
 delete mode 100644 python/pyspark/pandas/data_type_ops/datetime_ops.py
 delete mode 100644 python/pyspark/pandas/data_type_ops/num_ops.py
 delete mode 100644 python/pyspark/pandas/data_type_ops/string_ops.py
 delete mode 100644 python/pyspark/pandas/tests/data_type_ops/__init__.py
 delete mode 100644 python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py
 delete mode 100644 python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py
 delete mode 100644 python/pyspark/pandas/tests/data_type_ops/test_date_ops.py
 delete mode 100644 python/pyspark/pandas/tests/data_type_ops/test_datetime_ops.py
 delete mode 100644 python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
 delete mode 100644 python/pyspark/pandas/tests/data_type_ops/test_string_ops.py
 delete mode 100644 python/pyspark/pandas/tests/data_type_ops/testing_utils.py
[spark] branch master updated (d1b24d8 -> 586caae)
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from d1b24d8  [SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures
     add 586caae  [SPARK-35438][SQL][DOCS] Minor documentation fix for window physical operator

No new revisions were added by this update.

Summary of changes:
 .../main/scala/org/apache/spark/sql/execution/window/WindowExec.scala   | 2 +-
 .../scala/org/apache/spark/sql/execution/window/WindowExecBase.scala    | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
[spark] branch master updated: [SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new d1b24d8  [SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures

d1b24d8 is described below

commit d1b24d8aba8317c62542a81ca55c12700a07cb80
Author: Xinrong Meng
AuthorDate: Wed May 19 15:05:32 2021 -0700

    [SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures

    ### What changes were proposed in this pull request?

    The PR is proposed for **pandas APIs on Spark**, in order to separate arithmetic operations shown as below into data-type-based structures.
    `__add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__, __radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__, __rmod__`

    DataTypeOps and subclasses are introduced. The existing behaviors of each arithmetic operation should be preserved.

    ### Why are the changes needed?

    Currently, the same arithmetic operation of all data types is defined in one function, so it's difficult to extend the behavior change based on the data types. Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing).

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Tests are introduced under pyspark.pandas.tests.data_type_ops. One test file per DataTypeOps class.

    Closes #32469 from xinrong-databricks/datatypeop_arith.
    Authored-by: Xinrong Meng
    Signed-off-by: Takuya UESHIN
---
 dev/sparktestsupport/modules.py                    |    6 +
 python/pyspark/pandas/base.py                      |  268 ++-
 python/pyspark/pandas/data_type_ops/__init__.py    |   16 +
 python/pyspark/pandas/data_type_ops/base.py        |  120 +++
 python/pyspark/pandas/data_type_ops/boolean_ops.py |   28 ++
 .../pandas/data_type_ops/categorical_ops.py        |   28 ++
 python/pyspark/pandas/data_type_ops/date_ops.py    |   71
 .../pyspark/pandas/data_type_ops/datetime_ops.py   |   72
 python/pyspark/pandas/data_type_ops/num_ops.py     |  378 +
 python/pyspark/pandas/data_type_ops/string_ops.py  |  104 ++
 .../pyspark/pandas/tests/data_type_ops/__init__.py |   16 +
 .../pandas/tests/data_type_ops/test_boolean_ops.py |  150
 .../tests/data_type_ops/test_categorical_ops.py    |  128 +++
 .../pandas/tests/data_type_ops/test_date_ops.py    |  158 +
 .../tests/data_type_ops/test_datetime_ops.py       |  160 +
 .../pandas/tests/data_type_ops/test_num_ops.py     |  195 +++
 .../pandas/tests/data_type_ops/test_string_ops.py  |  140
 .../pandas/tests/data_type_ops/testing_utils.py    |   75
 .../pyspark/pandas/tests/indexes/test_datetime.py  |   10 +-
 python/pyspark/pandas/tests/test_dataframe.py      |    4 +-
 .../pyspark/pandas/tests/test_series_datetime.py   |   10 +-
 python/pyspark/testing/pandasutils.py              |    2 +-
 python/setup.py                                    |    1 +
 23 files changed, 1881 insertions(+), 259 deletions(-)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index ab65ccd..c35618e 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -611,6 +611,12 @@ pyspark_pandas = Module(
         "pyspark.pandas.spark.utils",
         "pyspark.pandas.typedef.typehints",
         # unittests
+        "pyspark.pandas.tests.data_type_ops.test_boolean_ops",
+        "pyspark.pandas.tests.data_type_ops.test_categorical_ops",
+        "pyspark.pandas.tests.data_type_ops.test_date_ops",
+        "pyspark.pandas.tests.data_type_ops.test_datetime_ops",
+        "pyspark.pandas.tests.data_type_ops.test_num_ops",
+        "pyspark.pandas.tests.data_type_ops.test_string_ops",
         "pyspark.pandas.tests.indexes.test_base",
         "pyspark.pandas.tests.indexes.test_category",
         "pyspark.pandas.tests.indexes.test_datetime",
diff --git a/python/pyspark/pandas/base.py b/python/pyspark/pandas/base.py
index 1082052..6cecb73 100644
--- a/python/pyspark/pandas/base.py
+++ b/python/pyspark/pandas/base.py
@@ -19,7 +19,6 @@ Base and utility classes for pandas-on-Spark objects.
 """
 from abc import ABCMeta, abstractmethod
-import datetime
 from functools import wraps, partial
 from itertools import chain
 from typing import Any, Callable, Optional, Tuple, Union, cast, TYPE_CHECKING
@@ -35,7 +34,6 @@ from pyspark.sql.types import (
     DateType,
     DoubleType,
     FloatType,
-    IntegralType,
     LongType,
     NumericType,
     StringType,
@@ -50,11 +48,9 @@ from pyspark.pandas.internal
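The DataTypeOps refactoring described in the commit message moves each arithmetic operation out of one shared function and into per-data-type classes. A minimal sketch of that dispatch pattern, with hypothetical class and function names (not the actual pyspark.pandas API), looks like this:

```python
# Sketch of per-data-type operator dispatch: a base ops class that rejects
# an operation by default, subclasses that define type-specific behavior,
# and a factory that picks the ops object from a value's type.
# All names here are illustrative, not from the Spark codebase.

class DataTypeOps:
    def add(self, left, right):
        raise TypeError(f"+ is not supported for {type(self).__name__}")


class NumericOps(DataTypeOps):
    def add(self, left, right):
        # plain numeric addition
        return left + right


class StringOps(DataTypeOps):
    def add(self, left, right):
        # for strings, "+" means concatenation -- behavior local to this type
        return str(left) + str(right)


_OPS_BY_TYPE = {int: NumericOps(), float: NumericOps(), str: StringOps()}


def ops_for(value):
    """Return the ops object for a value's type; unknown types get the
    base class, whose operations raise TypeError."""
    return _OPS_BY_TYPE.get(type(value), DataTypeOps())
```

The payoff is the one stated in the commit message: changing how one data type behaves means editing one small class instead of a branch inside a monolithic function.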
[spark] branch branch-3.0 updated: [SPARK-35373][BUILD][FOLLOWUP][3.1] Update zinc installation
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new d4b637e  [SPARK-35373][BUILD][FOLLOWUP][3.1] Update zinc installation

d4b637e is described below

commit d4b637e13d0893f7fdb71703e56bc7f0dfc93f90
Author: Dongjoon Hyun
AuthorDate: Wed May 19 09:51:58 2021 -0700

    [SPARK-35373][BUILD][FOLLOWUP][3.1] Update zinc installation

    ### What changes were proposed in this pull request?

    This PR is a follow-up of https://github.com/apache/spark/pull/32505 to fix `zinc` installation.

    ### Why are the changes needed?

    Currently, branch-3.1/3.0 is broken.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Pass the GitHub Action.

    Closes #32591 from dongjoon-hyun/SPARK-35373.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit 50edf1246809ffe2b142cab4a9dc4e4e72df3fe7)
    Signed-off-by: Dongjoon Hyun
---
 build/mvn | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/build/mvn b/build/mvn
index 9c9ea4b..d304f86 100755
--- a/build/mvn
+++ b/build/mvn
@@ -150,7 +150,10 @@ install_zinc() {
   local TYPESAFE_MIRROR=${TYPESAFE_MIRROR:-https://downloads.lightbend.com}

   install_app \
-    "${TYPESAFE_MIRROR}/zinc/${ZINC_VERSION}/zinc-${ZINC_VERSION}.tgz" \
+    "${TYPESAFE_MIRROR}" \
+    "zinc/${ZINC_VERSION}/zinc-${ZINC_VERSION}.tgz" \
+    "" \
+    "" \
     "zinc-${ZINC_VERSION}.tgz" \
     "${zinc_path}"
   ZINC_BIN="${_DIR}/${zinc_path}"
[spark] branch branch-3.1 updated: [SPARK-35373][BUILD][FOLLOWUP][3.1] Update zinc installation
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new 50edf12  [SPARK-35373][BUILD][FOLLOWUP][3.1] Update zinc installation

50edf12 is described below

commit 50edf1246809ffe2b142cab4a9dc4e4e72df3fe7
Author: Dongjoon Hyun
AuthorDate: Wed May 19 09:51:58 2021 -0700

    [SPARK-35373][BUILD][FOLLOWUP][3.1] Update zinc installation

    ### What changes were proposed in this pull request?

    This PR is a follow-up of https://github.com/apache/spark/pull/32505 to fix `zinc` installation.

    ### Why are the changes needed?

    Currently, branch-3.1/3.0 is broken.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Pass the GitHub Action.

    Closes #32591 from dongjoon-hyun/SPARK-35373.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 build/mvn | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/build/mvn b/build/mvn
index 4e3a8fa..e62abf3 100755
--- a/build/mvn
+++ b/build/mvn
@@ -142,7 +142,10 @@ install_zinc() {
   local TYPESAFE_MIRROR=${TYPESAFE_MIRROR:-https://downloads.lightbend.com}

   install_app \
-    "${TYPESAFE_MIRROR}/zinc/${ZINC_VERSION}/zinc-${ZINC_VERSION}.tgz" \
+    "${TYPESAFE_MIRROR}" \
+    "zinc/${ZINC_VERSION}/zinc-${ZINC_VERSION}.tgz" \
+    "" \
+    "" \
     "zinc-${ZINC_VERSION}.tgz" \
     "${zinc_path}"
   ZINC_BIN="${_DIR}/${zinc_path}"
[GitHub] [spark-website] HyukjinKwon commented on pull request #343: Make 2.4.8 as EOL release
HyukjinKwon commented on pull request #343:
URL: https://github.com/apache/spark-website/pull/343#issuecomment-844272890

👍

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[spark] branch branch-3.0 updated: [SPARK-35373][BUILD] Check Maven artifact checksum in build/mvn
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 706f91e  [SPARK-35373][BUILD] Check Maven artifact checksum in build/mvn

706f91e is described below

commit 706f91e5e0ae3c6aa156232546f815840f977201
Author: Sean Owen
AuthorDate: Thu May 13 09:06:57 2021 -0500

    [SPARK-35373][BUILD] Check Maven artifact checksum in build/mvn

    ### What changes were proposed in this pull request?

    `./build/mvn` now downloads the .sha512 checksum of Maven artifacts it downloads, and checks the checksum after download.

    ### Why are the changes needed?

    This ensures the integrity of the Maven artifact during a user's build, which may come from several non-ASF mirrors.

    ### Does this PR introduce _any_ user-facing change?

    Should not affect anything about Spark per se, just the build.

    ### How was this patch tested?

    Manual testing wherein I forced Maven/Scala download, verified checksums are downloaded and checked, and verified it fails on error with a corrupted checksum.

    Closes #32505 from srowen/SPARK-35373.

    Authored-by: Sean Owen
    Signed-off-by: Sean Owen
---
 build/mvn | 90 +--
 1 file changed, 65 insertions(+), 25 deletions(-)

diff --git a/build/mvn b/build/mvn
index f33dedc..9c9ea4b 100755
--- a/build/mvn
+++ b/build/mvn
@@ -26,14 +26,23 @@ _COMPILE_JVM_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"

 # Installs any application tarball given a URL, the expected tarball name,
 # and, optionally, a checkable binary path to determine if the binary has
-# already been installed
-## Arg1 - URL
-## Arg2 - Tarball Name
-## Arg3 - Checkable Binary
+# already been installed. Arguments:
+# 1 - Mirror host
+# 2 - URL path on host
+# 3 - URL query string
+# 4 - checksum suffix
+# 5 - Tarball Name
+# 6 - Checkable Binary
 install_app() {
-  local remote_tarball="$1"
-  local local_tarball="${_DIR}/$2"
-  local binary="${_DIR}/$3"
+  local mirror_host="$1"
+  local url_path="$2"
+  local url_query="$3"
+  local checksum_suffix="$4"
+  local local_tarball="${_DIR}/$5"
+  local binary="${_DIR}/$6"
+  local remote_tarball="${mirror_host}/${url_path}${url_query}"
+  local local_checksum="${local_tarball}.${checksum_suffix}"
+  local remote_checksum="https://archive.apache.org/dist/${url_path}.${checksum_suffix}"

   # setup `curl` and `wget` silent options if we're running on Jenkins
   local curl_opts="-L"
@@ -46,24 +55,46 @@ install_app() {
     wget_opts="--progress=bar:force ${wget_opts}"
   fi

-  if [ -z "$3" -o ! -f "$binary" ]; then
+  if [ ! -f "$binary" ]; then
     # check if we already have the tarball
     # check if we have curl installed
     # download application
-    [ ! -f "${local_tarball}" ] && [ $(command -v curl) ] && \
-      echo "exec: curl ${curl_opts} ${remote_tarball}" 1>&2 && \
+    if [ ! -f "${local_tarball}" -a $(command -v curl) ]; then
+      echo "exec: curl ${curl_opts} ${remote_tarball}" 1>&2
       curl ${curl_opts} "${remote_tarball}" > "${local_tarball}"
+      if [ ! -z "${checksum_suffix}" ]; then
+        echo "exec: curl ${curl_opts} ${remote_checksum}" 1>&2
+        curl ${curl_opts} "${remote_checksum}" > "${local_checksum}"
+      fi
+    fi
     # if the file still doesn't exist, lets try `wget` and cross our fingers
-    [ ! -f "${local_tarball}" ] && [ $(command -v wget) ] && \
-      echo "exec: wget ${wget_opts} ${remote_tarball}" 1>&2 && \
+    if [ ! -f "${local_tarball}" -a $(command -v wget) ]; then
+      echo "exec: wget ${wget_opts} ${remote_tarball}" 1>&2
       wget ${wget_opts} -O "${local_tarball}" "${remote_tarball}"
+      if [ ! -z "${checksum_suffix}" ]; then
+        echo "exec: wget ${wget_opts} ${remote_checksum}" 1>&2
+        wget ${wget_opts} -O "${local_checksum}" "${remote_checksum}"
+      fi
+    fi
     # if both were unsuccessful, exit
-    [ ! -f "${local_tarball}" ] && \
-      echo -n "ERROR: Cannot download $2 with cURL or wget; " && \
-      echo "please install manually and try again." && \
+    if [ ! -f "${local_tarball}" ]; then
+      echo -n "ERROR: Cannot download ${remote_tarball} with cURL or wget; please install manually and try again."
       exit 2
-    cd "${_DIR}" && tar -xzf "$2"
-    rm -rf "$local_tarball"
+    fi
+    # Checksum may not have been specified; don't check if doesn't exist
+    if [ -f "${local_checksum}" ]; then
+      echo "  ${local_tarball}" >> ${local_checksum} # two spaces + file are important!
+      # Assuming SHA512 here for now
+      echo "Veryfing checksum from ${local_checksum}" 1>&2
+      if ! shasum -a 512 -q -c "${local_checksum}" ; then
+        echo "Bad checksum from ${remote_checksum}"
+        exit 2
+      fi
+    fi
+
+    cd "${_DIR}" && tar
[spark] branch branch-3.1 updated: [SPARK-35373][BUILD] Check Maven artifact checksum in build/mvn
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new 5d4de77  [SPARK-35373][BUILD] Check Maven artifact checksum in build/mvn

5d4de77 is described below

commit 5d4de77a00324d249df31e61a94812c5b93b7b40
Author: Sean Owen
AuthorDate: Thu May 13 09:06:57 2021 -0500

    [SPARK-35373][BUILD] Check Maven artifact checksum in build/mvn

    ### What changes were proposed in this pull request?

    `./build/mvn` now downloads the .sha512 checksum of Maven artifacts it downloads, and checks the checksum after download.

    ### Why are the changes needed?

    This ensures the integrity of the Maven artifact during a user's build, which may come from several non-ASF mirrors.

    ### Does this PR introduce _any_ user-facing change?

    Should not affect anything about Spark per se, just the build.

    ### How was this patch tested?

    Manual testing wherein I forced Maven/Scala download, verified checksums are downloaded and checked, and verified it fails on error with a corrupted checksum.

    Closes #32505 from srowen/SPARK-35373.

    Authored-by: Sean Owen
    Signed-off-by: Sean Owen
---
 build/mvn | 90 +--
 1 file changed, 65 insertions(+), 25 deletions(-)

diff --git a/build/mvn b/build/mvn
index b25df09..4e3a8fa 100755
--- a/build/mvn
+++ b/build/mvn
@@ -26,36 +26,67 @@ _COMPILE_JVM_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"

 # Installs any application tarball given a URL, the expected tarball name,
 # and, optionally, a checkable binary path to determine if the binary has
-# already been installed
-## Arg1 - URL
-## Arg2 - Tarball Name
-## Arg3 - Checkable Binary
+# already been installed. Arguments:
+# 1 - Mirror host
+# 2 - URL path on host
+# 3 - URL query string
+# 4 - checksum suffix
+# 5 - Tarball Name
+# 6 - Checkable Binary
 install_app() {
-  local remote_tarball="$1"
-  local local_tarball="${_DIR}/$2"
-  local binary="${_DIR}/$3"
+  local mirror_host="$1"
+  local url_path="$2"
+  local url_query="$3"
+  local checksum_suffix="$4"
+  local local_tarball="${_DIR}/$5"
+  local binary="${_DIR}/$6"
+  local remote_tarball="${mirror_host}/${url_path}${url_query}"
+  local local_checksum="${local_tarball}.${checksum_suffix}"
+  local remote_checksum="https://archive.apache.org/dist/${url_path}.${checksum_suffix}"

   local curl_opts="--silent --show-error -L"
   local wget_opts="--no-verbose"

-  if [ -z "$3" -o ! -f "$binary" ]; then
+  if [ ! -f "$binary" ]; then
     # check if we already have the tarball
     # check if we have curl installed
     # download application
-    [ ! -f "${local_tarball}" ] && [ $(command -v curl) ] && \
-      echo "exec: curl ${curl_opts} ${remote_tarball}" 1>&2 && \
+    if [ ! -f "${local_tarball}" -a $(command -v curl) ]; then
+      echo "exec: curl ${curl_opts} ${remote_tarball}" 1>&2
       curl ${curl_opts} "${remote_tarball}" > "${local_tarball}"
+      if [ ! -z "${checksum_suffix}" ]; then
+        echo "exec: curl ${curl_opts} ${remote_checksum}" 1>&2
+        curl ${curl_opts} "${remote_checksum}" > "${local_checksum}"
+      fi
+    fi
     # if the file still doesn't exist, lets try `wget` and cross our fingers
-    [ ! -f "${local_tarball}" ] && [ $(command -v wget) ] && \
-      echo "exec: wget ${wget_opts} ${remote_tarball}" 1>&2 && \
+    if [ ! -f "${local_tarball}" -a $(command -v wget) ]; then
+      echo "exec: wget ${wget_opts} ${remote_tarball}" 1>&2
       wget ${wget_opts} -O "${local_tarball}" "${remote_tarball}"
+      if [ ! -z "${checksum_suffix}" ]; then
+        echo "exec: wget ${wget_opts} ${remote_checksum}" 1>&2
+        wget ${wget_opts} -O "${local_checksum}" "${remote_checksum}"
+      fi
+    fi
     # if both were unsuccessful, exit
-    [ ! -f "${local_tarball}" ] && \
-      echo -n "ERROR: Cannot download $2 with cURL or wget; " && \
-      echo "please install manually and try again." && \
+    if [ ! -f "${local_tarball}" ]; then
+      echo -n "ERROR: Cannot download ${remote_tarball} with cURL or wget; please install manually and try again."
       exit 2
-    cd "${_DIR}" && tar -xzf "$2"
-    rm -rf "$local_tarball"
+    fi
+    # Checksum may not have been specified; don't check if doesn't exist
+    if [ -f "${local_checksum}" ]; then
+      echo "  ${local_tarball}" >> ${local_checksum} # two spaces + file are important!
+      # Assuming SHA512 here for now
+      echo "Veryfing checksum from ${local_checksum}" 1>&2
+      if ! shasum -a 512 -q -c "${local_checksum}" ; then
+        echo "Bad checksum from ${remote_checksum}"
+        exit 2
+      fi
+    fi
+
+    cd "${_DIR}" && tar -xzf "${local_tarball}"
+    rm -rf "${local_tarball}"
+    rm -f "${local_checksum}"
   fi
 }

@@ -71,21 +
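The build/mvn change above verifies each downloaded tarball against an Apache-style `.sha512` file before unpacking it. The same check can be sketched in a few lines of Python (the function name is illustrative; this is a stand-alone sketch, not Spark's actual shell implementation, which uses `shasum -a 512 -c`):

```python
import hashlib


def verify_sha512(tarball_path, checksum_path):
    """Verify a downloaded file against a '.sha512' checksum file whose
    first whitespace-separated token is the expected hex digest
    (the '<digest>  <filename>' format shasum consumes)."""
    with open(checksum_path) as f:
        expected = f.read().split()[0].lower()
    h = hashlib.sha512()
    with open(tarball_path, "rb") as f:
        # stream in 1 MiB chunks so large tarballs don't load into memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected
```

As in the shell version, a mismatch should abort the install rather than unpack a possibly tampered artifact from a third-party mirror.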
[spark] branch branch-3.0 updated: [SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up cached exchanges for re-use
This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 1606791  [SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up cached exchanges for re-use

1606791 is described below

commit 1606791bf54d785eca1ae42fbd98e68a1d58e884
Author: Andy Grove
AuthorDate: Wed May 19 07:45:26 2021 -0500

    [SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up cached exchanges for re-use

    ### What changes were proposed in this pull request?

    AQE has an optimization where it attempts to reuse compatible exchanges but it does not take into account whether the exchanges are columnar or not, resulting in incorrect reuse under some circumstances.

    This PR simply changes the key used to lookup cached stages. It now uses the canonicalized form of the new query stage (potentially created by a plugin) rather than using the canonicalized form of the original exchange.

    ### Why are the changes needed?

    When using the [RAPIDS Accelerator for Apache Spark](https://github.com/NVIDIA/spark-rapids) we sometimes see a new query stage correctly create a row-based exchange and then Spark replaces it with a cached columnar exchange, which is not compatible, and this causes queries to fail.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    The patch has been tested with the query that highlighted this issue. I looked at writing unit tests for this but it would involve implementing a mock columnar exchange in the tests so would be quite a bit of work. If anyone has ideas on other ways to test this I am happy to hear them.

    Closes #32195 from andygrove/SPARK-35093.

    Authored-by: Andy Grove
    Signed-off-by: Thomas Graves
    (cherry picked from commit 52e3cf9ff50b4209e29cb06df09b1ef3a18bc83b)
    Signed-off-by: Thomas Graves
---
 .../apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
index 187827c..e7a8034 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
@@ -349,7 +349,8 @@ case class AdaptiveSparkPlanExec(
           // Check the `stageCache` again for reuse. If a match is found, ditch the new stage
           // and reuse the existing stage found in the `stageCache`, otherwise update the
           // `stageCache` with the new stage.
-          val queryStage = context.stageCache.getOrElseUpdate(e.canonicalized, newStage)
+          val queryStage = context.stageCache.getOrElseUpdate(
+            newStage.plan.canonicalized, newStage)
           if (queryStage.ne(newStage)) {
             newStage = reuseQueryStage(queryStage, e)
           }
[spark] branch branch-3.1 updated: [SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up cached exchanges for re-use
This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new 73abceb  [SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up cached exchanges for re-use

73abceb is described below

commit 73abceb05d64eafeb39866c69a84d0b7f3c1f097
Author: Andy Grove
AuthorDate: Wed May 19 07:45:26 2021 -0500

    [SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up cached exchanges for re-use

    ### What changes were proposed in this pull request?

    AQE has an optimization where it attempts to reuse compatible exchanges, but it does not take into account whether the exchanges are columnar or not, resulting in incorrect reuse under some circumstances.

    This PR simply changes the key used to look up cached stages. It now uses the canonicalized form of the new query stage (potentially created by a plugin) rather than the canonicalized form of the original exchange.

    ### Why are the changes needed?

    When using the [RAPIDS Accelerator for Apache Spark](https://github.com/NVIDIA/spark-rapids) we sometimes see a new query stage correctly create a row-based exchange, and then Spark replaces it with a cached columnar exchange, which is not compatible and causes queries to fail.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    The patch has been tested with the query that highlighted this issue. I looked at writing unit tests for this, but it would involve implementing a mock columnar exchange in the tests, so it would be quite a bit of work. If anyone has ideas on other ways to test this, I am happy to hear them.

    Closes #32195 from andygrove/SPARK-35093.

Authored-by: Andy Grove
Signed-off-by: Thomas Graves
(cherry picked from commit 52e3cf9ff50b4209e29cb06df09b1ef3a18bc83b)
Signed-off-by: Thomas Graves
---
 .../apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
index 89d3b53..596c8b3 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
@@ -418,7 +418,8 @@ case class AdaptiveSparkPlanExec(
       // Check the `stageCache` again for reuse. If a match is found, ditch the new stage
       // and reuse the existing stage found in the `stageCache`, otherwise update the
       // `stageCache` with the new stage.
-      val queryStage = context.stageCache.getOrElseUpdate(e.canonicalized, newStage)
+      val queryStage = context.stageCache.getOrElseUpdate(
+        newStage.plan.canonicalized, newStage)
       if (queryStage.ne(newStage)) {
         newStage = reuseQueryStage(queryStage, e)
       }
[spark] branch master updated: [SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up cached exchanges for re-use
This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 52e3cf9  [SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up cached exchanges for re-use

52e3cf9 is described below

commit 52e3cf9ff50b4209e29cb06df09b1ef3a18bc83b
Author: Andy Grove
AuthorDate: Wed May 19 07:45:26 2021 -0500

    [SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up cached exchanges for re-use

    ### What changes were proposed in this pull request?

    AQE has an optimization where it attempts to reuse compatible exchanges, but it does not take into account whether the exchanges are columnar or not, resulting in incorrect reuse under some circumstances.

    This PR simply changes the key used to look up cached stages. It now uses the canonicalized form of the new query stage (potentially created by a plugin) rather than the canonicalized form of the original exchange.

    ### Why are the changes needed?

    When using the [RAPIDS Accelerator for Apache Spark](https://github.com/NVIDIA/spark-rapids) we sometimes see a new query stage correctly create a row-based exchange, and then Spark replaces it with a cached columnar exchange, which is not compatible and causes queries to fail.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    The patch has been tested with the query that highlighted this issue. I looked at writing unit tests for this, but it would involve implementing a mock columnar exchange in the tests, so it would be quite a bit of work. If anyone has ideas on other ways to test this, I am happy to hear them.

    Closes #32195 from andygrove/SPARK-35093.

Authored-by: Andy Grove
Signed-off-by: Thomas Graves
---
 .../apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
index 256aacb..766788a 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala
@@ -435,7 +435,8 @@ case class AdaptiveSparkPlanExec(
       // Check the `stageCache` again for reuse. If a match is found, ditch the new stage
       // and reuse the existing stage found in the `stageCache`, otherwise update the
       // `stageCache` with the new stage.
-      val queryStage = context.stageCache.getOrElseUpdate(e.canonicalized, newStage)
+      val queryStage = context.stageCache.getOrElseUpdate(
+        newStage.plan.canonicalized, newStage)
       if (queryStage.ne(newStage)) {
         newStage = reuseQueryStage(queryStage, e)
       }
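The one-line change above can be illustrated with a small self-contained sketch. The `Plan` and `QueryStage` types below are invented stand-ins, not Spark's actual classes; the point is only that keying the cache on the *new* stage's canonical plan keeps row-based and columnar stages from colliding, while still allowing reuse of genuinely compatible stages.

```scala
import scala.collection.mutable

// Invented stand-ins for Spark's plan and query-stage types.
case class Plan(desc: String, columnar: Boolean) {
  // Canonical form drops cosmetic details (here: case) but keeps the
  // row-vs-columnar distinction, so incompatible stages get different keys.
  def canonicalized: (String, Boolean) = (desc.toLowerCase, columnar)
}
case class QueryStage(plan: Plan)

val stageCache = mutable.Map.empty[(String, Boolean), QueryStage]

// Mirrors the patched lookup: the key is newStage.plan.canonicalized,
// not the canonical form of the original exchange.
def getOrReuse(newStage: QueryStage): QueryStage =
  stageCache.getOrElseUpdate(newStage.plan.canonicalized, newStage)

val columnar = QueryStage(Plan("Exchange hashpartitioning(a)", columnar = true))
val rowBased = QueryStage(Plan("Exchange hashpartitioning(a)", columnar = false))

assert(getOrReuse(columnar) eq columnar) // first stage gets cached
assert(getOrReuse(rowBased) eq rowBased) // different key: no incompatible reuse
assert(getOrReuse(QueryStage(Plan("EXCHANGE hashpartitioning(a)", columnar = true))) eq columnar) // compatible stage is reused
```

Before the fix, the equivalent of keying on the original exchange's canonical form would have handed the row-based stage the cached columnar one, which is the failure mode the RAPIDS plugin hit.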
[spark] branch master updated (9283beb -> 1214213)
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 9283beb  [SPARK-35418][SQL] Add sentences function to functions.{scala,py}
 add 1214213  [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation

No new revisions were added by this update.

Summary of changes:
 .../logical/statsEstimation/FilterEstimation.scala |  2 +-
 .../logical/statsEstimation/UnionEstimation.scala  | 97 ++
 .../BasicStatsEstimationSuite.scala                |  2 +-
 .../statsEstimation/UnionEstimationSuite.scala     | 65 +--
 4 files changed, 122 insertions(+), 44 deletions(-)
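The SPARK-35362 change extends UNION stats estimation to carry null counts through. As a rough, hedged illustration (the names below are invented for this sketch and are not Spark's `UnionEstimation` API): a UNION output column's null count can be estimated by summing the null counts of the matching column in every child, and left unknown if any child lacks one.

```scala
// Illustrative model only: per-column null-count propagation for UNION.
case class ColumnStat(nullCount: Option[BigInt])

def unionNullCount(childStats: Seq[ColumnStat]): Option[BigInt] =
  // Only propagate a count when every child reports one; otherwise the
  // combined statistic stays unknown rather than being underestimated.
  if (childStats.forall(_.nullCount.isDefined))
    Some(childStats.flatMap(_.nullCount).sum)
  else None

assert(unionNullCount(Seq(ColumnStat(Some(BigInt(3))), ColumnStat(Some(BigInt(5))))) == Some(BigInt(8)))
assert(unionNullCount(Seq(ColumnStat(Some(BigInt(3))), ColumnStat(None))).isEmpty)
```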
[spark] branch master updated (46f7d78 -> 9283beb)
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 46f7d78  [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
 add 9283beb  [SPARK-35418][SQL] Add sentences function to functions.{scala,py}

No new revisions were added by this update.

Summary of changes:
 python/docs/source/reference/pyspark.sql.rst    |  1 +
 python/pyspark/sql/functions.py                 | 39 ++
 python/pyspark/sql/functions.pyi                |  5 +++
 .../scala/org/apache/spark/sql/functions.scala  | 19 +++
 .../apache/spark/sql/StringFunctionsSuite.scala |  7 
 5 files changed, 71 insertions(+)
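For context on SPARK-35418: the SQL `sentences` function splits a string into sentences, each tokenized into words, yielding an array of word arrays. The pure-Scala model below is a deliberately naive, regex-based sketch of that shape for illustration only; the real implementation is locale-aware, so do not treat this as its actual tokenization rules.

```scala
// Rough model of the value shape sentences() produces:
// Seq(sentence)(word) -- each sentence becomes a sequence of words.
def sentencesModel(text: String): Seq[Seq[String]] =
  text.split("[.!?]+")        // naive sentence boundary: terminal punctuation
    .toSeq
    .map(_.trim)
    .filter(_.nonEmpty)
    .map(_.split("[\\s,]+").toSeq) // naive word boundary: whitespace/commas

assert(sentencesModel("Hi there! Good morning.") ==
  Seq(Seq("Hi", "there"), Seq("Good", "morning")))
```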
[spark] branch master updated (a72d05c -> 46f7d78)
This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from a72d05c  [SPARK-35106][CORE][SQL] Avoid failing rename caused by destination directory not exist
 add 46f7d78  [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

No new revisions were added by this update.

Summary of changes:
 .../plans/logical/basicLogicalOperators.scala |  43 +++-
 .../BasicStatsEstimationSuite.scala           |  81 ++
 2 files changed, 108 insertions(+), 16 deletions(-)
[spark] branch branch-3.0 updated: [SPARK-35106][CORE][SQL] Avoid failing rename caused by destination directory not exist
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 0f78d8b  [SPARK-35106][CORE][SQL] Avoid failing rename caused by destination directory not exist

0f78d8b is described below

commit 0f78d8bd211e2dcf946e7fa658bce600843bad1b
Author: Yuzhou Sun
AuthorDate: Wed May 19 15:46:27 2021 +0800

    [SPARK-35106][CORE][SQL] Avoid failing rename caused by destination directory not exist

    ### What changes were proposed in this pull request?

    1. In HadoopMapReduceCommitProtocol, create the parent directory before renaming custom partition path staging files
    2. In InMemoryCatalog and HiveExternalCatalog, create the new partition directory before renaming the old partition path
    3. Check the return value of FileSystem#rename; if it is false, throw an exception to avoid silent data loss caused by rename failure
    4. Change DebugFilesystem#rename behavior to match HDFS's behavior (return false without renaming when the destination's parent directory does not exist)

    ### Why are the changes needed?

    Depending on the FileSystem#rename implementation, when the destination directory does not exist, a file system may
    1. return false without renaming the file or throwing an exception (e.g. HDFS), or
    2. create the destination directory, rename the files, and return true (e.g. LocalFileSystem)

    In the first case above, renames in HadoopMapReduceCommitProtocol for a custom partition path will fail silently if the destination partition path does not exist. Failed renames can happen when
    1. dynamicPartitionOverwrite == true and the custom partition path directories are deleted by the job before the rename; or
    2. the custom partition path directories do not exist before the job; or
    3. something else goes wrong when the file system handles `rename`

    The renames in InMemoryCatalog and HiveExternalCatalog for partition renaming have a similar issue.

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    Modified DebugFilesystem#rename and added new unit tests. Without the fix in the src code, five InsertSuite tests and one AlterTableRenamePartitionSuite test failed:

    InsertSuite.SPARK-20236: dynamic partition overwrite with custom partition path (existing test with modified FS)
    ```
    == Results ==
    !== Correct Answer - 1 ==   == Spark Answer - 0 ==
     struct<>                   struct<>
    ![2,1,1]
    ```

    InsertSuite.SPARK-35106: insert overwrite with custom partition path
    ```
    == Results ==
    !== Correct Answer - 1 ==   == Spark Answer - 0 ==
     struct<>                   struct<>
    ![2,1,1]
    ```

    InsertSuite.SPARK-35106: dynamic partition overwrite with custom partition path
    ```
    == Results ==
    !== Correct Answer - 2 ==   == Spark Answer - 1 ==
    !struct<>                   struct
     [1,1,1]                    [1,1,1]
    ![1,1,2]
    ```

    InsertSuite.SPARK-35106: Throw exception when rename custom partition paths returns false
    ```
    Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown
    ```

    InsertSuite.SPARK-35106: Throw exception when rename dynamic partition paths returns false
    ```
    Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown
    ```

    AlterTableRenamePartitionSuite.ALTER TABLE .. RENAME PARTITION V1: multi part partition (existing test with modified FS)
    ```
    == Results ==
    !== Correct Answer - 1 ==   == Spark Answer - 0 ==
     struct<>                   struct<>
    ![3,123,3]
    ```

    Closes #32530 from YuzhouSun/SPARK-35106.

Authored-by: Yuzhou Sun
Signed-off-by: Wenchen Fan
(cherry picked from commit a72d05c7e632fbb0d8a6082c3cacdf61f36518b4)
Signed-off-by: Wenchen Fan
---
 .../io/HadoopMapReduceCommitProtocol.scala         |  19 +++-
 .../scala/org/apache/spark/DebugFilesystem.scala   |  14 ++-
 .../sql/catalyst/catalog/InMemoryCatalog.scala     |   6 +-
 .../org/apache/spark/sql/sources/InsertSuite.scala | 116 -
 .../spark/sql/hive/HiveExternalCatalog.scala       |   6 +-
 5 files changed, 151 insertions(+), 10 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala b/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
index 11ce608..a03a999 100644
--- a/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
+++ b/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
@@ -173,13 +173,18 @@ class HadoopMapReduceCommitProtocol(
     val filesToMove = allAbsPathFiles.foldLeft(Map[String, String]())(_ ++ _)
     logDebug(s"Commit
[spark] branch branch-3.1 updated: [SPARK-35106][CORE][SQL] Avoid failing rename caused by destination directory not exist
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new 484db68  [SPARK-35106][CORE][SQL] Avoid failing rename caused by destination directory not exist

484db68 is described below

commit 484db68d6901a98d4dd82411def2534f12c14f29
Author: Yuzhou Sun
AuthorDate: Wed May 19 15:46:27 2021 +0800

    [SPARK-35106][CORE][SQL] Avoid failing rename caused by destination directory not exist

    ### What changes were proposed in this pull request?

    1. In HadoopMapReduceCommitProtocol, create the parent directory before renaming custom partition path staging files
    2. In InMemoryCatalog and HiveExternalCatalog, create the new partition directory before renaming the old partition path
    3. Check the return value of FileSystem#rename; if it is false, throw an exception to avoid silent data loss caused by rename failure
    4. Change DebugFilesystem#rename behavior to match HDFS's behavior (return false without renaming when the destination's parent directory does not exist)

    ### Why are the changes needed?

    Depending on the FileSystem#rename implementation, when the destination directory does not exist, a file system may
    1. return false without renaming the file or throwing an exception (e.g. HDFS), or
    2. create the destination directory, rename the files, and return true (e.g. LocalFileSystem)

    In the first case above, renames in HadoopMapReduceCommitProtocol for a custom partition path will fail silently if the destination partition path does not exist. Failed renames can happen when
    1. dynamicPartitionOverwrite == true and the custom partition path directories are deleted by the job before the rename; or
    2. the custom partition path directories do not exist before the job; or
    3. something else goes wrong when the file system handles `rename`

    The renames in InMemoryCatalog and HiveExternalCatalog for partition renaming have a similar issue.

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    Modified DebugFilesystem#rename and added new unit tests. Without the fix in the src code, five InsertSuite tests and one AlterTableRenamePartitionSuite test failed:

    InsertSuite.SPARK-20236: dynamic partition overwrite with custom partition path (existing test with modified FS)
    ```
    == Results ==
    !== Correct Answer - 1 ==   == Spark Answer - 0 ==
     struct<>                   struct<>
    ![2,1,1]
    ```

    InsertSuite.SPARK-35106: insert overwrite with custom partition path
    ```
    == Results ==
    !== Correct Answer - 1 ==   == Spark Answer - 0 ==
     struct<>                   struct<>
    ![2,1,1]
    ```

    InsertSuite.SPARK-35106: dynamic partition overwrite with custom partition path
    ```
    == Results ==
    !== Correct Answer - 2 ==   == Spark Answer - 1 ==
    !struct<>                   struct
     [1,1,1]                    [1,1,1]
    ![1,1,2]
    ```

    InsertSuite.SPARK-35106: Throw exception when rename custom partition paths returns false
    ```
    Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown
    ```

    InsertSuite.SPARK-35106: Throw exception when rename dynamic partition paths returns false
    ```
    Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown
    ```

    AlterTableRenamePartitionSuite.ALTER TABLE .. RENAME PARTITION V1: multi part partition (existing test with modified FS)
    ```
    == Results ==
    !== Correct Answer - 1 ==   == Spark Answer - 0 ==
     struct<>                   struct<>
    ![3,123,3]
    ```

    Closes #32530 from YuzhouSun/SPARK-35106.

Authored-by: Yuzhou Sun
Signed-off-by: Wenchen Fan
(cherry picked from commit a72d05c7e632fbb0d8a6082c3cacdf61f36518b4)
Signed-off-by: Wenchen Fan
---
 .../io/HadoopMapReduceCommitProtocol.scala         |  19 +++-
 .../scala/org/apache/spark/DebugFilesystem.scala   |  14 ++-
 .../sql/catalyst/catalog/InMemoryCatalog.scala     |   6 +-
 .../org/apache/spark/sql/sources/InsertSuite.scala | 116 -
 .../spark/sql/hive/HiveExternalCatalog.scala       |   6 +-
 5 files changed, 151 insertions(+), 10 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala b/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
index 30f9a65..c061d61 100644
--- a/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
+++ b/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
@@ -188,13 +188,18 @@ class HadoopMapReduceCommitProtocol(
     val filesToMove = allAbsPathFiles.foldLeft(Map[String, String]())(_ ++ _)
     logDebug(s"Commit
[spark] branch master updated (0b3758e -> a72d05c)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 0b3758e  [SPARK-35421][SS] Remove redundant ProjectExec from streaming queries with V2Relation
 add a72d05c  [SPARK-35106][CORE][SQL] Avoid failing rename caused by destination directory not exist

No new revisions were added by this update.

Summary of changes:
 .../io/HadoopMapReduceCommitProtocol.scala         |  19 +++-
 .../scala/org/apache/spark/DebugFilesystem.scala   |  14 ++-
 .../sql/catalyst/catalog/InMemoryCatalog.scala     |   6 +-
 .../org/apache/spark/sql/sources/InsertSuite.scala | 116 -
 .../spark/sql/hive/HiveExternalCatalog.scala       |   6 +-
 5 files changed, 151 insertions(+), 10 deletions(-)
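The SPARK-35106 commits above follow one pattern: create the destination's parent directory before renaming, and never trust a silent `false` from rename. The sketch below is a hedged illustration of that pattern using `java.nio.file` (it is not Spark's actual code; `java.nio` throws on failure where Hadoop's `FileSystem#rename` instead returns `false`, which the patch now surfaces as an exception).

```scala
import java.nio.file.{Files, Path}

// Move src to dst, first ensuring dst's parent exists -- on HDFS-like
// semantics, rename into a missing directory fails without this step.
def renameOrThrow(src: Path, dst: Path): Unit = {
  val parent = dst.getParent
  if (parent != null && !Files.exists(parent)) {
    Files.createDirectories(parent) // step 1: pre-create destination dir
  }
  // step 2: the move itself; any failure propagates as an exception
  // rather than being silently ignored (no lost partition data).
  Files.move(src, dst)
}

// Demo: rename a staging file into a partition directory that doesn't exist yet.
val tmp = Files.createTempDirectory("rename-demo")
val src = Files.createFile(tmp.resolve("part-0"))
val dst = tmp.resolve("year=2021/month=5/part-0") // parents missing before the call
renameOrThrow(src, dst)
assert(Files.exists(dst) && !Files.exists(src))
```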