[spark] branch master updated: [SPARK-41862][SQL][TESTS][FOLLOWUP] Update OrcReadBenchmark result
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new aa7e54fc072 [SPARK-41862][SQL][TESTS][FOLLOWUP] Update OrcReadBenchmark result
aa7e54fc072 is described below

commit aa7e54fc0728af99cc8c55f3bd88b4ade4aaab05
Author: Dongjoon Hyun
AuthorDate: Tue Jan 3 22:05:50 2023 -0800

    [SPARK-41862][SQL][TESTS][FOLLOWUP] Update OrcReadBenchmark result

    ### What changes were proposed in this pull request?

    This PR is a follow-up of https://github.com/apache/spark/pull/39370 .

    ### Why are the changes needed?

    To sync the patch with the recovered perf result.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Manual review.
    - Java 8: https://github.com/dongjoon-hyun/spark/actions/runs/3834890434
    - Java 11: https://github.com/dongjoon-hyun/spark/actions/runs/3834892478
    - Java 17: https://github.com/dongjoon-hyun/spark/actions/runs/3834893844

    Closes #39380 from dongjoon-hyun/SPARK-41862.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 .../benchmarks/OrcReadBenchmark-jdk11-results.txt | 188 ++---
 .../benchmarks/OrcReadBenchmark-jdk17-results.txt | 188 ++---
 sql/hive/benchmarks/OrcReadBenchmark-results.txt  | 188 ++---
 3 files changed, 282 insertions(+), 282 deletions(-)

diff --git a/sql/hive/benchmarks/OrcReadBenchmark-jdk11-results.txt b/sql/hive/benchmarks/OrcReadBenchmark-jdk11-results.txt
index 5c44741b591..7d6db9ae30d 100644
--- a/sql/hive/benchmarks/OrcReadBenchmark-jdk11-results.txt
+++ b/sql/hive/benchmarks/OrcReadBenchmark-jdk11-results.txt
@@ -3,52 +3,52 @@ SQL Single Numeric Column Scan
 OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
+Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
 SQL Single TINYINT Column Scan:   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-Hive built-in ORC                          1088          1102         18       14.5         69.2      1.0X
-Native ORC MR                               905           971         90       17.4         57.5      1.2X
-Native ORC Vectorized                       137           206         69      114.4          8.7      7.9X
+Hive built-in ORC                          1087          1119         45       14.5         69.1      1.0X
+Native ORC MR                               882           936         50       17.8         56.1      1.2X
+Native ORC Vectorized                       164           213         31       96.0         10.4      6.6X

 OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
+Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
 SQL Single SMALLINT Column Scan:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-Hive built-in ORC                          1265          1279         20       12.4         80.4      1.0X
-Native ORC MR                              1022          1102        113       15.4         65.0      1.2X
-Native ORC Vectorized                       135           201         63      116.8          8.6      9.4X
+Hive built-in ORC                          1282          1289         10       12.3         81.5      1.0X
+Native ORC MR                               916           962         65       17.2         58.2      1.4X
+Native ORC Vectorized                       151           212         47      104.1          9.6      8.5X

 OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
+Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
 SQL Single INT Column Scan:       Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-Hive built-in ORC                          1196          1258         88       13.1         76.0      1.0X
-Native ORC MR                               995          1014         27       15.8         63.3      1.2X
-Native ORC Vectorized
[spark] 01/03: [SPARK-38261][INFRA] Add missing R packages from base image
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

commit b8978fecea4580300c51be748a6afb99fb352974
Author: khalidmammadov
AuthorDate: Mon Feb 21 11:04:48 2022 +0900

    [SPARK-38261][INFRA] Add missing R packages from base image

    The current GitHub workflow job **Linters, licenses, dependencies and documentation generation**
    is missing the R packages needed to complete the documentation and API build. **Build and test**
    is not failing because these packages are installed on the base image. We need to keep them in
    sync with the base image for an easy switch back to the ubuntu runner when ready.

    Reference: [**The base image**](https://hub.docker.com/layers/dongjoon/apache-spark-github-action-image/20220207/images/sha256-af09d172ff8e2cbd71df9a1bc5384a47578c4a4cc293786c539333cafaf4a7ce?context=explore)

    This change adds the missing packages to the workflow file, making it consistent with the base
    image config and letting the job complete standalone (i.e. without this image).

    No user-facing change. Tested with GitHub builds and in local Docker containers.

    Closes #35583 from khalidmammadov/sync_doc_build_with_base.

    Authored-by: khalidmammadov
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 898542746b2c56b2571562ed8e9818bcb565aff2)
    Signed-off-by: Dongjoon Hyun
---
 .github/workflows/build_and_test.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 4d8e6db73e0..2d2e96fdee8 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -406,7 +406,7 @@ jobs:
     python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421
     apt-get update -y
     apt-get install -y ruby ruby-dev
-    Rscript -e "install.packages(c('devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2'), repos='https://cloud.r-project.org/')"
+    Rscript -e "install.packages(c('devtools', 'testthat', 'knitr', 'rmarkdown', 'markdown', 'e1071', 'roxygen2'), repos='https://cloud.r-project.org/')"
     gem install bundler
     cd docs
     bundle install

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.2 updated (576ca6e43c3 -> 0f5e231923b)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

 from 576ca6e43c3 Revert "[SPARK-36939][PYTHON][DOCS] Add orphan migration page into list in PySpark documentation"
  new b8978fecea4 [SPARK-38261][INFRA] Add missing R packages from base image
  new 706cecdc028 [SPARK-39596][INFRA] Install `ggplot2` for GitHub Action linter job
  new 0f5e231923b [SPARK-39596][INFRA][FOLLOWUP] Install `mvtnorm` and `statmod` at linter job

The 3 revisions listed above as "new" are entirely new to this repository and will be
described in separate emails. The revisions listed as "add" were already present in the
repository and have only been added to this reference.

Summary of changes:
 .github/workflows/build_and_test.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] 02/03: [SPARK-39596][INFRA] Install `ggplot2` for GitHub Action linter job
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

commit 706cecdc02833e2ea8f2137383cd0ff1222e8f44
Author: Dongjoon Hyun
AuthorDate: Sat Jun 25 00:31:54 2022 -0700

    [SPARK-39596][INFRA] Install `ggplot2` for GitHub Action linter job

    ### What changes were proposed in this pull request?

    This PR aims to fix the GitHub Action linter job by installing `ggplot2`.

    ### Why are the changes needed?

    The job started to fail like the following.
    - https://github.com/apache/spark/runs/7047294196?check_suite_focus=true
    ```
    x Failed to parse Rd in histogram.Rd
    ℹ there is no package called ‘ggplot2’
    ```

    ### Does this PR introduce _any_ user-facing change?

    No. This is a dev-only change.

    ### How was this patch tested?

    Pass the GitHub Action linter job.

    Closes #36987 from dongjoon-hyun/SPARK-39596.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit bf59f6e4bd7f34f8a36bfef1e93e0ddccddf9e43)
    Signed-off-by: Dongjoon Hyun
---
 .github/workflows/build_and_test.yml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 2d2e96fdee8..7639fea7e79 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -385,6 +385,7 @@ jobs:
     libtiff5-dev libjpeg-dev
     Rscript -e "install.packages(c('devtools'), repos='https://cloud.r-project.org/')"
     Rscript -e "devtools::install_version('lintr', version='2.0.1', repos='https://cloud.r-project.org')"
+    Rscript -e "install.packages(c('ggplot2'), repos='https://cloud.r-project.org/')"
     ./R/install-dev.sh
 - name: Instll JavaScript linter dependencies
   run: |

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] 03/03: [SPARK-39596][INFRA][FOLLOWUP] Install `mvtnorm` and `statmod` at linter job
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

commit 0f5e231923b77d39239acef80b654834834e9b29
Author: Dongjoon Hyun
AuthorDate: Sat Jun 25 20:37:53 2022 +0900

    [SPARK-39596][INFRA][FOLLOWUP] Install `mvtnorm` and `statmod` at linter job

    Closes #36988 from dongjoon-hyun/SPARK-39596-2.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 4c79cc7d5f0d818e479565f5d623e168d777ba0a)
    Signed-off-by: Dongjoon Hyun
---
 .github/workflows/build_and_test.yml | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 7639fea7e79..4a4840995a1 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -385,7 +385,6 @@ jobs:
     libtiff5-dev libjpeg-dev
     Rscript -e "install.packages(c('devtools'), repos='https://cloud.r-project.org/')"
     Rscript -e "devtools::install_version('lintr', version='2.0.1', repos='https://cloud.r-project.org')"
-    Rscript -e "install.packages(c('ggplot2'), repos='https://cloud.r-project.org/')"
     ./R/install-dev.sh
 - name: Instll JavaScript linter dependencies
   run: |
@@ -407,7 +406,7 @@ jobs:
     python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421
     apt-get update -y
     apt-get install -y ruby ruby-dev
-    Rscript -e "install.packages(c('devtools', 'testthat', 'knitr', 'rmarkdown', 'markdown', 'e1071', 'roxygen2'), repos='https://cloud.r-project.org/')"
+    Rscript -e "install.packages(c('devtools', 'testthat', 'knitr', 'rmarkdown', 'markdown', 'e1071', 'roxygen2', 'ggplot2', 'mvtnorm', 'statmod'), repos='https://cloud.r-project.org/')"
     gem install bundler
     cd docs
     bundle install

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (0b786901633 -> 3130ca9748b)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

 from 0b786901633 [SPARK-41850][CONNECT][PYTHON][TESTS] Enable doctest for `isnan`
  add 3130ca9748b [SPARK-41859][SQL] CreateHiveTableAsSelectCommand should set the overwrite flag correctly

No new revisions were added by this update.

Summary of changes:
 .../sql/hive/execution/CreateHiveTableAsSelectCommand.scala | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.2 updated: Revert "[SPARK-36939][PYTHON][DOCS] Add orphan migration page into list in PySpark documentation"
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 576ca6e43c3 Revert "[SPARK-36939][PYTHON][DOCS] Add orphan migration page into list in PySpark documentation"
576ca6e43c3 is described below

commit 576ca6e43c37570bc920cc5239ecbb29a4e34560
Author: Dongjoon Hyun
AuthorDate: Tue Jan 3 21:03:20 2023 -0800

    Revert "[SPARK-36939][PYTHON][DOCS] Add orphan migration page into list in PySpark documentation"

    This reverts commit 0565d95a86e738d24e9c05a4c5c3c3815944b4be.
---
 python/docs/source/migration_guide/index.rst | 1 -
 1 file changed, 1 deletion(-)

diff --git a/python/docs/source/migration_guide/index.rst b/python/docs/source/migration_guide/index.rst
index 2e61653a9a5..b25ac313c7c 100644
--- a/python/docs/source/migration_guide/index.rst
+++ b/python/docs/source/migration_guide/index.rst
@@ -25,7 +25,6 @@ This page describes the migration guide specific to PySpark.
 .. toctree::
    :maxdepth: 2

-   pyspark_3.2_to_3.3
    pyspark_3.1_to_3.2
    pyspark_2.4_to_3.0
    pyspark_2.3_to_2.4

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-41850][CONNECT][PYTHON][TESTS] Enable doctest for `isnan`
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0b786901633 [SPARK-41850][CONNECT][PYTHON][TESTS] Enable doctest for `isnan`
0b786901633 is described below

commit 0b786901633f3b8942dcb4c25e6c8a1671d3c0d6
Author: Ruifeng Zheng
AuthorDate: Wed Jan 4 12:28:20 2023 +0800

    [SPARK-41850][CONNECT][PYTHON][TESTS] Enable doctest for `isnan`

    ### What changes were proposed in this pull request?

    Enable doctest for `isnan`, it had been resolved in https://github.com/apache/spark/pull/39360

    ### Why are the changes needed?

    for test coverage

    ### Does this PR introduce _any_ user-facing change?

    no, test-only

    ### How was this patch tested?

    enabled doctest

    Closes #39376 from zhengruifeng/connect_fix_41850.

    Authored-by: Ruifeng Zheng
    Signed-off-by: Ruifeng Zheng
---
 python/pyspark/sql/connect/functions.py | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/python/pyspark/sql/connect/functions.py b/python/pyspark/sql/connect/functions.py
index c8ddd0cea7c..7a50906ee39 100644
--- a/python/pyspark/sql/connect/functions.py
+++ b/python/pyspark/sql/connect/functions.py
@@ -2444,9 +2444,6 @@ def _test() -> None:
     # TODO(SPARK-41849): implement DataFrameReader.text
     del pyspark.sql.connect.functions.input_file_name.__doc__

-    # TODO(SPARK-41850): fix isnan
-    del pyspark.sql.connect.functions.isnan.__doc__
-
     # Creates a remote Spark session.
     os.environ["SPARK_REMOTE"] = "sc://localhost"
     globs["spark"] = PySparkSession.builder.remote("sc://localhost").getOrCreate()

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.3 updated: [SPARK-41864][INFRA][PYTHON] Fix mypy linter errors
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new 977e445865c [SPARK-41864][INFRA][PYTHON] Fix mypy linter errors
977e445865c is described below

commit 977e445865c0b835b19db44a129502c135b3348a
Author: Dongjoon Hyun
AuthorDate: Tue Jan 3 15:00:50 2023 -0800

    [SPARK-41864][INFRA][PYTHON] Fix mypy linter errors

    Currently, the GitHub Action Python linter job is broken. This PR recovers the
    Python linter failure. There are two kinds of failures.

    1. https://github.com/apache/spark/actions/runs/3829330032/jobs/6524170799
    ```
    python/pyspark/pandas/sql_processor.py:221: error: unused "type: ignore" comment
    Found 1 error in 1 file (checked 380 source files)
    ```

    2. After fixing (1), we hit the following.
    ```
    ModuleNotFoundError: No module named 'py._path'; 'py' is not a package
    ```

    No user-facing change.

    Pass the GitHub CI on this PR. Or, manually run the following.
    ```
    $ dev/lint-python
    starting python compilation test...
    python compilation succeeded.

    starting black test...
    black checks passed.

    starting flake8 test...
    flake8 checks passed.

    starting mypy annotations test...
    annotations passed mypy checks.

    starting mypy examples test...
    examples passed mypy checks.

    starting mypy data test...
    annotations passed data checks.

    all lint-python tests passed!
    ```

    Closes #39373 from dongjoon-hyun/SPARK-41864.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit 13b2856e6e77392a417d2bb2ce804f873ee72b28)
    Signed-off-by: Dongjoon Hyun
---
 dev/requirements.txt                   | 1 +
 python/pyspark/pandas/sql_processor.py | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/dev/requirements.txt b/dev/requirements.txt
index e7e0a4b4274..79a70624312 100644
--- a/dev/requirements.txt
+++ b/dev/requirements.txt
@@ -43,3 +43,4 @@ PyGithub
 # pandas API on Spark Code formatter.
 black
+py

diff --git a/python/pyspark/pandas/sql_processor.py b/python/pyspark/pandas/sql_processor.py
index d8ae6888b68..7cf2f7461ba 100644
--- a/python/pyspark/pandas/sql_processor.py
+++ b/python/pyspark/pandas/sql_processor.py
@@ -218,7 +218,7 @@ def _get_ipython_scope() -> Dict[str, Any]:
     in an IPython notebook environment.
     """
     try:
-        from IPython import get_ipython  # type: ignore[import]
+        from IPython import get_ipython

         shell = get_ipython()
         return shell.user_ns

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
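The helper the patch touches follows a common "optional IPython" pattern: the import fails outside notebook environments, and `get_ipython()` returns `None` outside an interactive shell. A minimal standalone sketch of that pattern (simplified from `pyspark/pandas/sql_processor.py`; the function name here is illustrative, not PySpark's API):

```python
from typing import Any, Dict


def get_ipython_scope() -> Dict[str, Any]:
    """Return the IPython user namespace, or {} when not running under IPython.

    Two failure paths both fall back to {}: the import raises ImportError when
    IPython is not installed, and get_ipython() returns None outside an
    interactive shell, so shell.user_ns raises AttributeError.
    """
    try:
        from IPython import get_ipython  # may raise ImportError

        shell = get_ipython()
        return shell.user_ns
    except Exception:
        return {}


print(get_ipython_scope())  # {} in a plain (non-IPython) Python process
```

Dropping the `# type: ignore[import]` comment here is exactly the kind of change mypy's `warn_unused_ignores` flags once type stubs become available for the import.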
[spark] branch branch-3.2 updated: [SPARK-36883][INFRA] Upgrade R version to 4.1.1 in CI images
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 09f65c1e304 [SPARK-36883][INFRA] Upgrade R version to 4.1.1 in CI images
09f65c1e304 is described below

commit 09f65c1e304ade7036322920a97edc11fad1b194
Author: Dongjoon Hyun
AuthorDate: Wed Sep 29 11:39:01 2021 -0700

    [SPARK-36883][INFRA] Upgrade R version to 4.1.1 in CI images

    ### What changes were proposed in this pull request?

    This PR aims to upgrade the GitHub Action CI image to recover a CRAN installation failure.

    ### Why are the changes needed?

    Sometimes, the GitHub Action linter job failed.
    - https://github.com/apache/spark/runs/3739748809

    The new image has R 4.1.1 and will recover the failure.
    ```
    $ docker run -it --rm dongjoon/apache-spark-github-action-image:20210928 R --version
    R version 4.1.1 (2021-08-10) -- "Kick Things"
    Copyright (C) 2021 The R Foundation for Statistical Computing
    Platform: x86_64-pc-linux-gnu (64-bit)

    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under the terms of the
    GNU General Public License versions 2 or 3.
    For more information about these matters see
    https://www.gnu.org/licenses/.
    ```

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Pass `GitHub Action`.

    Closes #34138 from dongjoon-hyun/SPARK-36883.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit aa9064ad96ff7cefaa4381e912608b0b0d39a09c)
    Signed-off-by: Dongjoon Hyun
---
 .github/workflows/build_and_test.yml | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 45d688ea98e..4d8e6db73e0 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -168,7 +168,7 @@ jobs:
   name: "Build modules: ${{ matrix.modules }}"
   runs-on: ubuntu-20.04
   container:
-    image: dongjoon/apache-spark-github-action-image:20210730
+    image: dongjoon/apache-spark-github-action-image:20210930
   strategy:
     fail-fast: false
     matrix:
@@ -265,7 +265,7 @@ jobs:
   name: "Build modules: sparkr"
   runs-on: ubuntu-20.04
   container:
-    image: dongjoon/apache-spark-github-action-image:20210602
+    image: dongjoon/apache-spark-github-action-image:20210930
   env:
     HADOOP_PROFILE: hadoop3.2
     HIVE_PROFILE: hive2.3
@@ -328,8 +328,9 @@ jobs:
     LC_ALL: C.UTF-8
     LANG: C.UTF-8
     PYSPARK_DRIVER_PYTHON: python3.9
+    PYSPARK_PYTHON: python3.9
   container:
-    image: dongjoon/apache-spark-github-action-image:20210602
+    image: dongjoon/apache-spark-github-action-image:20210930
   steps:
   - name: Checkout Spark repository
     uses: actions/checkout@v2

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-41719][CORE] Skip SSLOptions sub-settings if `ssl` is disabled
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 98f7182122e [SPARK-41719][CORE] Skip SSLOptions sub-settings if `ssl` is disabled
98f7182122e is described below

commit 98f7182122e151cbf7ea83303e39c44d9acb1a72
Author: Shrikant Prasad
AuthorDate: Tue Jan 3 17:40:40 2023 -0800

    [SPARK-41719][CORE] Skip SSLOptions sub-settings if `ssl` is disabled

    ### What changes were proposed in this pull request?

    In SSLOptions, the rest of the settings should be set only when SSL is enabled.

    ### Why are the changes needed?

    If spark.ssl.enabled is false, there is no use in setting the rest of the spark.ssl.*
    settings in SSLOptions, as this requires unnecessary operations to set these properties.
    An additional implication is that if any error occurs while setting these properties, it
    causes a job failure that otherwise would not have happened since SSL is disabled. For
    example, if the user doesn't have access to the keystore path set in
    hadoop.security.credential.provider.path of hive-site.xml, it can result in a failure
    while launching the Spark shell, since SSLOptions won't be initialized due to the error
    in accessing the keystore.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Added a new test.

    Closes #39221 from shrprasa/ssl_options_fix.

    Authored-by: Shrikant Prasad
    Signed-off-by: Dongjoon Hyun
---
 core/src/main/scala/org/apache/spark/SSLOptions.scala  |  4 +++-
 .../test/scala/org/apache/spark/SSLOptionsSuite.scala  | 18 --
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/SSLOptions.scala b/core/src/main/scala/org/apache/spark/SSLOptions.scala
index f1668966d8e..d159f5717b0 100644
--- a/core/src/main/scala/org/apache/spark/SSLOptions.scala
+++ b/core/src/main/scala/org/apache/spark/SSLOptions.scala
@@ -181,7 +181,9 @@ private[spark] object SSLOptions extends Logging {
       ns: String,
       defaults: Option[SSLOptions] = None): SSLOptions = {
     val enabled = conf.getBoolean(s"$ns.enabled", defaultValue = defaults.exists(_.enabled))
-
+    if (!enabled) {
+      return new SSLOptions()
+    }
     val port = conf.getWithSubstitution(s"$ns.port").map(_.toInt)
     port.foreach { p =>
       require(p >= 0, "Port number must be a non-negative value.")
diff --git a/core/src/test/scala/org/apache/spark/SSLOptionsSuite.scala b/core/src/test/scala/org/apache/spark/SSLOptionsSuite.scala
index c990d81de2e..81bc4ae9da0 100644
--- a/core/src/test/scala/org/apache/spark/SSLOptionsSuite.scala
+++ b/core/src/test/scala/org/apache/spark/SSLOptionsSuite.scala
@@ -109,7 +109,7 @@ class SSLOptionsSuite extends SparkFunSuite {
     val conf = new SparkConf
     val hadoopConf = new Configuration()
     conf.set("spark.ssl.enabled", "true")
-    conf.set("spark.ssl.ui.enabled", "false")
+    conf.set("spark.ssl.ui.enabled", "true")
     conf.set("spark.ssl.ui.port", "4242")
     conf.set("spark.ssl.keyStore", keyStorePath)
     conf.set("spark.ssl.keyStorePassword", "password")
@@ -125,7 +125,7 @@ class SSLOptionsSuite extends SparkFunSuite {
     val defaultOpts = SSLOptions.parse(conf, hadoopConf, "spark.ssl", defaults = None)
     val opts = SSLOptions.parse(conf, hadoopConf, "spark.ssl.ui", defaults = Some(defaultOpts))

-    assert(opts.enabled === false)
+    assert(opts.enabled === true)
     assert(opts.port === Some(4242))
     assert(opts.trustStore.isDefined)
     assert(opts.trustStore.get.getName === "truststore")
@@ -140,6 +140,20 @@ class SSLOptionsSuite extends SparkFunSuite {
     assert(opts.enabledAlgorithms === Set("ABC", "DEF"))
   }

+  test("SPARK-41719: Skip ssl sub-settings if ssl is disabled") {
+    val keyStorePath = new File(this.getClass.getResource("/keystore").toURI).getAbsolutePath
+    val conf = new SparkConf
+    val hadoopConf = new Configuration()
+    conf.set("spark.ssl.enabled", "false")
+    conf.set("spark.ssl.keyStorePassword", "password")
+    conf.set("spark.ssl.keyStore", keyStorePath)
+    val sslOpts = SSLOptions.parse(conf, hadoopConf, "spark.ssl", defaults = None)
+
+    assert(sslOpts.enabled === false)
+    assert(sslOpts.keyStorePassword === None)
+    assert(sslOpts.keyStore === None)
+  }
+
   test("variable substitution") {
     val conf = new SparkConfWithEnv(Map(
       "ENV1" -> "val1",

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
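The fix above is an early-return guard: when `$ns.enabled` is false, return an empty options object before resolving any other `$ns.*` keys (which may touch keystores the user cannot read). A minimal Python sketch of the same guard, under assumed names (`parse_ssl_options` and the `SSLOptions` dataclass are illustrative, not Spark's Scala API):

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SSLOptions:
    enabled: bool = False
    key_store: Optional[str] = None
    key_store_password: Optional[str] = None


def parse_ssl_options(conf: Dict[str, str], ns: str = "spark.ssl") -> SSLOptions:
    """Mirror the patched logic: if $ns.enabled is false, skip every other
    sub-setting and return defaults immediately."""
    enabled = conf.get(f"{ns}.enabled", "false").lower() == "true"
    if not enabled:
        return SSLOptions()  # early return, as in the Scala fix
    return SSLOptions(
        enabled=True,
        key_store=conf.get(f"{ns}.keyStore"),
        key_store_password=conf.get(f"{ns}.keyStorePassword"),
    )


# Even though a keystore path is configured, it is never touched when disabled.
opts = parse_ssl_options({"spark.ssl.enabled": "false", "spark.ssl.keyStore": "/restricted/ks"})
assert opts.enabled is False and opts.key_store is None
```

The design point is that the guard avoids side effects (filesystem access, credential-provider lookups) for configuration that is semantically inert, so a broken sub-setting cannot fail a job that never uses SSL.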
[spark] branch branch-3.2 updated (63722c39462 -> 736964e73b7)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

 from 63722c39462 [SPARK-41865][INFRA][3.2] Use pycodestyle to 2.7.0 to fix pycodestyle errors
  add 736964e73b7 [SPARK-41030][BUILD][3.2] Upgrade `Apache Ivy` to 2.5.1

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 2 +-
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 2 +-
 pom.xml                                 | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.2 updated (ad2d42709ab -> 63722c39462)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

 from ad2d42709ab [SPARK-41863][INFRA][PYTHON][TESTS] Skip `flake8` tests if the command is not available
  add 63722c39462 [SPARK-41865][INFRA][3.2] Use pycodestyle to 2.7.0 to fix pycodestyle errors

No new revisions were added by this update.

Summary of changes:
 .github/workflows/build_and_test.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-41862][SQL] Fix correctness bug related to DEFAULT values in Orc reader
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new a2392be592b [SPARK-41862][SQL] Fix correctness bug related to DEFAULT values in Orc reader
a2392be592b is described below

commit a2392be592bf6aa75391ea50cbab77cde152f8ce
Author: Daniel Tenedorio
AuthorDate: Wed Jan 4 09:30:42 2023 +0900

    [SPARK-41862][SQL] Fix correctness bug related to DEFAULT values in Orc reader

    ### What changes were proposed in this pull request?

    This PR fixes a correctness bug related to column DEFAULT values in the Orc reader.

    * https://github.com/apache/spark/pull/37280 introduced a performance regression in the Orc reader.
    * https://github.com/apache/spark/pull/39362 fixed the performance regression, but stopped the column DEFAULT feature from working, causing a temporary correctness regression that we agreed for me to fix later.
    * This PR restores column DEFAULT functionality for Orc scans and fixes the correctness regression while not reintroducing the performance regression.

    ### Why are the changes needed?

    This PR fixes a correctness bug.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    This PR updates a unit test to exercise that the Orc scan functionality is correct.

    Closes #39370 from dtenedor/fix-perf-bug-orc-reader.

    Authored-by: Daniel Tenedorio
    Signed-off-by: Hyukjin Kwon
---
 .../datasources/orc/OrcDeserializer.scala          | 71 +-
 .../org/apache/spark/sql/sources/InsertSuite.scala | 15 +
 2 files changed, 19 insertions(+), 67 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala
index 5b207a04ada..5bac404fd53 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala
@@ -42,21 +42,26 @@ class OrcDeserializer(
   // is always null in this case
   // - a function that updates target column `index` otherwise.
   private val fieldWriters: Array[WritableComparable[_] => Unit] = {
+    // Assume we create a table backed by Orc files. Then if we later run a command "ALTER TABLE t
+    // ADD COLUMN c DEFAULT " on the Orc table, this adds one field to the Catalyst schema.
+    // Then if we query the old files with the new Catalyst schema, we should only apply the
+    // existence default value to the columns whose IDs are not explicitly requested.
+    if (requiredSchema.hasExistenceDefaultValues) {
+      for (i <- 0 until requiredSchema.existenceDefaultValues.size) {
+        requiredSchema.existenceDefaultsBitmask(i) =
+          if (requestedColIds(i) != -1) {
+            false
+          } else {
+            requiredSchema.existenceDefaultValues(i) != null
+          }
+      }
+    }
     requiredSchema.zipWithIndex
       .map { case (f, index) =>
         if (requestedColIds(index) == -1) {
           null
         } else {
-          // Create a RowUpdater instance for converting Orc objects to Catalyst rows. If any fields
-          // in the Orc result schema have associated existence default values, maintain a
-          // boolean array to track which fields have been explicitly assigned for each row.
-          val rowUpdater: RowUpdater =
-            if (requiredSchema.hasExistenceDefaultValues) {
-              resetExistenceDefaultsBitmask(requiredSchema)
-              new RowUpdaterWithBitmask(resultRow, requiredSchema.existenceDefaultsBitmask)
-            } else {
-              new RowUpdater(resultRow)
-            }
+          val rowUpdater = new RowUpdater(resultRow)
           val writer = newWriter(f.dataType, rowUpdater)
           (value: WritableComparable[_]) => writer(index, value)
         }
@@ -93,6 +98,7 @@ class OrcDeserializer(
       }
       targetColumnIndex += 1
     }
+    applyExistenceDefaultValuesToRow(requiredSchema, resultRow)
     resultRow
   }
@@ -288,49 +294,4 @@ class OrcDeserializer(
     override def setDouble(ordinal: Int, value: Double): Unit = array.setDouble(ordinal, value)
     override def setFloat(ordinal: Int, value: Float): Unit = array.setFloat(ordinal, value)
   }
-
-  /**
-   * Subclass of RowUpdater that also updates a boolean array bitmask. In this way, after all
-   * assignments are complete, it is possible to inspect the bitmask to determine which columns have
-   * been written at least once.
-   */
-  final class RowUpdaterWithBitmask(
-      row: InternalRow, bitmask: Array[Boolean]) extends RowUpdater(row) {
-    override def setNullAt(ordinal:
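The core of the fix above is computing the existence-defaults bitmask once, up front, instead of tracking per-row writes with a `RowUpdaterWithBitmask`: a column needs its DEFAULT applied only when it is absent from the Orc file (requested column id is -1) and a default value exists. A minimal Python sketch of that per-schema computation (the function name is illustrative, not Spark's API):

```python
from typing import List, Optional


def existence_defaults_bitmask(
    requested_col_ids: List[int],
    existence_defaults: List[Optional[object]],
) -> List[bool]:
    """Sketch of the bitmask computed once in the patched OrcDeserializer:
    True means "fill this column from its existence DEFAULT"; a column
    present in the file (id != -1) is always read from the file instead."""
    return [
        col_id == -1 and default is not None
        for col_id, default in zip(requested_col_ids, existence_defaults)
    ]


# Two columns present in an old Orc file, one added later via
# ALTER TABLE ... ADD COLUMN ... DEFAULT, so only the third needs the default.
mask = existence_defaults_bitmask([0, 1, -1], [None, None, 42])
assert mask == [False, False, True]
```

Because the mask depends only on the schema and the requested column ids, it can be computed once per file rather than reset per row, which is how the patch avoids reintroducing the per-row bookkeeping that caused the earlier performance regression.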
[spark] branch master updated (7da7ad3c5b9 -> c26d59864a9)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

 from 7da7ad3c5b9 [SPARK-41423][CORE][BUILD] Exclude StageData.rddIds, this and accumulatorUpdates for Scala 2.13
  add c26d59864a9 [SPARK-41856][CONNECT][TESTS] Enable test_create_nan_decimal_dataframe, test_freqItems, test_input_files, test_to_pandas_required_pandas_not_found

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/tests/connect/test_parity_dataframe.py | 12 ------------
 python/pyspark/sql/tests/test_dataframe.py                |  2 +-
 2 files changed, 1 insertion(+), 13 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (1a5ef40a4d5 -> 7da7ad3c5b9)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

 from 1a5ef40a4d5 [SPARK-41863][INFRA][PYTHON][TESTS] Skip `flake8` tests if the command is not available
  add 7da7ad3c5b9 [SPARK-41423][CORE][BUILD] Exclude StageData.rddIds, this and accumulatorUpdates for Scala 2.13

No new revisions were added by this update.

Summary of changes:
 project/MimaExcludes.scala | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.2 updated: [SPARK-41863][INFRA][PYTHON][TESTS] Skip `flake8` tests if the command is not available
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new ad2d42709ab [SPARK-41863][INFRA][PYTHON][TESTS] Skip `flake8` tests if the command is not available

ad2d42709ab is described below

commit ad2d42709abfc8f8ad27f836c811a4b75ef32ee9
Author: Dongjoon Hyun
AuthorDate: Tue Jan 3 15:01:43 2023 -0800

    [SPARK-41863][INFRA][PYTHON][TESTS] Skip `flake8` tests if the command is not available

    ### What changes were proposed in this pull request?

    This PR aims to skip `flake8` tests if the command is not available.

    ### Why are the changes needed?

    Linters are optional modules, and they can already be skipped on some systems, like `mypy`.
    ```
    $ dev/lint-python
    starting python compilation test...
    python compilation succeeded.

    The Python library providing 'black' module was not found. Skipping black checks for now.

    The flake8 command was not found. Skipping for now.

    The mypy command was not found. Skipping for now.

    all lint-python tests passed!
    ```

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Manual tests.

    Closes #39372 from dongjoon-hyun/SPARK-41863.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit 1a5ef40a4d59b377b028b55ea3805caf5d55f28f)
    Signed-off-by: Dongjoon Hyun
---
 dev/lint-python | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/dev/lint-python b/dev/lint-python
index e54e391c587..031b34f4af9 100755
--- a/dev/lint-python
+++ b/dev/lint-python
@@ -164,9 +164,8 @@ function flake8_test {
     local FLAKE8_STATUS=

     if ! hash "$FLAKE8_BUILD" 2> /dev/null; then
-        echo "The flake8 command was not found."
-        echo "flake8 checks failed."
-        exit 1
+        echo "The flake8 command was not found. Skipping for now."
+        return
     fi

     _FLAKE8_VERSION=($($FLAKE8_BUILD --version))
[spark] branch branch-3.3 updated: [SPARK-41863][INFRA][PYTHON][TESTS] Skip `flake8` tests if the command is not available
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new 2da30ad0658 [SPARK-41863][INFRA][PYTHON][TESTS] Skip `flake8` tests if the command is not available

2da30ad0658 is described below

commit 2da30ad0658406462ede656ad368f890e7051a5e
Author: Dongjoon Hyun
AuthorDate: Tue Jan 3 15:01:43 2023 -0800

    [SPARK-41863][INFRA][PYTHON][TESTS] Skip `flake8` tests if the command is not available

    ### What changes were proposed in this pull request?

    This PR aims to skip `flake8` tests if the command is not available.

    ### Why are the changes needed?

    Linters are optional modules, and they can already be skipped on some systems, like `mypy`.
    ```
    $ dev/lint-python
    starting python compilation test...
    python compilation succeeded.

    The Python library providing 'black' module was not found. Skipping black checks for now.

    The flake8 command was not found. Skipping for now.

    The mypy command was not found. Skipping for now.

    all lint-python tests passed!
    ```

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Manual tests.

    Closes #39372 from dongjoon-hyun/SPARK-41863.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit 1a5ef40a4d59b377b028b55ea3805caf5d55f28f)
    Signed-off-by: Dongjoon Hyun
---
 dev/lint-python | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/dev/lint-python b/dev/lint-python
index f0ca8832be0..5505f4b1105 100755
--- a/dev/lint-python
+++ b/dev/lint-python
@@ -173,9 +173,8 @@ function flake8_test {
     local FLAKE8_STATUS=

     if ! hash "$FLAKE8_BUILD" 2> /dev/null; then
-        echo "The flake8 command was not found."
-        echo "flake8 checks failed."
-        exit 1
+        echo "The flake8 command was not found. Skipping for now."
+        return
     fi

     _FLAKE8_VERSION=($($FLAKE8_BUILD --version))
[spark] branch master updated: [SPARK-41863][INFRA][PYTHON][TESTS] Skip `flake8` tests if the command is not available
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 1a5ef40a4d5 [SPARK-41863][INFRA][PYTHON][TESTS] Skip `flake8` tests if the command is not available

1a5ef40a4d5 is described below

commit 1a5ef40a4d59b377b028b55ea3805caf5d55f28f
Author: Dongjoon Hyun
AuthorDate: Tue Jan 3 15:01:43 2023 -0800

    [SPARK-41863][INFRA][PYTHON][TESTS] Skip `flake8` tests if the command is not available

    ### What changes were proposed in this pull request?

    This PR aims to skip `flake8` tests if the command is not available.

    ### Why are the changes needed?

    Linters are optional modules, and they can already be skipped on some systems, like `mypy`.
    ```
    $ dev/lint-python
    starting python compilation test...
    python compilation succeeded.

    The Python library providing 'black' module was not found. Skipping black checks for now.

    The flake8 command was not found. Skipping for now.

    The mypy command was not found. Skipping for now.

    all lint-python tests passed!
    ```

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Manual tests.

    Closes #39372 from dongjoon-hyun/SPARK-41863.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 dev/lint-python | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/dev/lint-python b/dev/lint-python
index f1f4e9f1070..b5ee63e3869 100755
--- a/dev/lint-python
+++ b/dev/lint-python
@@ -175,9 +175,8 @@ function flake8_test {
     local FLAKE8_STATUS=

     if ! hash "$FLAKE8_BUILD" 2> /dev/null; then
-        echo "The flake8 command was not found."
-        echo "flake8 checks failed."
-        exit 1
+        echo "The flake8 command was not found. Skipping for now."
+        return
     fi

     _FLAKE8_VERSION=($($FLAKE8_BUILD --version))
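The skip-if-missing pattern adopted above (`hash "$FLAKE8_BUILD" || return` instead of a hard `exit 1`) can be sketched in Python with `shutil.which`; the function name and linter name below are illustrative, not part of `dev/lint-python`.

```python
import shutil
import sys


def run_optional_linter(command, run_checks):
    """Run a linter only when its executable is on PATH; otherwise skip,
    mirroring the new flake8 behavior instead of failing the whole build."""
    if shutil.which(command) is None:
        print(f"The {command} command was not found. Skipping for now.")
        return None  # skipped, not failed
    return run_checks()


# A command that is certainly absent is skipped...
assert run_optional_linter("no-such-linter-xyz", lambda: "passed") is None
# ...while an available executable (the running interpreter) is invoked.
assert run_optional_linter(sys.executable, lambda: "passed") == "passed"
```

The design point is the return value: a missing optional tool yields "skipped" rather than a non-zero exit, so the remaining linters still run.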
[spark] branch master updated: [SPARK-41864][INFRA][PYTHON] Fix mypy linter errors
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 13b2856e6e7 [SPARK-41864][INFRA][PYTHON] Fix mypy linter errors

13b2856e6e7 is described below

commit 13b2856e6e77392a417d2bb2ce804f873ee72b28
Author: Dongjoon Hyun
AuthorDate: Tue Jan 3 15:00:50 2023 -0800

    [SPARK-41864][INFRA][PYTHON] Fix mypy linter errors

    ### What changes were proposed in this pull request?

    Currently, the GitHub Action Python linter job is broken. This PR recovers the Python linter from the failure.

    ### Why are the changes needed?

    There are two kinds of failures.

    1. https://github.com/apache/spark/actions/runs/3829330032/jobs/6524170799
    ```
    python/pyspark/pandas/sql_processor.py:221: error: unused "type: ignore" comment
    Found 1 error in 1 file (checked 380 source files)
    ```

    2. After fixing (1), we hit the following.
    ```
    ModuleNotFoundError: No module named 'py._path'; 'py' is not a package
    ```

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Pass the GitHub CI on this PR. Or, manually run the following.
    ```
    $ dev/lint-python
    starting python compilation test...
    python compilation succeeded.

    starting black test...
    black checks passed.

    starting flake8 test...
    flake8 checks passed.

    starting mypy annotations test...
    annotations passed mypy checks.

    starting mypy examples test...
    examples passed mypy checks.

    starting mypy data test...
    annotations passed data checks.

    all lint-python tests passed!
    ```

    Closes #39373 from dongjoon-hyun/SPARK-41864.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 dev/requirements.txt                   | 1 +
 python/pyspark/pandas/sql_processor.py | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/dev/requirements.txt b/dev/requirements.txt
index c3911b57eb9..1d978c4602c 100644
--- a/dev/requirements.txt
+++ b/dev/requirements.txt
@@ -47,6 +47,7 @@ PyGithub

 # pandas API on Spark Code formatter.
 black==22.6.0
+py

 # Spark Connect (required)
 grpcio==1.48.1

diff --git a/python/pyspark/pandas/sql_processor.py b/python/pyspark/pandas/sql_processor.py
index ec6b0498511..28e2329b8f9 100644
--- a/python/pyspark/pandas/sql_processor.py
+++ b/python/pyspark/pandas/sql_processor.py
@@ -218,7 +218,7 @@ def _get_ipython_scope() -> Dict[str, Any]:
     in an IPython notebook environment.
     """
     try:
-        from IPython import get_ipython  # type: ignore[import]
+        from IPython import get_ipython
         shell = get_ipython()
         return shell.user_ns
[spark] branch master updated (23aec321bd8 -> 7ede493bfca)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 23aec321bd8 [SPARK-41049][SQL][FOLLOWUP] Move expression initialization code to the base class
add 7ede493bfca [SPARK-41814][SPARK-41851][SPARK-41852][FOLLOW-UP] Reeanble skipped doctests

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/column.py            | 3 +--
 python/pyspark/sql/connect/functions.py | 6 ------
 2 files changed, 1 insertion(+), 8 deletions(-)
[spark] branch master updated: [SPARK-41049][SQL][FOLLOWUP] Move expression initialization code to the base class
This is an automated email from the ASF dual-hosted git repository.

viirya pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 23aec321bd8 [SPARK-41049][SQL][FOLLOWUP] Move expression initialization code to the base class

23aec321bd8 is described below

commit 23aec321bd822867a698ee3bc17b21753ce8
Author: Wenchen Fan
AuthorDate: Tue Jan 3 10:46:44 2023 -0800

    [SPARK-41049][SQL][FOLLOWUP] Move expression initialization code to the base class

    ### What changes were proposed in this pull request?

    This is a followup of https://github.com/apache/spark/pull/39248 , to add one more code cleanup. The expression initialization code is duplicated 6 times, and we should put it in the base class.

    ### Why are the changes needed?

    Code cleanup.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Existing tests.

    Closes #39364 from cloud-fan/expr.

    Authored-by: Wenchen Fan
    Signed-off-by: Liang-Chi Hsieh
---
 .../sql/catalyst/expressions/ExpressionsEvaluator.scala         | 7 +++++++
 .../sql/catalyst/expressions/InterpretedMutableProjection.scala | 5 +----
 .../sql/catalyst/expressions/InterpretedSafeProjection.scala    | 5 +----
 .../sql/catalyst/expressions/InterpretedUnsafeProjection.scala  | 5 +----
 .../org/apache/spark/sql/catalyst/expressions/Projection.scala  | 5 +----
 .../org/apache/spark/sql/catalyst/expressions/predicates.scala  | 6 +-----
 6 files changed, 12 insertions(+), 21 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExpressionsEvaluator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExpressionsEvaluator.scala
index dcbc6926cd3..1fc0144fede 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExpressionsEvaluator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExpressionsEvaluator.scala
@@ -42,4 +42,11 @@ trait ExpressionsEvaluator {
    * The default implementation does nothing.
    */
   def initialize(partitionIndex: Int): Unit = {}
+
+  protected def initializeExprs(exprs: Seq[Expression], partitionIndex: Int): Unit = {
+    exprs.foreach(_.foreach {
+      case n: Nondeterministic => n.initialize(partitionIndex)
+      case _ =>
+    })
+  }
 }

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedMutableProjection.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedMutableProjection.scala
index 682604b9bf7..01e9de085da 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedMutableProjection.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedMutableProjection.scala
@@ -41,10 +41,7 @@ class InterpretedMutableProjection(expressions: Seq[Expression]) extends Mutable
   private[this] val buffer = new Array[Any](expressions.size)

   override def initialize(partitionIndex: Int): Unit = {
-    exprs.foreach(_.foreach {
-      case n: Nondeterministic => n.initialize(partitionIndex)
-      case _ =>
-    })
+    initializeExprs(exprs, partitionIndex)
   }

   private[this] val validExprs = expressions.zipWithIndex.filter {

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedSafeProjection.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedSafeProjection.scala
index 84263d97f5d..87539e80b0b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedSafeProjection.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedSafeProjection.scala
@@ -101,10 +101,7 @@ class InterpretedSafeProjection(expressions: Seq[Expression]) extends Projection
   }

   override def initialize(partitionIndex: Int): Unit = {
-    expressions.foreach(_.foreach {
-      case n: Nondeterministic => n.initialize(partitionIndex)
-      case _ =>
-    })
+    initializeExprs(exprs, partitionIndex)
   }

   override def apply(row: InternalRow): InternalRow = {

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedUnsafeProjection.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedUnsafeProjection.scala
index 9108a045c09..90a90444695 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedUnsafeProjection.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedUnsafeProjection.scala
@@ -67,10 +67,7 @@ class InterpretedUnsafeProjection(expressions: Array[Expression]) extends Unsafe
   }

   override def initialize(partitionIndex: Int): Unit
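The refactor above hoists the same "initialize every nondeterministic expression" loop out of six subclasses into one protected helper on the base trait. A Python analogue of that pattern (class and attribute names are illustrative, not Spark's Python API):

```python
class ExpressionsEvaluator:
    """Base class: the shared initialization walk lives here once,
    instead of being repeated in every projection/predicate subclass."""

    def initialize_exprs(self, exprs, partition_index):
        for expr in exprs:
            # Only expressions that declare themselves nondeterministic
            # carry per-partition state that needs initializing.
            if getattr(expr, "nondeterministic", False):
                expr.initialize(partition_index)


class RandExpr:
    nondeterministic = True

    def __init__(self):
        self.partition_index = None

    def initialize(self, partition_index):
        self.partition_index = partition_index


class LiteralExpr:
    nondeterministic = False


class InterpretedProjection(ExpressionsEvaluator):
    def __init__(self, exprs):
        self.exprs = exprs

    def initialize(self, partition_index):
        # Each subclass now delegates to the shared helper.
        self.initialize_exprs(self.exprs, partition_index)


proj = InterpretedProjection([RandExpr(), LiteralExpr()])
proj.initialize(partition_index=3)
print(proj.exprs[0].partition_index)  # → 3
```

As in the Scala diff, subclasses keep their public `initialize` override but its body collapses to a single delegation call.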
[spark] branch master updated: [SPARK-41858][SQL] Fix ORC reader perf regression due to DEFAULT value feature
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new d81e55e1ff9 [SPARK-41858][SQL] Fix ORC reader perf regression due to DEFAULT value feature

d81e55e1ff9 is described below

commit d81e55e1ff998c624fa80c5660d7724701b4df23
Author: Dongjoon Hyun
AuthorDate: Tue Jan 3 10:40:44 2023 -0800

    [SPARK-41858][SQL] Fix ORC reader perf regression due to DEFAULT value feature

    ### What changes were proposed in this pull request?

    This PR is a partial and logical revert of SPARK-39862 (https://github.com/apache/spark/pull/37280) to fix the huge ORC reader perf regression (3x slower). SPARK-39862 should propose a fix without the perf regression.

    ### Why are the changes needed?

    During Apache Spark 3.4.0 preparation, SPARK-41782 identified a perf regression.
    - https://github.com/apache/spark/pull/39301#discussion_r1059239575

    ### Does this PR introduce _any_ user-facing change?

    After this PR, the regression is removed. However, the bug in the DEFAULT value feature will remain. This should be handled separately.

    ### How was this patch tested?

    Pass the CI.

    Closes #39362 from dongjoon-hyun/SPARK-41858.
Authored-by: Dongjoon Hyun
Signed-off-by: Dongjoon Hyun
---
 .../execution/datasources/orc/OrcDeserializer.scala | 21 +++++++++++----------
 .../org/apache/spark/sql/sources/InsertSuite.scala  |  9 +++++++--
 2 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala
index 5276f5c6d7b..5b207a04ada 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala
@@ -57,14 +57,7 @@ class OrcDeserializer(
         } else {
           new RowUpdater(resultRow)
         }
-        val writer: (Int, WritableComparable[_]) => Unit =
-          (ordinal, value) =>
-            if (value == null) {
-              rowUpdater.setNullAt(ordinal)
-            } else {
-              val writerFunc = newWriter(f.dataType, rowUpdater)
-              writerFunc(ordinal, value)
-            }
+        val writer = newWriter(f.dataType, rowUpdater)
         (value: WritableComparable[_]) => writer(index, value)
       }
     }.toArray
@@ -75,7 +68,11 @@ class OrcDeserializer(
     while (targetColumnIndex < fieldWriters.length) {
       if (fieldWriters(targetColumnIndex) != null) {
         val value = orcStruct.getFieldValue(requestedColIds(targetColumnIndex))
-        fieldWriters(targetColumnIndex)(value)
+        if (value == null) {
+          resultRow.setNullAt(targetColumnIndex)
+        } else {
+          fieldWriters(targetColumnIndex)(value)
+        }
       }
       targetColumnIndex += 1
     }
@@ -88,7 +85,11 @@ class OrcDeserializer(
     while (targetColumnIndex < fieldWriters.length) {
       if (fieldWriters(targetColumnIndex) != null) {
         val value = orcValues(requestedColIds(targetColumnIndex))
-        fieldWriters(targetColumnIndex)(value)
+        if (value == null) {
+          resultRow.setNullAt(targetColumnIndex)
+        } else {
+          fieldWriters(targetColumnIndex)(value)
+        }
       }
       targetColumnIndex += 1
     }

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala
index dd37c93871e..7c4a39d6ff4 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala
@@ -1679,7 +1679,8 @@ class InsertSuite extends DataSourceTest with SharedSparkSession {
       Config(
         None),
       Config(
-        Some(SQLConf.ORC_VECTORIZED_READER_ENABLED.key -> "false",
+        Some(SQLConf.ORC_VECTORIZED_READER_ENABLED.key -> "false"),
+        insertNullsToStorage = false))),
     TestCase(
       dataSource = "parquet",
       Seq(
@@ -1943,7 +1944,11 @@ class InsertSuite extends DataSourceTest with SharedSparkSession {
         Row(Seq(Row(1, 2)), Seq(Map(false -> "def", true -> "jkl"))),
         Seq(Map(true -> "xyz"))),
       Row(2,
-        null,
+        if (config.dataSource != "orc") {
+          null
+        } else {
+          Row(Seq(Row(1, 2)), Seq(Map(false -> "def", true -> "jkl")))
+        },
         Seq(Map(true -> "xyz"))),
       Row(3, Row(Seq(Row(3, 4)), Seq(Map(false -> "mno", true ->
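The performance bug fixed above is that the old per-value closure called `newWriter(...)` again for every non-null value, rebuilding the field writer on each row; the fix builds the writer once and handles nulls at the call site. A Python sketch of the difference (function names are illustrative, not Spark's):

```python
def make_field_writer_slow(build_writer):
    """Regression shape: the concrete writer is rebuilt per non-null value."""
    def write(row, ordinal, value):
        if value is None:
            row[ordinal] = None
        else:
            build_writer()(row, ordinal, value)  # rebuilt on every call!
    return write


def make_field_writer_fast(build_writer):
    """Fixed shape: build the writer once; nulls are handled by the caller."""
    writer = build_writer()
    def write(row, ordinal, value):
        if value is None:
            row[ordinal] = None
        else:
            writer(row, ordinal, value)
    return write


builds = {"n": 0}

def build_writer():
    builds["n"] += 1
    return lambda row, ordinal, value: row.__setitem__(ordinal, value)

row = [None]
fast = make_field_writer_fast(build_writer)
for v in (1, 2, None, 3):
    fast(row, 0, v)

print(builds["n"])  # → 1 (built once, however many values flow through)
```

With the slow shape, `builds["n"]` would grow with every non-null value, which is exactly the per-row overhead the revert eliminates.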
[spark] branch master updated (3c40be2dddc -> ec594236df4)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 3c40be2dddc [SPARK-41405][SQL] Centralize the column resolution logic
add ec594236df4 [SPARK-41853][CORE] Use Map in place of SortedMap for ErrorClassesJsonReader

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
[spark] branch master updated (f0d9692c5d2 -> 3c40be2dddc)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from f0d9692c5d2 [SPARK-41855][SPARK-41814][SPARK-41851][SPARK-41852][CONNECT][PYTHON] Make `createDataFrame` handle None/NaN properly
add 3c40be2dddc [SPARK-41405][SQL] Centralize the column resolution logic

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/analysis/Analyzer.scala     | 797 ++---
 .../ResolveLateralColumnAliasReference.scala       |  24 +-
 .../spark/sql/catalyst/analysis/unresolved.scala   |  19 +-
 .../catalyst/expressions/namedExpressions.scala    |  23 +-
 .../spark/sql/catalyst/expressions/subquery.scala  |   9 +-
 .../sql/catalyst/rules/RuleIdCollection.scala      |   1 -
 .../spark/sql/catalyst/trees/TreePatterns.scala    |   1 +
 .../apache/spark/sql/LateralColumnAliasSuite.scala |   3 +-
 8 files changed, 424 insertions(+), 453 deletions(-)
[spark] branch master updated (5935693185d -> f0d9692c5d2)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 5935693185d [SPARK-41857][CONNECT][TESTS] Enable test_between_function, test_datetime_functions, test_expr, test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, test_approxQuantile
add f0d9692c5d2 [SPARK-41855][SPARK-41814][SPARK-41851][SPARK-41852][CONNECT][PYTHON] Make `createDataFrame` handle None/NaN properly

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/session.py       |  52 +++++++---
 .../sql/tests/connect/test_connect_basic.py | 102 +++++++++++++++
 2 files changed, 142 insertions(+), 12 deletions(-)
[spark] branch master updated (02f12eeed0c -> 5935693185d)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 02f12eeed0c [SPARK-41658][SPARK-41656][DOCS][FOLLOW-UP] Update JIRAs in skipped tests' comments
add 5935693185d [SPARK-41857][CONNECT][TESTS] Enable test_between_function, test_datetime_functions, test_expr, test_math_functions, test_window_functions_cumulative_sum, test_corr, test_cov, test_crosstab, test_approxQuantile

No new revisions were added by this update.

Summary of changes:
 .../sql/tests/connect/test_parity_functions.py | 36 ------
 1 file changed, 36 deletions(-)