(spark) branch master updated: [SPARK-47045][SQL] Replace `IllegalArgumentException` by `SparkIllegalArgumentException` in `sql/api`
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 9ef552691e1d [SPARK-47045][SQL] Replace `IllegalArgumentException` by `SparkIllegalArgumentException` in `sql/api`
9ef552691e1d is described below

commit 9ef552691e1d4725d7a64b45e6cdee9e5e75f992
Author: Max Gekk
AuthorDate: Thu Feb 15 10:28:21 2024 +0300

    [SPARK-47045][SQL] Replace `IllegalArgumentException` by `SparkIllegalArgumentException` in `sql/api`

    ### What changes were proposed in this pull request?
    In this PR, I propose to replace all `IllegalArgumentException` by `SparkIllegalArgumentException` in the `sql/api` code base, and to introduce new legacy error classes with the `_LEGACY_ERROR_TEMP_` prefix.

    ### Why are the changes needed?
    To unify Spark SQL exceptions, and to port Java exceptions onto Spark exceptions with error classes.

    ### Does this PR introduce _any_ user-facing change?
    Yes, it can if a user's code assumes some particular format of `IllegalArgumentException` messages.

    ### How was this patch tested?
    By running existing test suites like:
    ```
    $ build/sbt "core/testOnly *SparkThrowableSuite"
    ```

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #45098 from MaxGekk/migrate-IllegalArgumentException-sql.
Authored-by: Max Gekk
Signed-off-by: Max Gekk
---
 R/pkg/tests/fulltests/test_streaming.R             |  3 +-
 .../src/main/resources/error/error-classes.json    | 70 +++
 .../src/main/scala/org/apache/spark/sql/Row.scala  | 11 ++-
 .../catalyst/streaming/InternalOutputModes.scala   |  7 +-
 .../catalyst/util/DateTimeFormatterHelper.scala    | 18 +++--
 .../sql/catalyst/util/SparkIntervalUtils.scala     |  8 ++-
 .../sql/catalyst/util/TimestampFormatter.scala     |  6 +-
 .../spark/sql/execution/streaming/Triggers.scala   |  5 +-
 .../org/apache/spark/sql/types/DataType.scala      | 19 ++---
 .../org/apache/spark/sql/types/StructType.scala    | 25 ---
 .../results/datetime-formatting-invalid.sql.out    | 81 +-
 .../org/apache/spark/sql/JsonFunctionsSuite.scala  | 13 ++--
 12 files changed, 206 insertions(+), 60 deletions(-)

diff --git a/R/pkg/tests/fulltests/test_streaming.R b/R/pkg/tests/fulltests/test_streaming.R
index 8804471e640c..67479726b57c 100644
--- a/R/pkg/tests/fulltests/test_streaming.R
+++ b/R/pkg/tests/fulltests/test_streaming.R
@@ -257,7 +257,8 @@ test_that("Trigger", {
                "Value for trigger.processingTime must be a non-empty string.")
   expect_error(write.stream(df, "memory", queryName = "times", outputMode = "append",
-                            trigger.processingTime = "invalid"), "illegal argument")
+                            trigger.processingTime = "invalid"),
+               "Error parsing 'invalid' to interval, unrecognized number 'invalid'")
   expect_error(write.stream(df, "memory", queryName = "times", outputMode = "append",
                trigger.once = ""), "Value for trigger.once must be TRUE.")

diff --git a/common/utils/src/main/resources/error/error-classes.json b/common/utils/src/main/resources/error/error-classes.json
index 5884c9267119..38161ff87720 100644
--- a/common/utils/src/main/resources/error/error-classes.json
+++ b/common/utils/src/main/resources/error/error-classes.json
@@ -7767,6 +7767,76 @@
       "Single backslash is prohibited. It has special meaning as beginning of an escape sequence. To get the backslash character, pass a string with two backslashes as the delimiter."
     ]
   },
+  "_LEGACY_ERROR_TEMP_3249" : {
+    "message" : [
+      "Failed to convert value (class of }) with the type of to JSON."
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3250" : {
+    "message" : [
+      "Failed to convert the JSON string '' to a field."
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3251" : {
+    "message" : [
+      "Failed to convert the JSON string '' to a data type."
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3252" : {
+    "message" : [
+      " does not exist. Available: "
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3253" : {
+    "message" : [
+      " do(es) not exist. Available: "
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3254" : {
+    "message" : [
+      " does not exist. Available: "
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3255" : {
+    "message" : [
+      "Error parsing '' to interval, "
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3256" : {
+    "message" : [
+      "Unrecognized datetime pattern: "
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3257" : {
+    "message" : [
+      "All week-based patterns are unsupported since Spark 3.0, detected: , Please use the SQL function EXTRACT instead"
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3258" : {
+    "message" : [
+      "Illegal pattern character: "
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3259" : {
+    "message" : [
+      "Too m
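The new legacy error classes above carry message templates. In Spark's error framework these templates contain named placeholders (`<param>`-style markers, which this plain-text archive has stripped from the JSON excerpt). As a rough illustration only — the function and placeholder names below are assumptions, not Spark's actual API — template substitution can be sketched in Python:

```python
import re

def format_error_message(template: str, params: dict) -> str:
    """Substitute <name> placeholders in an error-class message template.

    This loosely mirrors how error-class messages are rendered; the helper
    and the placeholder names used below are illustrative, not Spark's API.
    """
    def repl(match: re.Match) -> str:
        name = match.group(1)
        # Leave unknown placeholders untouched rather than failing.
        return str(params.get(name, match.group(0)))

    return re.sub(r"<([A-Za-z_][A-Za-z0-9_]*)>", repl, template)

# A template shaped like _LEGACY_ERROR_TEMP_3255 (placeholder names assumed):
template = "Error parsing '<input>' to interval, <msg>"
print(format_error_message(
    template,
    {"input": "invalid", "msg": "unrecognized number 'invalid'"},
))
# Error parsing 'invalid' to interval, unrecognized number 'invalid'
```

The rendered string matches the message the updated R test (`test_streaming.R`) now expects in place of the bare "illegal argument" text.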
Re: [PR] Remove Apache Spark 3.3.4 EOL version from Download page [spark-website]
dongjoon-hyun merged PR #500:
URL: https://github.com/apache/spark-website/pull/500

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
Re: [PR] Remove Apache Spark 3.3.4 EOL version from Download page [spark-website]
dongjoon-hyun commented on PR #500:
URL: https://github.com/apache/spark-website/pull/500#issuecomment-1945479250

   Thank you, @viirya !
(spark-website) branch asf-site updated: Remove Apache Spark 3.3.4 EOL version from Download page (#500)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new 07a8f9831c Remove Apache Spark 3.3.4 EOL version from Download page (#500)
07a8f9831c is described below

commit 07a8f9831c34c8056741cf8d58666a7408831259
Author: Dongjoon Hyun
AuthorDate: Wed Feb 14 23:06:42 2024 -0800

    Remove Apache Spark 3.3.4 EOL version from Download page (#500)
---
 js/downloads.js      | 4 ----
 site/js/downloads.js | 4 ----
 2 files changed, 8 deletions(-)

diff --git a/js/downloads.js b/js/downloads.js
index 6d3caff97b..2a5690c041 100644
--- a/js/downloads.js
+++ b/js/downloads.js
@@ -13,18 +13,14 @@ function addRelease(version, releaseDate, packages, mirrored) {
 var sources = {pretty: "Source Code", tag: "sources"};
 var hadoopFree = {pretty: "Pre-built with user-provided Apache Hadoop", tag: "without-hadoop"};
-var hadoop2p = {pretty: "Pre-built for Apache Hadoop 2.7", tag: "hadoop2"};
 var hadoop3p = {pretty: "Pre-built for Apache Hadoop 3.3 and later", tag: "hadoop3"};
 var hadoop3pscala213 = {pretty: "Pre-built for Apache Hadoop 3.3 and later (Scala 2.13)", tag: "hadoop3-scala2.13"};
 
-// 3.3.0+
-var packagesV13 = [hadoop3p, hadoop3pscala213, hadoop2p, hadoopFree, sources];
 // 3.4.0+
 var packagesV14 = [hadoop3p, hadoop3pscala213, hadoopFree, sources];
 
 addRelease("3.5.0", new Date("09/13/2023"), packagesV14, true);
 addRelease("3.4.2", new Date("11/30/2023"), packagesV14, true);
-addRelease("3.3.4", new Date("12/16/2023"), packagesV13, true);
 
 function append(el, contents) {
   el.innerHTML += contents;

diff --git a/site/js/downloads.js b/site/js/downloads.js
index 6d3caff97b..2a5690c041 100644
--- a/site/js/downloads.js
+++ b/site/js/downloads.js
@@ -13,18 +13,14 @@ function addRelease(version, releaseDate, packages, mirrored) {
 var sources = {pretty: "Source Code", tag: "sources"};
 var hadoopFree = {pretty: "Pre-built with user-provided Apache Hadoop", tag: "without-hadoop"};
-var hadoop2p = {pretty: "Pre-built for Apache Hadoop 2.7", tag: "hadoop2"};
 var hadoop3p = {pretty: "Pre-built for Apache Hadoop 3.3 and later", tag: "hadoop3"};
 var hadoop3pscala213 = {pretty: "Pre-built for Apache Hadoop 3.3 and later (Scala 2.13)", tag: "hadoop3-scala2.13"};
 
-// 3.3.0+
-var packagesV13 = [hadoop3p, hadoop3pscala213, hadoop2p, hadoopFree, sources];
 // 3.4.0+
 var packagesV14 = [hadoop3p, hadoop3pscala213, hadoopFree, sources];
 
 addRelease("3.5.0", new Date("09/13/2023"), packagesV14, true);
 addRelease("3.4.2", new Date("11/30/2023"), packagesV14, true);
-addRelease("3.3.4", new Date("12/16/2023"), packagesV13, true);
 
 function append(el, contents) {
   el.innerHTML += contents;
svn commit: r67353 - /release/spark/spark-3.3.4/
Author: dongjoon
Date: Thu Feb 15 06:27:50 2024
New Revision: 67353

Log:
Remove Apache Spark 3.3.4 because it reached the end of support

Removed:
    release/spark/spark-3.3.4/
Re: [PR] Remove Apache Spark 3.3.4 EOL version from Download page [spark-website]
dongjoon-hyun commented on code in PR #500:
URL: https://github.com/apache/spark-website/pull/500#discussion_r1490473831


##########
js/downloads.js:
##########
@@ -13,18 +13,14 @@ function addRelease(version, releaseDate, packages, mirrored) {
 var sources = {pretty: "Source Code", tag: "sources"};
 var hadoopFree = {pretty: "Pre-built with user-provided Apache Hadoop", tag: "without-hadoop"};
-var hadoop2p = {pretty: "Pre-built for Apache Hadoop 2.7", tag: "hadoop2"};

Review Comment:
   From now on, Hadoop 2 is gone completely.
Re: [PR] Remove Apache Spark 3.3.4 EOL version from Download page [spark-website]
dongjoon-hyun commented on PR #500:
URL: https://github.com/apache/spark-website/pull/500#issuecomment-1945441937

   Hi, @HyukjinKwon . Could you review this website update PR?
[PR] Remove Apache Spark 3.3.4 EOL version from Download page [spark-website]
dongjoon-hyun opened a new pull request, #500:
URL: https://github.com/apache/spark-website/pull/500
(spark) branch master updated: [SPARK-46687][TESTS][PYTHON][FOLLOW-UP] Skip MemoryProfilerParityTests when codecov enabled
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new d72efc038124 [SPARK-46687][TESTS][PYTHON][FOLLOW-UP] Skip MemoryProfilerParityTests when codecov enabled
d72efc038124 is described below

commit d72efc0381246370d3efbcd045637dd85ebfcd8f
Author: Hyukjin Kwon
AuthorDate: Thu Feb 15 14:49:36 2024 +0900

    [SPARK-46687][TESTS][PYTHON][FOLLOW-UP] Skip MemoryProfilerParityTests when codecov enabled

    ### What changes were proposed in this pull request?
    This is a follow-up of https://github.com/apache/spark/pull/44775 that skips the tests when codecov is on. They fail now (https://github.com/apache/spark/actions/runs/7709423681/job/21010676103), and the coverage report is broken.

    ### Why are the changes needed?
    To recover the test coverage report.

    ### Does this PR introduce _any_ user-facing change?
    No, test-only.

    ### How was this patch tested?
    Manually tested.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #45112 from HyukjinKwon/SPARK-46687-followup.

Authored-by: Hyukjin Kwon
Signed-off-by: Hyukjin Kwon
---
 python/pyspark/tests/test_memory_profiler.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/python/pyspark/tests/test_memory_profiler.py b/python/pyspark/tests/test_memory_profiler.py
index 3af35a7b43ca..ac3dc34d3474 100644
--- a/python/pyspark/tests/test_memory_profiler.py
+++ b/python/pyspark/tests/test_memory_profiler.py
@@ -203,6 +203,9 @@ class MemoryProfilerTests(PySparkTestCase):
         df.mapInPandas(map, schema=df.schema).collect()
 
 
+@unittest.skipIf(
+    "COVERAGE_PROCESS_START" in os.environ, "Fails with coverage enabled, skipping for now."
+)
 @unittest.skipIf(not has_memory_profiler, "Must have memory-profiler installed.")
 class MemoryProfiler2TestsMixin:
     @contextmanager
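The patch above gates a whole test class on an environment variable with `@unittest.skipIf`, which is evaluated once at decoration time. A minimal standalone sketch of the same mechanism (the suite and test names here are made up for illustration; the real patch applies the decorator to a mixin class):

```python
import os
import unittest

class ExampleSuite(unittest.TestCase):
    # Skip when coverage tracking is active, mirroring the
    # COVERAGE_PROCESS_START gate used in test_memory_profiler.py.
    @unittest.skipIf(
        "COVERAGE_PROCESS_START" in os.environ,
        "Fails with coverage enabled, skipping for now.",
    )
    def test_addition(self):
        self.assertEqual(1 + 1, 2)

suite = unittest.TestLoader().loadTestsFromTestCase(ExampleSuite)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True whether the test ran or was skipped
```

A skipped test still counts as a successful run, which is why gating (rather than deleting) the suite keeps the CI green while the coverage issue is investigated.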
(spark) branch branch-3.5 updated: [SPARK-46906][INFRA][3.5] Bump python libraries (pandas, pyarrow) in Docker image for release script
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new 9b4778fc1dc7 [SPARK-46906][INFRA][3.5] Bump python libraries (pandas, pyarrow) in Docker image for release script
9b4778fc1dc7 is described below

commit 9b4778fc1dc7047635c9ec19c187d4e75d182590
Author: Jungtaek Lim
AuthorDate: Thu Feb 15 14:49:09 2024 +0900

    [SPARK-46906][INFRA][3.5] Bump python libraries (pandas, pyarrow) in Docker image for release script

    ### What changes were proposed in this pull request?
    This PR proposes to bump python libraries (pandas to 2.0.3, pyarrow to 4.0.0) in the Docker image for the release script.

    ### Why are the changes needed?
    Without this change, the release script (do-release-docker.sh) fails in the docs phase. Changing this fixes the release process against branch-3.5.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Confirmed with a dry run of the release script against branch-3.5:
    `dev/create-release/do-release-docker.sh -d ~/spark-release -n -s docs`
    ```
    Generating HTML files for SQL API documentation.
    INFO    -  Cleaning site directory
    INFO    -  Building documentation to directory: /opt/spark-rm/output/spark/sql/site
    INFO    -  Documentation built in 0.85 seconds
    /opt/spark-rm/output/spark/sql
    Moving back into docs dir.
    Making directory api/sql
    cp -r ../sql/site/. api/sql
    Source: /opt/spark-rm/output/spark/docs
    Destination: /opt/spark-rm/output/spark/docs/_site
    Incremental build: disabled. Enable with --incremental
    Generating...
    done in 7.469 seconds.
    Auto-regeneration: disabled. Use --watch to enable.
    ```

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #45111 from HeartSaVioR/SPARK-46906-3.5.

Authored-by: Jungtaek Lim
Signed-off-by: Jungtaek Lim
---
 dev/create-release/spark-rm/Dockerfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/create-release/spark-rm/Dockerfile b/dev/create-release/spark-rm/Dockerfile
index cd57226f5e01..789915d018de 100644
--- a/dev/create-release/spark-rm/Dockerfile
+++ b/dev/create-release/spark-rm/Dockerfile
@@ -42,7 +42,7 @@ ARG APT_INSTALL="apt-get install --no-install-recommends -y"
 # We should use the latest Sphinx version once this is fixed.
 # TODO(SPARK-35375): Jinja2 3.0.0+ causes error when building with Sphinx.
 # See also https://issues.apache.org/jira/browse/SPARK-35375.
-ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.1.2 numpy==1.20.3 pydata_sphinx_theme==0.8.0 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 jinja2==2.11.3 twine==3.4.1 sphinx-plotly-directive==0.1.3 sphinx-copybutton==0.5.2 pandas==1.5.3 pyarrow==3.0.0 plotly==5.4.0 markupsafe==2.0.1 docutils<0.17 grpcio==1.56.0 protobuf==4.21.6 grpcio-status==1.56.0 googleapis-common-protos==1.56.4"
+ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.1.2 numpy==1.20.3 pydata_sphinx_theme==0.8.0 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 jinja2==2.11.3 twine==3.4.1 sphinx-plotly-directive==0.1.3 sphinx-copybutton==0.5.2 pandas==2.0.3 pyarrow==4.0.0 plotly==5.4.0 markupsafe==2.0.1 docutils<0.17 grpcio==1.56.0 protobuf==4.21.6 grpcio-status==1.56.0 googleapis-common-protos==1.56.4"
 ARG GEM_PKGS="bundler:2.3.8"
 
 # Install extra needed repos and refresh.
(spark) branch master updated: [SPARK-47051][INFRA] Create a new test pipeline for `yarn` and `connect`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 1b48de5606fb [SPARK-47051][INFRA] Create a new test pipeline for `yarn` and `connect`
1b48de5606fb is described below

commit 1b48de5606fbdb26b4459dee0aa94be6560ef14a
Author: Dongjoon Hyun
AuthorDate: Wed Feb 14 21:12:26 2024 -0800

    [SPARK-47051][INFRA] Create a new test pipeline for `yarn` and `connect`

    ### What changes were proposed in this pull request?
    This PR aims to spin off `yarn` and `connect` into a new test pipeline in order:
    - To stabilize the CI further by off-loading these modules.
    - To re-trigger the pipeline easily in case of failures.
    - To isolate `yarn` module changes and avoid triggering other modules' tests, like the Kafka module's.
    - To isolate `connect` module changes and avoid triggering other modules' tests, like the Kafka module's.

    ### Why are the changes needed?
    These two modules are known to be flaky in various GitHub Action CI pipelines.
    - https://github.com/apache/spark/actions/runs/7905202256/job/21577289425 (`YarnClusterSuite`)
    - https://github.com/apache/spark/actions/runs/7905202256/job/21585092863 (`SparkSessionE2ESuite`)
    - https://github.com/apache/spark/actions/runs/7828944523/job/21359886644 (`SparkSessionE2ESuite`)
    - https://github.com/apache/spark/actions/runs/7795415730/job/21258341216 (`SparkSessionE2ESuite`)
    - https://github.com/apache/spark/actions/runs/7858754074/job/21444107806 (`SparkSessionE2ESuite`)
    - https://github.com/apache/spark/actions/runs/7879934827/job/21501133320 (`SparkConnectServiceSuite`)

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Pass the CIs in this PR.
    ![Screenshot 2024-02-14 at 15 44 56](https://github.com/apache/spark/assets/9700541/6e735420-914d-44d3-b037-112c3e98d0e6)

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #45107 from dongjoon-hyun/SPARK-47051.

Authored-by: Dongjoon Hyun
Signed-off-by: Dongjoon Hyun
---
 .github/workflows/build_and_test.yml | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 1d98727a4231..43903d139d1f 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -147,8 +147,9 @@ jobs:
             mllib-local, mllib, graphx
           - >-
             streaming, sql-kafka-0-10, streaming-kafka-0-10, streaming-kinesis-asl,
-            yarn, kubernetes, hadoop-cloud, spark-ganglia-lgpl,
-            connect, protobuf
+            kubernetes, hadoop-cloud, spark-ganglia-lgpl, protobuf
+          - >-
+            yarn, connect
         # Here, we split Hive and SQL tests into some of slow ones and the rest of them.
         included-tags: [""]
         excluded-tags: [""]
(spark) branch branch-3.5 updated: Revert "[SPARK-45396][PYTHON] Add doc entry for `pyspark.ml.connect` module, and adds `Evaluator` to `__all__` at `ml.connect`"
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new ea6b25767fb8 Revert "[SPARK-45396][PYTHON] Add doc entry for `pyspark.ml.connect` module, and adds `Evaluator` to `__all__` at `ml.connect`"
ea6b25767fb8 is described below

commit ea6b25767fb86732c108c759fd5393caee22f129
Author: Hyukjin Kwon
AuthorDate: Thu Feb 15 09:20:57 2024 +0900

    Revert "[SPARK-45396][PYTHON] Add doc entry for `pyspark.ml.connect` module, and adds `Evaluator` to `__all__` at `ml.connect`"

    This reverts commit 35b627a934b1ab28be7d6ba88fdad63dc129525a.
---
 python/docs/source/reference/index.rst             |   1 -
 .../docs/source/reference/pyspark.ml.connect.rst   | 122 ---------------------
 python/pyspark/ml/connect/__init__.py              |   3 +-
 3 files changed, 1 insertion(+), 125 deletions(-)

diff --git a/python/docs/source/reference/index.rst b/python/docs/source/reference/index.rst
index 6330636839cd..ed3eb4d07dac 100644
--- a/python/docs/source/reference/index.rst
+++ b/python/docs/source/reference/index.rst
@@ -31,7 +31,6 @@ Pandas API on Spark follows the API specifications of latest pandas release.
     pyspark.pandas/index
     pyspark.ss/index
     pyspark.ml
-    pyspark.ml.connect
     pyspark.streaming
     pyspark.mllib
     pyspark

diff --git a/python/docs/source/reference/pyspark.ml.connect.rst b/python/docs/source/reference/pyspark.ml.connect.rst
deleted file mode 100644
index 1a3e6a593980..
--- a/python/docs/source/reference/pyspark.ml.connect.rst
+++ /dev/null
@@ -1,122 +0,0 @@
-..  Licensed to the Apache Software Foundation (ASF) under one
-    or more contributor license agreements.  See the NOTICE file
-    distributed with this work for additional information
-    regarding copyright ownership.  The ASF licenses this file
-    to you under the Apache License, Version 2.0 (the
-    "License"); you may not use this file except in compliance
-    with the License.  You may obtain a copy of the License at
-
-..    http://www.apache.org/licenses/LICENSE-2.0
-
-..  Unless required by applicable law or agreed to in writing,
-    software distributed under the License is distributed on an
-    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-    KIND, either express or implied.  See the License for the
-    specific language governing permissions and limitations
-    under the License.
-
-
-MLlib (DataFrame-based) for Spark Connect
-=========================================
-
-.. warning::
-    The namespace for this package can change in the future Spark version.
-
-
-Pipeline APIs
--------------
-
-.. currentmodule:: pyspark.ml.connect
-
-.. autosummary::
-    :template: autosummary/class_with_docs.rst
-    :toctree: api/
-
-    Transformer
-    Estimator
-    Model
-    Evaluator
-    Pipeline
-    PipelineModel
-
-
-Feature
--------
-
-.. currentmodule:: pyspark.ml.connect.feature
-
-.. autosummary::
-    :template: autosummary/class_with_docs.rst
-    :toctree: api/
-
-    MaxAbsScaler
-    MaxAbsScalerModel
-    StandardScaler
-    StandardScalerModel
-
-
-Classification
---------------
-
-.. currentmodule:: pyspark.ml.connect.classification
-
-.. autosummary::
-    :template: autosummary/class_with_docs.rst
-    :toctree: api/
-
-    LogisticRegression
-    LogisticRegressionModel
-
-
-Functions
----------
-
-.. currentmodule:: pyspark.ml.connect.functions
-
-.. autosummary::
-    :toctree: api/
-
-    array_to_vector
-    vector_to_array
-
-
-Tuning
-------
-
-.. currentmodule:: pyspark.ml.connect.tuning
-
-.. autosummary::
-    :template: autosummary/class_with_docs.rst
-    :toctree: api/
-
-    CrossValidator
-    CrossValidatorModel
-
-
-Evaluation
-----------
-
-.. currentmodule:: pyspark.ml.connect.evaluation
-
-.. autosummary::
-    :template: autosummary/class_with_docs.rst
-    :toctree: api/
-
-    RegressionEvaluator
-    BinaryClassificationEvaluator
-    MulticlassClassificationEvaluator
-
-
-Utilities
----------
-
-.. currentmodule:: pyspark.ml.connect.io_utils
-
-.. autosummary::
-    :template: autosummary/class_with_docs.rst
-    :toctree: api/
-
-    ParamsReadWrite
-    CoreModelReadWrite
-    MetaAlgorithmReadWrite
-

diff --git a/python/pyspark/ml/connect/__init__.py b/python/pyspark/ml/connect/__init__.py
index e6115a62ccfe..2ee152f6a38a 100644
--- a/python/pyspark/ml/connect/__init__.py
+++ b/python/pyspark/ml/connect/__init__.py
@@ -28,14 +28,13 @@ from pyspark.ml.connect import (
     evaluation,
     tuning,
 )
-from pyspark.ml.connect.evaluation import Evaluator
 from pyspark.ml.connect.pipeline import Pipeline, PipelineModel
 
 __all__ = [
     "Estimator",
     "Transformer",
-    "Evaluator",
+    "Estimator",
     "Model",
     "feature",
     "evaluation",
(spark) branch master updated: [SPARK-47049][BUILD] Ban non-shaded Hadoop dependencies
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 736d8ab3f00e [SPARK-47049][BUILD] Ban non-shaded Hadoop dependencies
736d8ab3f00e is described below

commit 736d8ab3f00e7c5ba1b01c22f6398b636b8492ea
Author: Dongjoon Hyun
AuthorDate: Wed Feb 14 14:30:40 2024 -0800

    [SPARK-47049][BUILD] Ban non-shaded Hadoop dependencies

    ### What changes were proposed in this pull request?
    This PR aims to ban non-shaded Hadoop dependencies (including transitive ones).

    ### Why are the changes needed?
    SPARK-33212 moved to shaded Hadoop dependencies in Apache Spark 3.2.0. This PR makes sure that we don't have any accidental leftovers.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Pass the CIs.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #45106 from dongjoon-hyun/SPARK-47049.

Authored-by: Dongjoon Hyun
Signed-off-by: Dongjoon Hyun
---
 pom.xml | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/pom.xml b/pom.xml
index 0b6a6955b18b..b83378af30ff 100644
--- a/pom.xml
+++ b/pom.xml
@@ -2869,6 +2869,10 @@
+                  <exclude>org.apache.hadoop:hadoop-common</exclude>
+                  <exclude>org.apache.hadoop:hadoop-hdfs-client</exclude>
+                  <exclude>org.apache.hadoop:hadoop-mapreduce-client-core</exclude>
+                  <exclude>org.apache.hadoop:hadoop-mapreduce-client-jobclient</exclude>
                   <exclude>org.jboss.netty</exclude>
                   <exclude>org.codehaus.groovy</exclude>
                   <exclude>*:*_2.12</exclude>
(spark) branch master updated: [SPARK-47014][PYTHON][CONNECT] Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession
This is an automated email from the ASF dual-hosted git repository.

xinrong pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 4b9e9d7a9b7c [SPARK-47014][PYTHON][CONNECT] Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession
4b9e9d7a9b7c is described below

commit 4b9e9d7a9b7c1b21c7d04cdf0095cc069a35b757
Author: Xinrong Meng
AuthorDate: Wed Feb 14 10:37:33 2024 -0800

    [SPARK-47014][PYTHON][CONNECT] Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession

    ### What changes were proposed in this pull request?
    Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession.

    ### Why are the changes needed?
    Complete support of (v2) SparkSession-based profiling.

    ### Does this PR introduce _any_ user-facing change?
    Yes. dumpPerfProfiles and dumpMemoryProfiles of SparkSession are supported. An example of dumpPerfProfiles is shown below.

    ```py
    >>> @udf("long")
    ... def add(x):
    ...     return x + 1
    ...
    >>> spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
    >>> spark.range(10).select(add("id")).collect()
    ...
    >>> spark.dumpPerfProfiles("dummy_dir")
    >>> os.listdir("dummy_dir")
    ['udf_2.pstats']
    ```

    ### How was this patch tested?
    Unit tests.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #45073 from xinrong-meng/dump_profile.

Authored-by: Xinrong Meng
Signed-off-by: Xinrong Meng
---
 python/pyspark/sql/connect/session.py         | 10 +
 python/pyspark/sql/profiler.py                | 65 +++
 python/pyspark/sql/session.py                 | 10 +
 python/pyspark/sql/tests/test_udf_profiler.py | 20 +
 python/pyspark/tests/test_memory_profiler.py  | 22 +
 5 files changed, 110 insertions(+), 17 deletions(-)

diff --git a/python/pyspark/sql/connect/session.py b/python/pyspark/sql/connect/session.py
index 9a678c28a6cc..764f71ccc415 100644
--- a/python/pyspark/sql/connect/session.py
+++ b/python/pyspark/sql/connect/session.py
@@ -958,6 +958,16 @@ class SparkSession:
     showMemoryProfiles.__doc__ = PySparkSession.showMemoryProfiles.__doc__
 
+    def dumpPerfProfiles(self, path: str, id: Optional[int] = None) -> None:
+        self._profiler_collector.dump_perf_profiles(path, id)
+
+    dumpPerfProfiles.__doc__ = PySparkSession.dumpPerfProfiles.__doc__
+
+    def dumpMemoryProfiles(self, path: str, id: Optional[int] = None) -> None:
+        self._profiler_collector.dump_memory_profiles(path, id)
+
+    dumpMemoryProfiles.__doc__ = PySparkSession.dumpMemoryProfiles.__doc__
+
 
 SparkSession.__doc__ = PySparkSession.__doc__

diff --git a/python/pyspark/sql/profiler.py b/python/pyspark/sql/profiler.py
index 565752197238..0db9d9b8b9b4 100644
--- a/python/pyspark/sql/profiler.py
+++ b/python/pyspark/sql/profiler.py
@@ -15,6 +15,7 @@
 # limitations under the License.
 #
 from abc import ABC, abstractmethod
+import os
 import pstats
 from threading import RLock
 from typing import Dict, Optional, TYPE_CHECKING
@@ -158,6 +159,70 @@ class ProfilerCollector(ABC):
         """
         ...
 
+    def dump_perf_profiles(self, path: str, id: Optional[int] = None) -> None:
+        """
+        Dump the perf profile results into directory `path`.
+
+        .. versionadded:: 4.0.0
+
+        Parameters
+        ----------
+        path: str
+            A directory in which to dump the perf profile.
+        id : int, optional
+            A UDF ID to be shown. If not specified, all the results will be shown.
+        """
+        with self._lock:
+            stats = self._perf_profile_results
+
+        def dump(id: int) -> None:
+            s = stats.get(id)
+
+            if s is not None:
+                if not os.path.exists(path):
+                    os.makedirs(path)
+                p = os.path.join(path, f"udf_{id}_perf.pstats")
+                s.dump_stats(p)
+
+        if id is not None:
+            dump(id)
+        else:
+            for id in sorted(stats.keys()):
+                dump(id)
+
+    def dump_memory_profiles(self, path: str, id: Optional[int] = None) -> None:
+        """
+        Dump the memory profile results into directory `path`.
+
+        .. versionadded:: 4.0.0
+
+        Parameters
+        ----------
+        path: str
+            A directory in which to dump the memory profile.
+        id : int, optional
+            A UDF ID to be shown. If not specified, all the results will be shown.
+        """
+        with self._lock:
+            code_map = self._memory_profile_results
+
+        def dump(id: int) -> None:
+            cm = code_map.get(id)
+
+            if cm is not None:
+                if not os.path.exists(path):
+                    os.makedirs(path)
+                p
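The dump logic added above boils down to writing each collected `pstats.Stats` into `<path>/udf_<id>_perf.pstats`, creating the directory on demand. A self-contained sketch of that pattern using only the standard library (the helper name and the hard-coded UDF ID are illustrative, not the pyspark API):

```python
import cProfile
import os
import pstats
import tempfile

def dump_stats_per_id(stats_by_id, path):
    """Write each pstats.Stats to `path/udf_<id>_perf.pstats`.

    A standalone sketch of the dump logic in the patch above; here the
    directory is created with exist_ok instead of an explicit exists check.
    """
    os.makedirs(path, exist_ok=True)
    for sid in sorted(stats_by_id):
        stats_by_id[sid].dump_stats(os.path.join(path, f"udf_{sid}_perf.pstats"))

# Profile a trivial workload to obtain a real Stats object.
profiler = cProfile.Profile()
profiler.enable()
sum(range(1000))
profiler.disable()

with tempfile.TemporaryDirectory() as d:
    dump_stats_per_id({2: pstats.Stats(profiler)}, d)
    files = os.listdir(d)
print(files)  # ['udf_2_perf.pstats']
```

Dumped `.pstats` files can later be reloaded with `pstats.Stats(path)` for offline analysis, which is the point of dumping rather than just printing the profiles.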
(spark) branch master updated (7e911cdd0344 -> c1321c01eeea)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 7e911cdd0344 [SPARK-47039][TESTS] Add a checkstyle rule to ban `commons-lang` in Java code
     add c1321c01eeea [SPARK-47038][BUILD] Remove shaded `protobuf-java` 2.6.1 dependency from `kinesis-asl-assembly`

No new revisions were added by this update.

Summary of changes:
 connector/kinesis-asl-assembly/pom.xml | 19 ---
 1 file changed, 19 deletions(-)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47039][TESTS] Add a checkstyle rule to ban `commons-lang` in Java code
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 7e911cdd0344 [SPARK-47039][TESTS] Add a checkstyle rule to ban `commons-lang` in Java code
7e911cdd0344 is described below

commit 7e911cdd0344f164cc6a2976fa832d50589b3a2c
Author: Dongjoon Hyun
AuthorDate: Wed Feb 14 09:41:09 2024 -0800

    [SPARK-47039][TESTS] Add a checkstyle rule to ban `commons-lang` in Java code

    ### What changes were proposed in this pull request?

    This PR aims to add a checkstyle rule to ban `commons-lang` in Java code in favor of `commons-lang3`.

    ### Why are the changes needed?

    SPARK-16129 banned `commons-lang` in Scala code since Apache Spark 2.0.0.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Pass the CIs.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #45097 from dongjoon-hyun/SPARK-47039.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 dev/checkstyle-suppressions.xml | 2 ++
 dev/checkstyle.xml              | 1 +
 2 files changed, 3 insertions(+)

diff --git a/dev/checkstyle-suppressions.xml b/dev/checkstyle-suppressions.xml
index 37c03759ad5e..7b20dfb6bce5 100644
--- a/dev/checkstyle-suppressions.xml
+++ b/dev/checkstyle-suppressions.xml
@@ -62,4 +62,6 @@
               files="sql/api/src/main/java/org/apache/spark/sql/streaming/Trigger.java"/>
+
diff --git a/dev/checkstyle.xml b/dev/checkstyle.xml
index 5af15318081a..b9997d2050d1 100644
--- a/dev/checkstyle.xml
+++ b/dev/checkstyle.xml
@@ -186,6 +186,7 @@
+
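The XML added by the `+` lines above was stripped during extraction, so the exact rule is not visible here. For context, one standard way to express such a package ban in checkstyle is the built-in `IllegalImport` check; treat the following as an illustrative sketch under that assumption, not the exact rule this commit added:

```xml
<!-- Illustrative sketch: reject imports from commons-lang 2.x.
     IllegalImport matches on the package prefix "org.apache.commons.lang.",
     so org.apache.commons.lang3 imports still pass. -->
<module name="IllegalImport">
    <property name="illegalPkgs" value="org.apache.commons.lang"/>
</module>
```

A per-file exception would then go into `dev/checkstyle-suppressions.xml` as a `<suppress checks="..." files="..."/>` entry, matching the suppressions pattern already visible in the diff above.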
(spark) branch master updated: [SPARK-46832][SQL] Introducing Collate and Collation expressions
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 861cca3da4c4 [SPARK-46832][SQL] Introducing Collate and Collation expressions
861cca3da4c4 is described below

commit 861cca3da4c446761ccff007c89b214a691b0a72
Author: Aleksandar Tomic
AuthorDate: Wed Feb 14 19:14:50 2024 +0300

    [SPARK-46832][SQL] Introducing Collate and Collation expressions

    ### What changes were proposed in this pull request?

    This PR adds E2E support for `collate` and `collation` expressions. The following changes were made to get us there:
    1) Set the right ordering for `PhysicalStringType` based on `collationId`.
    2) `UTF8String` is now just a data holder class - it no longer implements the `Comparable` interface. All comparisons must be done through `CollationFactory`.
    3) `collate` and `collation` expressions are added. Special syntax for `collate` is enabled - `'hello world' COLLATE 'target_collation'`.
    4) A first set of tests is added that covers both core expression and E2E collation tests.

    ### Why are the changes needed?

    This PR is part of the larger collation track. For more details, please refer to the design doc attached to the parent JIRA ticket.

    ### Does this PR introduce _any_ user-facing change?

    This PR adds two new expressions and opens up new syntax.

    ### How was this patch tested?

    Basic tests are added. In follow-up PRs we will add support for more advanced operators and keep adding tests alongside new feature support.

    ### Was this patch authored or co-authored using generative AI tooling?

    Yes.

    Closes #45064 from dbatomic/stringtype_compare.
    Lead-authored-by: Aleksandar Tomic
    Co-authored-by: Stefan Kandic
    Signed-off-by: Max Gekk
---
 .../spark/sql/catalyst/util/CollationFactory.java  |   5 +-
 .../org/apache/spark/unsafe/types/UTF8String.java  |  59 ++-
 .../apache/spark/unsafe/types/UTF8StringSuite.java |  24 +--
 .../types/UTF8StringPropertyCheckSuite.scala       |   2 +-
 .../spark/sql/catalyst/parser/SqlBaseParser.g4     |   1 +
 .../spark/sql/catalyst/encoders/RowEncoder.scala   |   2 +-
 .../org/apache/spark/sql/types/StringType.scala    |  23 ++-
 .../sql/catalyst/CatalystTypeConverters.scala      |   2 +-
 .../sql/catalyst/analysis/FunctionRegistry.scala   |   2 +
 .../spark/sql/catalyst/encoders/EncoderUtils.scala |   2 +-
 .../sql/catalyst/expressions/ToStringBase.scala    |   4 +-
 .../aggregate/BloomFilterAggregate.scala           |   4 +-
 .../expressions/codegen/CodeGenerator.scala        |  13 +-
 .../expressions/collationExpressions.scala         | 100
 .../spark/sql/catalyst/parser/AstBuilder.scala     |   8 +
 .../sql/catalyst/types/PhysicalDataType.scala      |   4 +-
 .../catalyst/expressions/CodeGenerationSuite.scala |   9 +-
 .../expressions/CollationExpressionSuite.scala     |  77 +
 .../apache/spark/sql/execution/HiveResult.scala    |   2 +-
 .../spark/sql/execution/columnar/ColumnStats.scala |   4 +-
 .../sql-functions/sql-expression-schema.md         |   2 +
 .../org/apache/spark/sql/CollationSuite.scala      | 177 +
 .../sql/expressions/ExpressionInfoSuite.scala      |   5 +-
 23 files changed, 484 insertions(+), 47 deletions(-)

diff --git a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
index 018fb6cbeb9f..83cac849e848 100644
--- a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
+++ b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
@@ -112,7 +112,7 @@ public final class CollationFactory {
     collationTable[0] = new Collation(
       "UCS_BASIC",
       null,
-      UTF8String::compareTo,
+      UTF8String::binaryCompare,
       "1.0",
       s -> (long)s.hashCode(),
       true);
@@ -122,7 +122,7 @@ public final class CollationFactory {
     collationTable[1] = new Collation(
       "UCS_BASIC_LCASE",
       null,
-      Comparator.comparing(UTF8String::toLowerCase),
+      (s1, s2) -> s1.toLowerCase().binaryCompare(s2.toLowerCase()),
       "1.0",
       (s) -> (long)s.toLowerCase().hashCode(),
       false);
@@ -132,7 +132,6 @@ public final class CollationFactory {
       "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);
     collationTable[2].collator.setStrength(Collator.TERTIARY);
-    // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
     collationTable[3] = new Collation(
       "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false);
diff --git a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java b/common/unsafe/src/main/java/o
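The core idea in this diff - routing all string comparison and hashing through a per-collation comparator pair instead of `UTF8String.compareTo` - can be sketched compactly. Below is a Python stand-in for the Java `CollationFactory`; the two collation names and their semantics mirror the diff, while the class layout and helper names are illustrative:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Collation:
    """Stand-in for CollationFactory.Collation: a name plus a comparator/hash pair."""
    name: str
    compare: Callable[[str, str], int]
    hash_of: Callable[[str], int]


def binary_compare(a: str, b: str) -> int:
    # Codepoint-wise three-way comparison, like UTF8String.binaryCompare.
    return (a > b) - (a < b)


UCS_BASIC = Collation("UCS_BASIC", binary_compare, hash)

# Case-insensitive variant: both the comparator and the hash function must
# agree, otherwise equal strings could land in different hash buckets.
UCS_BASIC_LCASE = Collation(
    "UCS_BASIC_LCASE",
    lambda s1, s2: binary_compare(s1.lower(), s2.lower()),
    lambda s: hash(s.lower()),
)

print(UCS_BASIC.compare("Hello", "hello"))        # -1: 'H' < 'h' by codepoint
print(UCS_BASIC_LCASE.compare("Hello", "hello"))  # 0: equal under this collation
```

The design point the sketch highlights is the one the commit message makes: once `UTF8String` is a pure data holder, every consumer (sorts, joins, hash aggregates) must look up the comparator and hash function for the column's `collationId` rather than calling a built-in ordering.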
(spark) branch master updated: [SPARK-47040][CONNECT] Allow Spark Connect Server Script to wait
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new a6bed5e9bcc5 [SPARK-47040][CONNECT] Allow Spark Connect Server Script to wait
a6bed5e9bcc5 is described below

commit a6bed5e9bcc54dac51421263d5ef73c0b6e0b12c
Author: Martin Grund
AuthorDate: Wed Feb 14 03:03:30 2024 -0800

    [SPARK-47040][CONNECT] Allow Spark Connect Server Script to wait

    ### What changes were proposed in this pull request?

    Add an option to the command line of `./sbin/start-connect-server.sh` that leaves it running in the foreground for easier debugging.

    ```
    ./sbin/start-connect-server.sh --wait
    ```

    ### Why are the changes needed?

    Usability.

    ### Does this PR introduce _any_ user-facing change?

    ### How was this patch tested?

    Manual.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #45090 from grundprinzip/start_server_wait.

    Authored-by: Martin Grund
    Signed-off-by: Dongjoon Hyun
---
 sbin/start-connect-server.sh | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/sbin/start-connect-server.sh b/sbin/start-connect-server.sh
index a347f43db8b1..fecda717eb34 100755
--- a/sbin/start-connect-server.sh
+++ b/sbin/start-connect-server.sh
@@ -38,4 +38,10 @@ fi
 . "${SPARK_HOME}/bin/load-spark-env.sh"

-exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Spark Connect server" "$@"
+if [ "$1" == "--wait" ]; then
+  shift
+  exec "${SPARK_HOME}"/bin/spark-submit --class $CLASS 1 --name "Spark Connect Server" "$@"
+else
+  exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Spark Connect server" "$@"
+fi
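The dispatch added above is a common one-flag shell pattern: consume a leading option with `shift`, then `exec` one of two launch paths with the remaining arguments forwarded unchanged. A minimal self-contained sketch of that pattern (echoing instead of exec'ing spark-submit or spark-daemon.sh, so it runs anywhere):

```shell
#!/usr/bin/env bash
# Sketch of the --wait dispatch in start-connect-server.sh:
# a leading --wait selects the foreground path; all remaining
# arguments are forwarded to the chosen launcher unchanged.
launch() {
  if [ "$1" = "--wait" ]; then
    shift
    echo "foreground: $*"
  else
    echo "daemon: $*"
  fi
}

launch --wait --conf spark.foo=bar   # prints "foreground: --conf spark.foo=bar"
launch --conf spark.foo=bar          # prints "daemon: --conf spark.foo=bar"
```

Because only a leading `--wait` is recognized, options that happen to appear later (for example as a config value) are passed through untouched, which keeps the flag handling from interfering with ordinary spark-submit arguments.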