(spark) branch master updated: [SPARK-47045][SQL] Replace `IllegalArgumentException` by `SparkIllegalArgumentException` in `sql/api`

2024-02-14 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9ef552691e1d [SPARK-47045][SQL] Replace `IllegalArgumentException` by 
`SparkIllegalArgumentException` in `sql/api`
9ef552691e1d is described below

commit 9ef552691e1d4725d7a64b45e6cdee9e5e75f992
Author: Max Gekk 
AuthorDate: Thu Feb 15 10:28:21 2024 +0300

[SPARK-47045][SQL] Replace `IllegalArgumentException` by 
`SparkIllegalArgumentException` in `sql/api`

### What changes were proposed in this pull request?
In this PR, I propose to replace all `IllegalArgumentException`s with 
`SparkIllegalArgumentException` in the `sql/api` code base, and to introduce new 
legacy error classes with the `_LEGACY_ERROR_TEMP_` prefix.
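
As a hedged illustration of the pattern (not an exact call site from this commit; the `errorClass`/`messageParameters` constructor shape follows Spark's error framework):

```
// Before: a plain Java exception with a hardcoded message.
throw new IllegalArgumentException(s"Unrecognized datetime pattern: $pattern")

// After: a Spark exception bound to an error class from error-classes.json.
throw new SparkIllegalArgumentException(
  errorClass = "_LEGACY_ERROR_TEMP_3256",
  messageParameters = Map("pattern" -> pattern))
```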

### Why are the changes needed?
To unify Spark SQL exceptions, and to port Java exceptions to Spark exceptions 
with error classes.

### Does this PR introduce _any_ user-facing change?
Yes, it can if a user's code assumes a particular format of 
`IllegalArgumentException` messages.
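
Since `SparkIllegalArgumentException` extends `IllegalArgumentException`, handlers that match on the exception type keep working; only code that parses the old message text can break. A minimal sketch, assuming `DataType.fromJson` (one of the call sites migrated here):

```
import org.apache.spark.sql.types.DataType

try {
  DataType.fromJson("\"invalid-type\"")  // now throws SparkIllegalArgumentException
} catch {
  case e: IllegalArgumentException =>
    // Still caught by type; the message now comes from an error-class
    // template (e.g. _LEGACY_ERROR_TEMP_3251), not the old hardcoded string.
    println(e.getMessage)
}
```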

### How was this patch tested?
By running existing test suites like:
```
$ build/sbt "core/testOnly *SparkThrowableSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #45098 from MaxGekk/migrate-IllegalArgumentException-sql.

Authored-by: Max Gekk 
Signed-off-by: Max Gekk 
---
 R/pkg/tests/fulltests/test_streaming.R |  3 +-
 .../src/main/resources/error/error-classes.json| 70 +++
 .../src/main/scala/org/apache/spark/sql/Row.scala  | 11 ++-
 .../catalyst/streaming/InternalOutputModes.scala   |  7 +-
 .../catalyst/util/DateTimeFormatterHelper.scala| 18 +++--
 .../sql/catalyst/util/SparkIntervalUtils.scala |  8 ++-
 .../sql/catalyst/util/TimestampFormatter.scala |  6 +-
 .../spark/sql/execution/streaming/Triggers.scala   |  5 +-
 .../org/apache/spark/sql/types/DataType.scala  | 19 ++---
 .../org/apache/spark/sql/types/StructType.scala| 25 ---
 .../results/datetime-formatting-invalid.sql.out| 81 +-
 .../org/apache/spark/sql/JsonFunctionsSuite.scala  | 13 ++--
 12 files changed, 206 insertions(+), 60 deletions(-)

diff --git a/R/pkg/tests/fulltests/test_streaming.R b/R/pkg/tests/fulltests/test_streaming.R
index 8804471e640c..67479726b57c 100644
--- a/R/pkg/tests/fulltests/test_streaming.R
+++ b/R/pkg/tests/fulltests/test_streaming.R
@@ -257,7 +257,8 @@ test_that("Trigger", {
                "Value for trigger.processingTime must be a non-empty string.")
 
   expect_error(write.stream(df, "memory", queryName = "times", outputMode = "append",
-                            trigger.processingTime = "invalid"), "illegal argument")
+                            trigger.processingTime = "invalid"),
+               "Error parsing 'invalid' to interval, unrecognized number 'invalid'")
 
   expect_error(write.stream(df, "memory", queryName = "times", outputMode = "append",
                             trigger.once = ""), "Value for trigger.once must be TRUE.")
diff --git a/common/utils/src/main/resources/error/error-classes.json b/common/utils/src/main/resources/error/error-classes.json
index 5884c9267119..38161ff87720 100644
--- a/common/utils/src/main/resources/error/error-classes.json
+++ b/common/utils/src/main/resources/error/error-classes.json
@@ -7767,6 +7767,76 @@
       "Single backslash is prohibited. It has special meaning as beginning of an escape sequence. To get the backslash character, pass a string with two backslashes as the delimiter."
     ]
   },
+  "_LEGACY_ERROR_TEMP_3249" : {
+    "message" : [
+      "Failed to convert value <value> (class of <class>}) with the type of <type> to JSON."
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3250" : {
+    "message" : [
+      "Failed to convert the JSON string '<other>' to a field."
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3251" : {
+    "message" : [
+      "Failed to convert the JSON string '<other>' to a data type."
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3252" : {
+    "message" : [
+      "<fieldName> does not exist. Available: <fields>"
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3253" : {
+    "message" : [
+      "<nonExistFields> do(es) not exist. Available: <fields>"
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3254" : {
+    "message" : [
+      "<fieldName> does not exist. Available: <fields>"
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3255" : {
+    "message" : [
+      "Error parsing '<input>' to interval, <msg>"
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3256" : {
+    "message" : [
+      "Unrecognized datetime pattern: <pattern>"
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3257" : {
+    "message" : [
+      "All week-based patterns are unsupported since Spark 3.0, detected: <c>, Please use the SQL function EXTRACT instead"
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3258" : {
+    "message" : [
+      "Illegal pattern character: <c>"
+    ]
+  },
+  "_LEGACY_ERROR_TEMP_3259" : {
+    "message" : [
+      "Too m

Re: [PR] Remove Apache Spark 3.3.4 EOL version from Download page [spark-website]

2024-02-14 Thread via GitHub


dongjoon-hyun merged PR #500:
URL: https://github.com/apache/spark-website/pull/500





Re: [PR] Remove Apache Spark 3.3.4 EOL version from Download page [spark-website]

2024-02-14 Thread via GitHub


dongjoon-hyun commented on PR #500:
URL: https://github.com/apache/spark-website/pull/500#issuecomment-1945479250

   Thank you, @viirya!





(spark-website) branch asf-site updated: Remove Apache Spark 3.3.4 EOL version from Download page (#500)

2024-02-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 07a8f9831c Remove Apache Spark 3.3.4 EOL version from Download page 
(#500)
07a8f9831c is described below

commit 07a8f9831c34c8056741cf8d58666a7408831259
Author: Dongjoon Hyun 
AuthorDate: Wed Feb 14 23:06:42 2024 -0800

Remove Apache Spark 3.3.4 EOL version from Download page (#500)
---
 js/downloads.js  | 4 
 site/js/downloads.js | 4 
 2 files changed, 8 deletions(-)

diff --git a/js/downloads.js b/js/downloads.js
index 6d3caff97b..2a5690c041 100644
--- a/js/downloads.js
+++ b/js/downloads.js
@@ -13,18 +13,14 @@ function addRelease(version, releaseDate, packages, mirrored) {
 
 var sources = {pretty: "Source Code", tag: "sources"};
 var hadoopFree = {pretty: "Pre-built with user-provided Apache Hadoop", tag: "without-hadoop"};
-var hadoop2p = {pretty: "Pre-built for Apache Hadoop 2.7", tag: "hadoop2"};
 var hadoop3p = {pretty: "Pre-built for Apache Hadoop 3.3 and later", tag: "hadoop3"};
 var hadoop3pscala213 = {pretty: "Pre-built for Apache Hadoop 3.3 and later (Scala 2.13)", tag: "hadoop3-scala2.13"};
 
-// 3.3.0+
-var packagesV13 = [hadoop3p, hadoop3pscala213, hadoop2p, hadoopFree, sources];
 // 3.4.0+
 var packagesV14 = [hadoop3p, hadoop3pscala213, hadoopFree, sources];
 
 addRelease("3.5.0", new Date("09/13/2023"), packagesV14, true);
 addRelease("3.4.2", new Date("11/30/2023"), packagesV14, true);
-addRelease("3.3.4", new Date("12/16/2023"), packagesV13, true);
 
 function append(el, contents) {
   el.innerHTML += contents;
diff --git a/site/js/downloads.js b/site/js/downloads.js
index 6d3caff97b..2a5690c041 100644
--- a/site/js/downloads.js
+++ b/site/js/downloads.js
@@ -13,18 +13,14 @@ function addRelease(version, releaseDate, packages, mirrored) {
 
 var sources = {pretty: "Source Code", tag: "sources"};
 var hadoopFree = {pretty: "Pre-built with user-provided Apache Hadoop", tag: "without-hadoop"};
-var hadoop2p = {pretty: "Pre-built for Apache Hadoop 2.7", tag: "hadoop2"};
 var hadoop3p = {pretty: "Pre-built for Apache Hadoop 3.3 and later", tag: "hadoop3"};
 var hadoop3pscala213 = {pretty: "Pre-built for Apache Hadoop 3.3 and later (Scala 2.13)", tag: "hadoop3-scala2.13"};
 
-// 3.3.0+
-var packagesV13 = [hadoop3p, hadoop3pscala213, hadoop2p, hadoopFree, sources];
 // 3.4.0+
 var packagesV14 = [hadoop3p, hadoop3pscala213, hadoopFree, sources];
 
 addRelease("3.5.0", new Date("09/13/2023"), packagesV14, true);
 addRelease("3.4.2", new Date("11/30/2023"), packagesV14, true);
-addRelease("3.3.4", new Date("12/16/2023"), packagesV13, true);
 
 function append(el, contents) {
   el.innerHTML += contents;





svn commit: r67353 - /release/spark/spark-3.3.4/

2024-02-14 Thread dongjoon
Author: dongjoon
Date: Thu Feb 15 06:27:50 2024
New Revision: 67353

Log:
Remove Apache Spark 3.3.4 because it reached the end of support

Removed:
release/spark/spark-3.3.4/






Re: [PR] Remove Apache Spark 3.3.4 EOL version from Download page [spark-website]

2024-02-14 Thread via GitHub


dongjoon-hyun commented on code in PR #500:
URL: https://github.com/apache/spark-website/pull/500#discussion_r1490473831


##
js/downloads.js:
##
@@ -13,18 +13,14 @@ function addRelease(version, releaseDate, packages, 
mirrored) {
 
 var sources = {pretty: "Source Code", tag: "sources"};
 var hadoopFree = {pretty: "Pre-built with user-provided Apache Hadoop", tag: 
"without-hadoop"};
-var hadoop2p = {pretty: "Pre-built for Apache Hadoop 2.7", tag: "hadoop2"};

Review Comment:
   From now on, Hadoop 2 is completely gone.






Re: [PR] Remove Apache Spark 3.3.4 EOL version from Download page [spark-website]

2024-02-14 Thread via GitHub


dongjoon-hyun commented on PR #500:
URL: https://github.com/apache/spark-website/pull/500#issuecomment-1945441937

   Hi, @HyukjinKwon. Could you review this website update PR?





[PR] Remove Apache Spark 3.3.4 EOL version from Download page [spark-website]

2024-02-14 Thread via GitHub


dongjoon-hyun opened a new pull request, #500:
URL: https://github.com/apache/spark-website/pull/500

   
   





(spark) branch master updated: [SPARK-46687][TESTS][PYTHON][FOLLOW-UP] Skip MemoryProfilerParityTests when codecov enabled

2024-02-14 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d72efc038124 [SPARK-46687][TESTS][PYTHON][FOLLOW-UP] Skip 
MemoryProfilerParityTests when codecov enabled
d72efc038124 is described below

commit d72efc0381246370d3efbcd045637dd85ebfcd8f
Author: Hyukjin Kwon 
AuthorDate: Thu Feb 15 14:49:36 2024 +0900

[SPARK-46687][TESTS][PYTHON][FOLLOW-UP] Skip MemoryProfilerParityTests when 
codecov enabled

### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/44775 that skips these tests when codecov is enabled. They currently fail (https://github.com/apache/spark/actions/runs/7709423681/job/21010676103) and the coverage report is broken.

### Why are the changes needed?

To recover the test coverage report.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45112 from HyukjinKwon/SPARK-46687-followup.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/tests/test_memory_profiler.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/python/pyspark/tests/test_memory_profiler.py b/python/pyspark/tests/test_memory_profiler.py
index 3af35a7b43ca..ac3dc34d3474 100644
--- a/python/pyspark/tests/test_memory_profiler.py
+++ b/python/pyspark/tests/test_memory_profiler.py
@@ -203,6 +203,9 @@ class MemoryProfilerTests(PySparkTestCase):
         df.mapInPandas(map, schema=df.schema).collect()
 
 
+@unittest.skipIf(
+    "COVERAGE_PROCESS_START" in os.environ, "Fails with coverage enabled, skipping for now."
+)
 @unittest.skipIf(not has_memory_profiler, "Must have memory-profiler installed.")
 class MemoryProfiler2TestsMixin:
     @contextmanager





(spark) branch branch-3.5 updated: [SPARK-46906][INFRA][3.5] Bump python libraries (pandas, pyarrow) in Docker image for release script

2024-02-14 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 9b4778fc1dc7 [SPARK-46906][INFRA][3.5] Bump python libraries (pandas, 
pyarrow) in Docker image for release script
9b4778fc1dc7 is described below

commit 9b4778fc1dc7047635c9ec19c187d4e75d182590
Author: Jungtaek Lim 
AuthorDate: Thu Feb 15 14:49:09 2024 +0900

[SPARK-46906][INFRA][3.5] Bump python libraries (pandas, pyarrow) in Docker 
image for release script

### What changes were proposed in this pull request?

This PR proposes to bump the Python libraries (pandas to 2.0.3, pyarrow to 4.0.0) in the Docker image for the release script.

### Why are the changes needed?

Without this change, the release script (do-release-docker.sh) fails in the docs phase. Changing this fixes the release process against branch-3.5.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed with dry-run of release script against branch-3.5.

`dev/create-release/do-release-docker.sh -d ~/spark-release -n -s docs`

```
Generating HTML files for SQL API documentation.
INFO-  Cleaning site directory
INFO-  Building documentation to directory: 
/opt/spark-rm/output/spark/sql/site
INFO-  Documentation built in 0.85 seconds
/opt/spark-rm/output/spark/sql
Moving back into docs dir.
Making directory api/sql
cp -r ../sql/site/. api/sql
Source: /opt/spark-rm/output/spark/docs
   Destination: /opt/spark-rm/output/spark/docs/_site
 Incremental build: disabled. Enable with --incremental
  Generating...
done in 7.469 seconds.
 Auto-regeneration: disabled. Use --watch to enable.
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45111 from HeartSaVioR/SPARK-46906-3.5.

Authored-by: Jungtaek Lim 
Signed-off-by: Jungtaek Lim 
---
 dev/create-release/spark-rm/Dockerfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/create-release/spark-rm/Dockerfile b/dev/create-release/spark-rm/Dockerfile
index cd57226f5e01..789915d018de 100644
--- a/dev/create-release/spark-rm/Dockerfile
+++ b/dev/create-release/spark-rm/Dockerfile
@@ -42,7 +42,7 @@ ARG APT_INSTALL="apt-get install --no-install-recommends -y"
 #   We should use the latest Sphinx version once this is fixed.
 # TODO(SPARK-35375): Jinja2 3.0.0+ causes error when building with Sphinx.
 #   See also https://issues.apache.org/jira/browse/SPARK-35375.
-ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.1.2 numpy==1.20.3 pydata_sphinx_theme==0.8.0 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 jinja2==2.11.3 twine==3.4.1 sphinx-plotly-directive==0.1.3 sphinx-copybutton==0.5.2 pandas==1.5.3 pyarrow==3.0.0 plotly==5.4.0 markupsafe==2.0.1 docutils<0.17 grpcio==1.56.0 protobuf==4.21.6 grpcio-status==1.56.0 googleapis-common-protos==1.56.4"
+ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.1.2 numpy==1.20.3 pydata_sphinx_theme==0.8.0 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 jinja2==2.11.3 twine==3.4.1 sphinx-plotly-directive==0.1.3 sphinx-copybutton==0.5.2 pandas==2.0.3 pyarrow==4.0.0 plotly==5.4.0 markupsafe==2.0.1 docutils<0.17 grpcio==1.56.0 protobuf==4.21.6 grpcio-status==1.56.0 googleapis-common-protos==1.56.4"
 ARG GEM_PKGS="bundler:2.3.8"
 
 # Install extra needed repos and refresh.





(spark) branch master updated: [SPARK-47051][INFRA] Create a new test pipeline for `yarn` and `connect`

2024-02-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1b48de5606fb [SPARK-47051][INFRA] Create a new test pipeline for 
`yarn` and `connect`
1b48de5606fb is described below

commit 1b48de5606fbdb26b4459dee0aa94be6560ef14a
Author: Dongjoon Hyun 
AuthorDate: Wed Feb 14 21:12:26 2024 -0800

[SPARK-47051][INFRA] Create a new test pipeline for `yarn` and `connect`

### What changes were proposed in this pull request?

This PR aims to spin off `yarn` and `connect` as a new test pipeline for the following reasons:
- To stabilize CI further by off-loading these modules.
- To re-trigger them easily in case of failures.
- To isolate `yarn` module changes and avoid triggering other modules' tests, such as the Kafka module's.
- To isolate `connect` module changes and avoid triggering other modules' tests, such as the Kafka module's.

### Why are the changes needed?

These two modules are known to be flaky in various GitHub Action CI 
pipelines.
- https://github.com/apache/spark/actions/runs/7905202256/job/21577289425 
(`YarnClusterSuite`)
- https://github.com/apache/spark/actions/runs/7905202256/job/21585092863 
(`SparkSessionE2ESuite`)
- https://github.com/apache/spark/actions/runs/7828944523/job/21359886644 
(`SparkSessionE2ESuite`)
- https://github.com/apache/spark/actions/runs/7795415730/job/21258341216 
(`SparkSessionE2ESuite`)
- https://github.com/apache/spark/actions/runs/7858754074/job/21444107806 
(`SparkSessionE2ESuite`)
- https://github.com/apache/spark/actions/runs/7879934827/job/21501133320 
(`SparkConnectServiceSuite`)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs in this PR.

![Screenshot 2024-02-14 at 15 44 
56](https://github.com/apache/spark/assets/9700541/6e735420-914d-44d3-b037-112c3e98d0e6)

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45107 from dongjoon-hyun/SPARK-47051.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 1d98727a4231..43903d139d1f 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -147,8 +147,9 @@ jobs:
             mllib-local, mllib, graphx
           - >-
             streaming, sql-kafka-0-10, streaming-kafka-0-10, streaming-kinesis-asl,
-            yarn, kubernetes, hadoop-cloud, spark-ganglia-lgpl,
-            connect, protobuf
+            kubernetes, hadoop-cloud, spark-ganglia-lgpl, protobuf
+          - >-
+            yarn, connect
         # Here, we split Hive and SQL tests into some of slow ones and the rest of them.
         included-tags: [""]
         excluded-tags: [""]





(spark) branch branch-3.5 updated: Revert "[SPARK-45396][PYTHON] Add doc entry for `pyspark.ml.connect` module, and adds `Evaluator` to `__all__` at `ml.connect`"

2024-02-14 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new ea6b25767fb8 Revert "[SPARK-45396][PYTHON] Add doc entry for 
`pyspark.ml.connect` module, and adds `Evaluator` to `__all__` at `ml.connect`"
ea6b25767fb8 is described below

commit ea6b25767fb86732c108c759fd5393caee22f129
Author: Hyukjin Kwon 
AuthorDate: Thu Feb 15 09:20:57 2024 +0900

Revert "[SPARK-45396][PYTHON] Add doc entry for `pyspark.ml.connect` 
module, and adds `Evaluator` to `__all__` at `ml.connect`"

This reverts commit 35b627a934b1ab28be7d6ba88fdad63dc129525a.
---
 python/docs/source/reference/index.rst |   1 -
 .../docs/source/reference/pyspark.ml.connect.rst   | 122 -
 python/pyspark/ml/connect/__init__.py  |   3 +-
 3 files changed, 1 insertion(+), 125 deletions(-)

diff --git a/python/docs/source/reference/index.rst b/python/docs/source/reference/index.rst
index 6330636839cd..ed3eb4d07dac 100644
--- a/python/docs/source/reference/index.rst
+++ b/python/docs/source/reference/index.rst
@@ -31,7 +31,6 @@ Pandas API on Spark follows the API specifications of latest pandas release.
    pyspark.pandas/index
    pyspark.ss/index
    pyspark.ml
-   pyspark.ml.connect
    pyspark.streaming
    pyspark.mllib
    pyspark
diff --git a/python/docs/source/reference/pyspark.ml.connect.rst b/python/docs/source/reference/pyspark.ml.connect.rst
deleted file mode 100644
index 1a3e6a593980..
--- a/python/docs/source/reference/pyspark.ml.connect.rst
+++ /dev/null
@@ -1,122 +0,0 @@
-..  Licensed to the Apache Software Foundation (ASF) under one
-    or more contributor license agreements.  See the NOTICE file
-    distributed with this work for additional information
-    regarding copyright ownership.  The ASF licenses this file
-    to you under the Apache License, Version 2.0 (the
-    "License"); you may not use this file except in compliance
-    with the License.  You may obtain a copy of the License at
-
-..    http://www.apache.org/licenses/LICENSE-2.0
-
-..  Unless required by applicable law or agreed to in writing,
-    software distributed under the License is distributed on an
-    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-    KIND, either express or implied.  See the License for the
-    specific language governing permissions and limitations
-    under the License.
-
-
-MLlib (DataFrame-based) for Spark Connect
-=========================================
-
-.. warning::
-    The namespace for this package can change in the future Spark version.
-
-
-Pipeline APIs
--------------
-
-.. currentmodule:: pyspark.ml.connect
-
-.. autosummary::
-    :template: autosummary/class_with_docs.rst
-    :toctree: api/
-
-    Transformer
-    Estimator
-    Model
-    Evaluator
-    Pipeline
-    PipelineModel
-
-
-Feature
--------
-
-.. currentmodule:: pyspark.ml.connect.feature
-
-.. autosummary::
-    :template: autosummary/class_with_docs.rst
-    :toctree: api/
-
-    MaxAbsScaler
-    MaxAbsScalerModel
-    StandardScaler
-    StandardScalerModel
-
-
-Classification
---------------
-
-.. currentmodule:: pyspark.ml.connect.classification
-
-.. autosummary::
-    :template: autosummary/class_with_docs.rst
-    :toctree: api/
-
-    LogisticRegression
-    LogisticRegressionModel
-
-
-Functions
----------
-
-.. currentmodule:: pyspark.ml.connect.functions
-
-.. autosummary::
-    :toctree: api/
-
-    array_to_vector
-    vector_to_array
-
-
-Tuning
-------
-
-.. currentmodule:: pyspark.ml.connect.tuning
-
-.. autosummary::
-    :template: autosummary/class_with_docs.rst
-    :toctree: api/
-
-    CrossValidator
-    CrossValidatorModel
-
-
-Evaluation
-----------
-
-.. currentmodule:: pyspark.ml.connect.evaluation
-
-.. autosummary::
-    :template: autosummary/class_with_docs.rst
-    :toctree: api/
-
-    RegressionEvaluator
-    BinaryClassificationEvaluator
-    MulticlassClassificationEvaluator
-
-
-Utilities
----------
-
-.. currentmodule:: pyspark.ml.connect.io_utils
-
-.. autosummary::
-    :template: autosummary/class_with_docs.rst
-    :toctree: api/
-
-    ParamsReadWrite
-    CoreModelReadWrite
-    MetaAlgorithmReadWrite
-
diff --git a/python/pyspark/ml/connect/__init__.py b/python/pyspark/ml/connect/__init__.py
index e6115a62ccfe..2ee152f6a38a 100644
--- a/python/pyspark/ml/connect/__init__.py
+++ b/python/pyspark/ml/connect/__init__.py
@@ -28,14 +28,13 @@ from pyspark.ml.connect import (
     evaluation,
     tuning,
 )
-from pyspark.ml.connect.evaluation import Evaluator
 
 from pyspark.ml.connect.pipeline import Pipeline, PipelineModel
 
 __all__ = [
     "Estimator",
     "Transformer",
-    "Evaluator",
+    "Estimator",
     "Model",
     "feature",
     "evaluation",



(spark) branch master updated: [SPARK-47049][BUILD] Ban non-shaded Hadoop dependencies

2024-02-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 736d8ab3f00e [SPARK-47049][BUILD] Ban non-shaded Hadoop dependencies
736d8ab3f00e is described below

commit 736d8ab3f00e7c5ba1b01c22f6398b636b8492ea
Author: Dongjoon Hyun 
AuthorDate: Wed Feb 14 14:30:40 2024 -0800

[SPARK-47049][BUILD] Ban non-shaded Hadoop dependencies

### What changes were proposed in this pull request?

This PR aims to ban `non-shaded` Hadoop dependencies (including transitive 
ones).

### Why are the changes needed?

SPARK-33212 moved to shaded Hadoop dependencies in Apache Spark 3.2.0. This PR makes sure that we don't have any accidental leftovers.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45106 from dongjoon-hyun/SPARK-47049.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 pom.xml | 4 
 1 file changed, 4 insertions(+)

diff --git a/pom.xml b/pom.xml
index 0b6a6955b18b..b83378af30ff 100644
--- a/pom.xml
+++ b/pom.xml
@@ -2869,6 +2869,10 @@
   
   
 
+                  <exclude>org.apache.hadoop:hadoop-common</exclude>
+                  <exclude>org.apache.hadoop:hadoop-hdfs-client</exclude>
+                  <exclude>org.apache.hadoop:hadoop-mapreduce-client-core</exclude>
+                  <exclude>org.apache.hadoop:hadoop-mapreduce-client-jobclient</exclude>
                   <exclude>org.jboss.netty</exclude>
                   <exclude>org.codehaus.groovy</exclude>
                   <exclude>*:*_2.12</exclude>





(spark) branch master updated: [SPARK-47014][PYTHON][CONNECT] Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession

2024-02-14 Thread xinrong
This is an automated email from the ASF dual-hosted git repository.

xinrong pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4b9e9d7a9b7c [SPARK-47014][PYTHON][CONNECT] Implement methods 
dumpPerfProfiles and dumpMemoryProfiles of SparkSession
4b9e9d7a9b7c is described below

commit 4b9e9d7a9b7c1b21c7d04cdf0095cc069a35b757
Author: Xinrong Meng 
AuthorDate: Wed Feb 14 10:37:33 2024 -0800

[SPARK-47014][PYTHON][CONNECT] Implement methods dumpPerfProfiles and 
dumpMemoryProfiles of SparkSession

### What changes were proposed in this pull request?
Implement methods dumpPerfProfiles and dumpMemoryProfiles of SparkSession

### Why are the changes needed?
Complete support of (v2) SparkSession-based profiling.

### Does this PR introduce _any_ user-facing change?
Yes. dumpPerfProfiles and dumpMemoryProfiles of SparkSession are supported.

An example of dumpPerfProfiles is shown below.

```py
>>> udf("long")
... def add(x):
...   return x + 1
...
>>> spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
>>> spark.range(10).select(add("id")).collect()
...
>>> spark.dumpPerfProfiles("dummy_dir")
>>> os.listdir("dummy_dir")
['udf_2.pstats']
```

### How was this patch tested?
Unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #45073 from xinrong-meng/dump_profile.

Authored-by: Xinrong Meng 
Signed-off-by: Xinrong Meng 
---
 python/pyspark/sql/connect/session.py | 10 +
 python/pyspark/sql/profiler.py| 65 +++
 python/pyspark/sql/session.py | 10 +
 python/pyspark/sql/tests/test_udf_profiler.py | 20 +
 python/pyspark/tests/test_memory_profiler.py  | 22 +
 5 files changed, 110 insertions(+), 17 deletions(-)

diff --git a/python/pyspark/sql/connect/session.py b/python/pyspark/sql/connect/session.py
index 9a678c28a6cc..764f71ccc415 100644
--- a/python/pyspark/sql/connect/session.py
+++ b/python/pyspark/sql/connect/session.py
@@ -958,6 +958,16 @@ class SparkSession:
 
     showMemoryProfiles.__doc__ = PySparkSession.showMemoryProfiles.__doc__
 
+    def dumpPerfProfiles(self, path: str, id: Optional[int] = None) -> None:
+        self._profiler_collector.dump_perf_profiles(path, id)
+
+    dumpPerfProfiles.__doc__ = PySparkSession.dumpPerfProfiles.__doc__
+
+    def dumpMemoryProfiles(self, path: str, id: Optional[int] = None) -> None:
+        self._profiler_collector.dump_memory_profiles(path, id)
+
+    dumpMemoryProfiles.__doc__ = PySparkSession.dumpMemoryProfiles.__doc__
+
 
 SparkSession.__doc__ = PySparkSession.__doc__
 
diff --git a/python/pyspark/sql/profiler.py b/python/pyspark/sql/profiler.py
index 565752197238..0db9d9b8b9b4 100644
--- a/python/pyspark/sql/profiler.py
+++ b/python/pyspark/sql/profiler.py
@@ -15,6 +15,7 @@
 # limitations under the License.
 #
 from abc import ABC, abstractmethod
+import os
 import pstats
 from threading import RLock
 from typing import Dict, Optional, TYPE_CHECKING
@@ -158,6 +159,70 @@ class ProfilerCollector(ABC):
 """
 ...
 
+def dump_perf_profiles(self, path: str, id: Optional[int] = None) -> None:
+"""
+Dump the perf profile results into directory `path`.
+
+.. versionadded:: 4.0.0
+
+Parameters
+--
+path: str
+A directory in which to dump the perf profile.
+id : int, optional
+A UDF ID to be shown. If not specified, all the results will be 
shown.
+"""
+with self._lock:
+stats = self._perf_profile_results
+
+def dump(id: int) -> None:
+s = stats.get(id)
+
+if s is not None:
+if not os.path.exists(path):
+os.makedirs(path)
+p = os.path.join(path, f"udf_{id}_perf.pstats")
+s.dump_stats(p)
+
+if id is not None:
+dump(id)
+else:
+for id in sorted(stats.keys()):
+dump(id)
+
+def dump_memory_profiles(self, path: str, id: Optional[int] = None) -> 
None:
+"""
+Dump the memory profile results into directory `path`.
+
+.. versionadded:: 4.0.0
+
+Parameters
+--
+path: str
+A directory in which to dump the memory profile.
+id : int, optional
+A UDF ID to be shown. If not specified, all the results will be 
shown.
+"""
+with self._lock:
+code_map = self._memory_profile_results
+
+def dump(id: int) -> None:
+cm = code_map.get(id)
+
+if cm is not None:
+if not os.path.exists(path):
+os.makedirs(path)
+p 

(spark) branch master updated (7e911cdd0344 -> c1321c01eeea)

2024-02-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 7e911cdd0344 [SPARK-47039][TESTS] Add a checkstyle rule to ban 
`commons-lang` in Java code
 add c1321c01eeea [SPARK-47038][BUILD] Remove shaded `protobuf-java` 2.6.1 
dependency from `kinesis-asl-assembly`

No new revisions were added by this update.

Summary of changes:
 connector/kinesis-asl-assembly/pom.xml | 19 ---
 1 file changed, 19 deletions(-)





(spark) branch master updated: [SPARK-47039][TESTS] Add a checkstyle rule to ban `commons-lang` in Java code

2024-02-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7e911cdd0344 [SPARK-47039][TESTS] Add a checkstyle rule to ban 
`commons-lang` in Java code
7e911cdd0344 is described below

commit 7e911cdd0344f164cc6a2976fa832d50589b3a2c
Author: Dongjoon Hyun 
AuthorDate: Wed Feb 14 09:41:09 2024 -0800

[SPARK-47039][TESTS] Add a checkstyle rule to ban `commons-lang` in Java 
code

### What changes were proposed in this pull request?

This PR aims to add a checkstyle rule to ban `commons-lang` in Java code in 
favor of `commons-lang3`.

### Why are the changes needed?

SPARK-16129 has banned `commons-lang` in Scala code since Apache Spark 2.0.0.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45097 from dongjoon-hyun/SPARK-47039.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 dev/checkstyle-suppressions.xml | 2 ++
 dev/checkstyle.xml  | 1 +
 2 files changed, 3 insertions(+)

diff --git a/dev/checkstyle-suppressions.xml b/dev/checkstyle-suppressions.xml
index 37c03759ad5e..7b20dfb6bce5 100644
--- a/dev/checkstyle-suppressions.xml
+++ b/dev/checkstyle-suppressions.xml
@@ -62,4 +62,6 @@
   
files="sql/api/src/main/java/org/apache/spark/sql/streaming/Trigger.java"/>
 
+
 
diff --git a/dev/checkstyle.xml b/dev/checkstyle.xml
index 5af15318081a..b9997d2050d1 100644
--- a/dev/checkstyle.xml
+++ b/dev/checkstyle.xml
@@ -186,6 +186,7 @@
 
 
 
+
 
 
 





(spark) branch master updated: [SPARK-46832][SQL] Introducing Collate and Collation expressions

2024-02-14 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 861cca3da4c4 [SPARK-46832][SQL] Introducing Collate and Collation 
expressions
861cca3da4c4 is described below

commit 861cca3da4c446761ccff007c89b214a691b0a72
Author: Aleksandar Tomic 
AuthorDate: Wed Feb 14 19:14:50 2024 +0300

[SPARK-46832][SQL] Introducing Collate and Collation expressions

### What changes were proposed in this pull request?

This PR adds E2E support for `collate` and `collation` expressions.
Following changes were made to get us there:
1) Set the right ordering for `PhysicalStringType` based on `collationId`.
2) UTF8String is now just a data holder class - it no longer implements the `Comparable` interface. All comparisons must be done through `CollationFactory`.
3) `collate` and `collation` expressions are added. Special syntax for `collate` is enabled - `'hello world' COLLATE 'target_collation'` (see the sketch after this list).
4) First set of tests is added that covers both core expression and E2E 
collation tests.
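
A hedged usage sketch of the new syntax and expressions (the collation name comes from the CollationFactory table in this diff; exact output is illustrative):

```
// Resolve and display the collation of an expression under the new syntax.
spark.sql("SELECT collation('hello world' COLLATE 'UCS_BASIC_LCASE')").show()

// Comparison follows the collation's ordering, so this is expected to be true.
spark.sql("SELECT 'AA' = ('aa' COLLATE 'UCS_BASIC_LCASE')").show()
```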

### Why are the changes needed?

This PR is part of the larger collation track. For more details, please refer to the design doc attached to the parent JIRA ticket.

### Does this PR introduce _any_ user-facing change?

This PR adds two new expressions and opens up new syntax.

### How was this patch tested?

Basic tests are added. In follow-up PRs we will add support for more advanced operators and keep adding tests alongside new feature support.

### Was this patch authored or co-authored using generative AI tooling?

Yes.

Closes #45064 from dbatomic/stringtype_compare.

Lead-authored-by: Aleksandar Tomic 
Co-authored-by: Stefan Kandic 
Signed-off-by: Max Gekk 
---
 .../spark/sql/catalyst/util/CollationFactory.java  |   5 +-
 .../org/apache/spark/unsafe/types/UTF8String.java  |  59 ++-
 .../apache/spark/unsafe/types/UTF8StringSuite.java |  24 +--
 .../types/UTF8StringPropertyCheckSuite.scala   |   2 +-
 .../spark/sql/catalyst/parser/SqlBaseParser.g4 |   1 +
 .../spark/sql/catalyst/encoders/RowEncoder.scala   |   2 +-
 .../org/apache/spark/sql/types/StringType.scala|  23 ++-
 .../sql/catalyst/CatalystTypeConverters.scala  |   2 +-
 .../sql/catalyst/analysis/FunctionRegistry.scala   |   2 +
 .../spark/sql/catalyst/encoders/EncoderUtils.scala |   2 +-
 .../sql/catalyst/expressions/ToStringBase.scala|   4 +-
 .../aggregate/BloomFilterAggregate.scala   |   4 +-
 .../expressions/codegen/CodeGenerator.scala|  13 +-
 .../expressions/collationExpressions.scala | 100 
 .../spark/sql/catalyst/parser/AstBuilder.scala |   8 +
 .../sql/catalyst/types/PhysicalDataType.scala  |   4 +-
 .../catalyst/expressions/CodeGenerationSuite.scala |   9 +-
 .../expressions/CollationExpressionSuite.scala |  77 +
 .../apache/spark/sql/execution/HiveResult.scala|   2 +-
 .../spark/sql/execution/columnar/ColumnStats.scala |   4 +-
 .../sql-functions/sql-expression-schema.md |   2 +
 .../org/apache/spark/sql/CollationSuite.scala  | 177 +
 .../sql/expressions/ExpressionInfoSuite.scala  |   5 +-
 23 files changed, 484 insertions(+), 47 deletions(-)

diff --git a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
index 018fb6cbeb9f..83cac849e848 100644
--- a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
+++ b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
@@ -112,7 +112,7 @@ public final class CollationFactory {
 collationTable[0] = new Collation(
   "UCS_BASIC",
   null,
-  UTF8String::compareTo,
+  UTF8String::binaryCompare,
   "1.0",
   s -> (long)s.hashCode(),
   true);
@@ -122,7 +122,7 @@ public final class CollationFactory {
 collationTable[1] = new Collation(
   "UCS_BASIC_LCASE",
   null,
-  Comparator.comparing(UTF8String::toLowerCase),
+  (s1, s2) -> s1.toLowerCase().binaryCompare(s2.toLowerCase()),
   "1.0",
   (s) -> (long)s.toLowerCase().hashCode(),
   false);
@@ -132,7 +132,6 @@ public final class CollationFactory {
   "UNICODE", Collator.getInstance(ULocale.ROOT), "153.120.0.0", true);
 collationTable[2].collator.setStrength(Collator.TERTIARY);
 
-
     // UNICODE case-insensitive comparison (ROOT locale, in ICU + Secondary strength).
 collationTable[3] = new Collation(
   "UNICODE_CI", Collator.getInstance(ULocale.ROOT), "153.120.0.0", false);
diff --git 
a/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java 
b/common/unsafe/src/main/java/o

(spark) branch master updated: [SPARK-47040][CONNECT] Allow Spark Connect Server Script to wait

2024-02-14 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a6bed5e9bcc5 [SPARK-47040][CONNECT] Allow Spark Connect Server Script 
to wait
a6bed5e9bcc5 is described below

commit a6bed5e9bcc54dac51421263d5ef73c0b6e0b12c
Author: Martin Grund 
AuthorDate: Wed Feb 14 03:03:30 2024 -0800

[SPARK-47040][CONNECT] Allow Spark Connect Server Script to wait

### What changes were proposed in this pull request?
Add an option to the command line of `./sbin/start-connect-server.sh` that 
leaves it running in the foreground for easier debugging.

```
./sbin/start-connect-server.sh --wait
```

### Why are the changes needed?
Usability

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Manual

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45090 from grundprinzip/start_server_wait.

Authored-by: Martin Grund 
Signed-off-by: Dongjoon Hyun 
---
 sbin/start-connect-server.sh | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/sbin/start-connect-server.sh b/sbin/start-connect-server.sh
index a347f43db8b1..fecda717eb34 100755
--- a/sbin/start-connect-server.sh
+++ b/sbin/start-connect-server.sh
@@ -38,4 +38,10 @@ fi
 
 . "${SPARK_HOME}/bin/load-spark-env.sh"
 
-exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Spark 
Connect server" "$@"
+if [ "$1" == "--wait" ]; then
+  shift
+  exec "${SPARK_HOME}"/bin/spark-submit --class $CLASS 1 --name "Spark Connect 
Server" "$@"
+else
+  exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Spark 
Connect server" "$@"
+fi
+

