svn commit: r69098 - /dev/spark/v4.0.0-preview1-rc1-bin/
Author: wenchen
Date: Sat May 11 04:28:26 2024
New Revision: 69098

Log:
Apache Spark v4.0.0-preview1-rc1

Added:
    dev/spark/v4.0.0-preview1-rc1-bin/
    dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz   (with props)
    dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.asc
    dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.sha512
    dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz   (with props)
    dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc
    dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512
    dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.sha512
    dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz   (with props)
    dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.asc
    dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.sha512
    dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz   (with props)
    dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.asc
    dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.sha512
    dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz   (with props)
    dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz.asc
    dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz.sha512

Added: dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz
==============================================================================
Binary file - no diff available.
Propchange: dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.asc
==============================================================================
--- dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.asc (added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.asc Sat May 11 04:28:26 2024
@@ -0,0 +1,17 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQJHBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmY+8UQTHHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/WkV1D/44BoMRwBQPQybc9ldlemMhKNQ/1OLB
+mUwhLpeUryOpUjO8AXa60YBajHqg9hivRxAUiuoaBSn7HjWY+3+nwkbcA7ZyMaV2
+Hgvfu4orB2kYXx4JgiE+dd2Zbuq+HFTv32dDUe+FyiHvhFw/bL0TIYUNJfKNcBtq
+KZDl9K5wemNjmpUSQAfEh3/vkikv5xOGxV+yEohgpB3t5Wg3hTETISXLfx/mHDu5
+GPjdCZ1omcqxZsV16CFZHV/uzK5aEDXfPdo2OO5V94xyQL0EQaMnzzMUdHkxPJ3p
+747tTf/q5rXHOb7S67MtNoBZ8myR23mQGJTwlV6E8CJWcbH7R6SEHekG9kIPGd3i
+UHoBAmroi+KfAdRej2Nqvz7SfeDeAmFw2kBRIm42FYWIqalAqbKU9LlXSpjyvYkO
+82df+5mwOzJf5VSU9D3krmjqWMFdjlLbDI1O1hLMNHyZkCYzPf+pmFhABsfGMXZH
+D8vURqF5aL9BmEuwi1SF0zSa9bI0otQj0DBvCbZnUeULSHB+P/eFqHoXjtNX2ArB
+43zmyaDywfqPXoMItvb+sGGUvatbLTCjjl6yfwgZEKOHs5noCygmL1WoLVQV+UYe
+UXb/hOJrP4FdUARpnMmz6R0NYSgQ7RZ7lOjQqs3VB7W1ashh0EWDD1hbeqMpvdx/
++fBbOLMrdzxifw==
+=2il7
+-----END PGP SIGNATURE-----

Added: dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.sha512
==============================================================================
--- dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.sha512 (added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.sha512 Sat May 11 04:28:26 2024
@@ -0,0 +1 @@
+60c0f5348da36d3399b596648e104202b2e9925a5b52694bf83cec9be1b4e78db6b4aa7f2f9257bca74dd514dca176ab6b51ab4c0abad2b31fb3fc5b5c14 SparkR_4.0.0-preview1.tar.gz

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz
==============================================================================
Binary file - no diff available.
Propchange: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc
==============================================================================
--- dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc (added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc Sat May 11 04:28:26 2024
@@ -0,0 +1,17 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQJHBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmY+8UYTHHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/WsnjD/4m0Dyb8ZcxS/JScvFxl3eg7KRWi8d8
+bGHs/pHZxdwS/HUkBRtv0w6HXJV6ZtQW1CPtbZ0VKOqElUfGPS/VaxE91I7c2Vmb
++/P2/buVX6fBlF+vIUPECyVgblnhBeZKbBb5Wcz3xpL1Jfj/6qi3o9uLnFFfy55S
+N6FWIJ5xrjl9mlo6+s4qqL/06u982NaEyUsu51eNgapTQcNUAjFKme13WC3W7n0S
+i6ixtW1oXmfY74CzSfn6KNC+5QvxKwJznS7ZxrG3g/chcaR8rApUZ526v4XL7LP0
+BDNeqCI+blAjVYFUzBIkvZp8SR/BbJv2HSySq5hbf0S6l0O+iuj8tZ/oa8Z0hCNf
+lXUw2ORG7RJKUZePdC+F+vYrmISyDRiWb4ddSUAjkzXy8KEWw6y55VULCq4vHbDc
+1Zwmf2izaujavcSJMjBnMhoZZ1PBlxgVQwHYu0Pi3qLCxyIn4oTd1wW7h6u5IGMr
++1LjMaGCrKbWSafp+cXGtzfJGjzPjCdIN2HqX6l53Vli4jn8I8yGJZs7hp+SZ281
svn commit: r69097 - /dev/spark/v4.0.0-preview1-rc1-bin/
Author: wenchen
Date: Sat May 11 03:59:33 2024
New Revision: 69097

Log:
prepare for re-uploading

Removed:
    dev/spark/v4.0.0-preview1-rc1-bin/

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
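Each binary artifact staged above ships with a detached `.asc` PGP signature and a `.sha512` checksum file, which release voters are expected to verify. A minimal, hedged sketch of the checksum side in Java (the payload here is a stand-in for a downloaded tarball; in a real check you would stream the file and compare against the first field of the matching `.sha512` file):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class Sha512Check {
    // Hex-encode a digest, matching the lowercase-hex format of the .sha512 files.
    static String hex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in bytes; a real verification would read the downloaded tarball.
        byte[] payload = "example artifact bytes".getBytes(StandardCharsets.UTF_8);
        MessageDigest md = MessageDigest.getInstance("SHA-512");
        String actual = hex(md.digest(payload));
        // A SHA-512 digest is 64 bytes, i.e. 128 hex characters.
        System.out.println(actual.length());
    }
}
```

Comparing this hex string against the recorded one (and checking the `.asc` signature with GPG) is what catches a corrupted or tampered download.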
(spark) branch master updated: [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f699f556d8a0 [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script
f699f556d8a0 is described below

commit f699f556d8a09bb755e9c8558661a36fbdb42e73
Author: panbingkun
AuthorDate: Fri May 10 19:54:29 2024 -0700

    [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script

    ### What changes were proposed in this pull request?
    The pr aims to delete the dir `dev/pr-deps` after executing `test-dependencies.sh`.

    ### Why are the changes needed?
    We'd better clean the `temporary files` generated at the end.

    Before:
    ```
    sh dev/test-dependencies.sh
    ```
    https://github.com/apache/spark/assets/15246973/39a56983-774c-4c2d-897d-26a7d0999456

    After:
    ```
    sh dev/test-dependencies.sh
    ```
    https://github.com/apache/spark/assets/15246973/f7e76e22-63cf-4411-99d0-5e844f8d5a7a

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Manually test.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #46531 from panbingkun/minor_test-dependencies.

    Authored-by: panbingkun
    Signed-off-by: Dongjoon Hyun
---
 dev/test-dependencies.sh | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/dev/test-dependencies.sh b/dev/test-dependencies.sh
index 048c59f4cec9..e645a66165a2 100755
--- a/dev/test-dependencies.sh
+++ b/dev/test-dependencies.sh
@@ -140,4 +140,8 @@ for HADOOP_HIVE_PROFILE in "${HADOOP_HIVE_PROFILES[@]}"; do
   fi
 done
 
+if [[ -d "$FWDIR/dev/pr-deps" ]]; then
+  rm -rf "$FWDIR/dev/pr-deps"
+fi
+
 exit 0

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
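The patch above applies a guard-then-delete pattern (`[[ -d ... ]]` before `rm -rf`) for a directory the script generates. The same idea sketched in Java, as an illustration only (class and directory names are made up for the example; the try/finally mirrors running cleanup even when the "script body" fails):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class CleanupTempDir {
    // Recursive delete, the Java analogue of `rm -rf "$FWDIR/dev/pr-deps"`.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) return;              // mirrors the `[[ -d ... ]]` guard
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder())   // delete children before parents
                .forEach(p -> p.toFile().delete());
        }
    }

    public static void main(String[] args) throws IOException {
        Path work = Files.createTempDirectory("pr-deps");
        try {
            // Stand-in for the dependency manifests the real script writes.
            Files.writeString(work.resolve("deps.txt"), "generated output");
        } finally {
            deleteRecursively(work);                 // always runs, even on failure
        }
        System.out.println(Files.exists(work));      // false: directory is gone
    }
}
```

In shell, the equivalent hardening would be a `trap ... EXIT`, which also covers early exits; the committed patch deletes only on the normal path.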
(spark) branch branch-3.4 updated: [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new 1e0fc1ef96aa [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script
1e0fc1ef96aa is described below

commit 1e0fc1ef96aa6f541134224f1ba626f234442e74
Author: panbingkun
AuthorDate: Fri May 10 19:54:29 2024 -0700

    [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script

    ### What changes were proposed in this pull request?
    The pr aims to delete the dir `dev/pr-deps` after executing `test-dependencies.sh`.

    ### Why are the changes needed?
    We'd better clean the `temporary files` generated at the end.

    Before:
    ```
    sh dev/test-dependencies.sh
    ```
    https://github.com/apache/spark/assets/15246973/39a56983-774c-4c2d-897d-26a7d0999456

    After:
    ```
    sh dev/test-dependencies.sh
    ```
    https://github.com/apache/spark/assets/15246973/f7e76e22-63cf-4411-99d0-5e844f8d5a7a

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Manually test.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #46531 from panbingkun/minor_test-dependencies.

    Authored-by: panbingkun
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit f699f556d8a09bb755e9c8558661a36fbdb42e73)
    Signed-off-by: Dongjoon Hyun
---
 dev/test-dependencies.sh | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/dev/test-dependencies.sh b/dev/test-dependencies.sh
index 2268a262d5f8..2907ef27189c 100755
--- a/dev/test-dependencies.sh
+++ b/dev/test-dependencies.sh
@@ -144,4 +144,8 @@ for HADOOP_HIVE_PROFILE in "${HADOOP_HIVE_PROFILES[@]}"; do
   fi
 done
 
+if [[ -d "$FWDIR/dev/pr-deps" ]]; then
+  rm -rf "$FWDIR/dev/pr-deps"
+fi
+
 exit 0

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new e9a1b4254419 [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script
e9a1b4254419 is described below

commit e9a1b4254419c751e612cd5e5c56f111b41399e7
Author: panbingkun
AuthorDate: Fri May 10 19:54:29 2024 -0700

    [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script

    ### What changes were proposed in this pull request?
    The pr aims to delete the dir `dev/pr-deps` after executing `test-dependencies.sh`.

    ### Why are the changes needed?
    We'd better clean the `temporary files` generated at the end.

    Before:
    ```
    sh dev/test-dependencies.sh
    ```
    https://github.com/apache/spark/assets/15246973/39a56983-774c-4c2d-897d-26a7d0999456

    After:
    ```
    sh dev/test-dependencies.sh
    ```
    https://github.com/apache/spark/assets/15246973/f7e76e22-63cf-4411-99d0-5e844f8d5a7a

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Manually test.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #46531 from panbingkun/minor_test-dependencies.

    Authored-by: panbingkun
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit f699f556d8a09bb755e9c8558661a36fbdb42e73)
    Signed-off-by: Dongjoon Hyun
---
 dev/test-dependencies.sh | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/dev/test-dependencies.sh b/dev/test-dependencies.sh
index d7967ac3afa9..36cc7a4f994d 100755
--- a/dev/test-dependencies.sh
+++ b/dev/test-dependencies.sh
@@ -140,4 +140,8 @@ for HADOOP_HIVE_PROFILE in "${HADOOP_HIVE_PROFILES[@]}"; do
   fi
 done
 
+if [[ -d "$FWDIR/dev/pr-deps" ]]; then
+  rm -rf "$FWDIR/dev/pr-deps"
+fi
+
 exit 0

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48205][SQL][FOLLOWUP] Add missing tags for the dataSource API
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new d82458f15539 [SPARK-48205][SQL][FOLLOWUP] Add missing tags for the dataSource API
d82458f15539 is described below

commit d82458f15539eef8df320345a7c2382ca4d5be8a
Author: allisonwang-db
AuthorDate: Fri May 10 16:31:47 2024 -0700

    [SPARK-48205][SQL][FOLLOWUP] Add missing tags for the dataSource API

    ### What changes were proposed in this pull request?
    This is a follow-up PR for https://github.com/apache/spark/pull/46487 to add missing tags for the `dataSource` API.

    ### Why are the changes needed?
    To address comments from a previous PR.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    Existing test

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #46530 from allisonwang-db/spark-48205-followup.

    Authored-by: allisonwang-db
    Signed-off-by: Dongjoon Hyun
---
 sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
index d5de74455dce..466e4cf81318 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
@@ -233,7 +233,11 @@ class SparkSession private(
 
   /**
    * A collection of methods for registering user-defined data sources.
+   *
+   * @since 4.0.0
    */
+  @Experimental
+  @Unstable
   def dataSource: DataSourceRegistration = sessionState.dataSourceRegistration
 
   /**

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48236][BUILD] Add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5b3b8a90638c [SPARK-48236][BUILD] Add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars
5b3b8a90638c is described below

commit 5b3b8a90638c49fc7ddcace69a85989c1053f1ab
Author: Dongjoon Hyun
AuthorDate: Fri May 10 15:48:08 2024 -0700

    [SPARK-48236][BUILD] Add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars

    ### What changes were proposed in this pull request?
    This PR aims to add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars. This is a partial revert of SPARK-47018.

    ### Why are the changes needed?
    Recently, we dropped `commons-lang:commons-lang` during the Hive upgrade.
    - #46468

    However, only Apache Hive 2.3.10 and 4.0.0 dropped it. In other words, Hive 2.0.0 ~ 2.3.9 and Hive 3.0.0 ~ 3.1.3 still require it. As a result, all existing UDF jars built against those versions still require `commons-lang:commons-lang`.
    - https://github.com/apache/hive/pull/4892

    For example, Apache Hive 3.1.3 code:
    - https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFTrim.java#L21
    ```
    import org.apache.commons.lang.StringUtils;
    ```
    - https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFTrim.java#L42
    ```
    return StringUtils.strip(val, " ");
    ```

    As a result, Maven CIs are broken.
    - https://github.com/apache/spark/actions/runs/9032639456/job/24825599546 (Maven / Java 17)
    - https://github.com/apache/spark/actions/runs/9033374547/job/24835284769 (Maven / Java 21)

    The root cause is that the existing test UDF jar `hive-test-udfs.jar` was built from old Hive (before 2.3.10) libraries which require `commons-lang:commons-lang:2.6`.

    ```
    HiveUDFDynamicLoadSuite:
    - Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDF
    20:21:25.129 WARN org.apache.spark.SparkContext: The JAR file:///home/runner/work/spark/spark/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:33327/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.
    *** RUN ABORTED ***
      A needed class was not found. This could be due to an error in your runpath. Missing class: org/apache/commons/lang/StringUtils
      java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
      at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
      at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
      at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
      at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
      at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
      at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
      at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
      at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:184)
      at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:164)
      at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:185)
      ...
    Cause: java.lang.ClassNotFoundException: org.apache.commons.lang.StringUtils
      at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
      at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:593)
      at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:526)
      at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
      at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
      at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
      at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
      at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
      at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
      at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
      ...
    ```

    ### Does this PR introduce _any_ user-facing change?
    To support the existing customer
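The failure mode above is a class missing at link time: the UDF jar references `org.apache.commons.lang.StringUtils`, and the JVM throws `NoClassDefFoundError` the first time a method that uses it is executed. A hedged sketch of probing for such a legacy dependency up front with `Class.forName` (the probe itself is illustrative, not Spark's approach — the committed fix simply ships the jar again):

```java
public class ClassProbe {
    // Returns true if the named class can be resolved by the given loader.
    // `initialize = false` avoids running static initializers during the probe.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ClassProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Always present on any JVM:
        System.out.println(isPresent("java.lang.String"));
        // Present only when commons-lang 2.x is on the classpath, which is
        // exactly what the broken Maven CI run above was missing:
        System.out.println(isPresent("org.apache.commons.lang.StringUtils"));
    }
}
```

Such a probe lets a plugin host fail fast with a clear message ("add commons-lang 2.6 to the classpath") instead of aborting mid-query as in the log above.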
(spark) branch master updated: Revert "[SPARK-48230][BUILD] Remove unused `jodd-core`"
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 726ef8aa66ea Revert "[SPARK-48230][BUILD] Remove unused `jodd-core`"
726ef8aa66ea is described below

commit 726ef8aa66ea6e56b739f3b16f99e457a0febb81
Author: Dongjoon Hyun
AuthorDate: Fri May 10 15:34:12 2024 -0700

    Revert "[SPARK-48230][BUILD] Remove unused `jodd-core`"

    This reverts commit d8151186d79459fbde27a01bd97328e73548c55a.
---
 LICENSE-binary                        |  1 +
 dev/deps/spark-deps-hadoop-3-hive-2.3 |  1 +
 licenses-binary/LICENSE-jodd.txt      | 24 ++++++++++++++++++++
 pom.xml                               |  6 ++++++
 sql/hive/pom.xml                      |  4 ++++
 5 files changed, 36 insertions(+)

diff --git a/LICENSE-binary b/LICENSE-binary
index 034215f0ab15..40271c9924bc 100644
--- a/LICENSE-binary
+++ b/LICENSE-binary
@@ -436,6 +436,7 @@ com.esotericsoftware:reflectasm
 org.codehaus.janino:commons-compiler
 org.codehaus.janino:janino
 jline:jline
+org.jodd:jodd-core
 com.github.wendykierp:JTransforms
 pl.edu.icm:JLargeArrays

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 29997815e5bc..392bacd73277 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -143,6 +143,7 @@ jline/2.14.6//jline-2.14.6.jar
 jline/3.24.1//jline-3.24.1.jar
 jna/5.13.0//jna-5.13.0.jar
 joda-time/2.12.7//joda-time-2.12.7.jar
+jodd-core/3.5.2//jodd-core-3.5.2.jar
 jpam/1.1//jpam-1.1.jar
 json/1.8//json-1.8.jar
 json4s-ast_2.13/4.0.7//json4s-ast_2.13-4.0.7.jar

diff --git a/licenses-binary/LICENSE-jodd.txt b/licenses-binary/LICENSE-jodd.txt
new file mode 100644
index ..cc6b458adb38
--- /dev/null
+++ b/licenses-binary/LICENSE-jodd.txt
@@ -0,0 +1,24 @@
+Copyright (c) 2003-present, Jodd Team (https://jodd.org)
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice,
+this list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright
+notice, this list of conditions and the following disclaimer in the
+documentation and/or other materials provided with the distribution.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGE.
\ No newline at end of file

diff --git a/pom.xml b/pom.xml
index a98efe8aed1e..56a34cedde51 100644
--- a/pom.xml
+++ b/pom.xml
@@ -201,6 +201,7 @@
 3.1.9
 3.0.12
 2.12.7
+3.5.2
 3.0.0
 2.2.11
 0.16.0
@@ -2782,6 +2783,11 @@
 joda-time
 ${joda.version}
+
+org.jodd
+jodd-core
+${jodd.version}
+
 org.datanucleus
 datanucleus-core

diff --git a/sql/hive/pom.xml b/sql/hive/pom.xml
index 5e9fc256e7e6..3895d9dc5a63 100644
--- a/sql/hive/pom.xml
+++ b/sql/hive/pom.xml
@@ -152,6 +152,10 @@
   joda-time
   joda-time
+
+  org.jodd
+  jodd-core
+
 com.google.code.findbugs
 jsr305

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (a6632ffa16f6 -> 2225aa1dab0f)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from a6632ffa16f6 [SPARK-48143][SQL] Use lightweight exceptions for control-flow between UnivocityParser and FailureSafeParser
     add 2225aa1dab0f [SPARK-48144][SQL] Fix `canPlanAsBroadcastHashJoin` to respect shuffle join hints

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/optimizer/joins.scala   | 38 ++
 .../spark/sql/execution/SparkStrategies.scala  | 17 --
 .../scala/org/apache/spark/sql/JoinSuite.scala | 26 +--
 3 files changed, 55 insertions(+), 26 deletions(-)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r69092 - in /dev/spark/v4.0.0-preview1-rc1-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/R/articles/ _site/api/R/articles/sparkr-vignettes_files/ _site/api/R/articles/sparkr-vignettes_
Author: wenchen
Date: Fri May 10 16:44:08 2024
New Revision: 69092

Log:
Apache Spark v4.0.0-preview1-rc1 docs

[This commit notification would consist of 4810 parts,
which exceeds the limit of 50 ones, so it was shortened to the summary.]

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48143][SQL] Use lightweight exceptions for control-flow between UnivocityParser and FailureSafeParser
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new a6632ffa16f6 [SPARK-48143][SQL] Use lightweight exceptions for control-flow between UnivocityParser and FailureSafeParser
a6632ffa16f6 is described below

commit a6632ffa16f6907eba96e745920d571924bf4b63
Author: Vladimir Golubev
AuthorDate: Sat May 11 00:37:54 2024 +0800

    [SPARK-48143][SQL] Use lightweight exceptions for control-flow between UnivocityParser and FailureSafeParser

    ### What changes were proposed in this pull request?
    A new lightweight exception for control-flow between UnivocityParser and FailureSafeParser to speed up malformed CSV parsing. This is a different way to implement these reverted changes: https://github.com/apache/spark/pull/46478

    The previous implementation was more invasive - removing `cause` from `BadRecordException` could break upper code, which unwraps errors and checks the types of the causes. This implementation only touches `FailureSafeParser` and `UnivocityParser`, since in the codebase they are always used together, unlike `JacksonParser` and `StaxXmlParser`. Removing the stacktrace from `BadRecordException` is safe, since the cause itself has an adequate stacktrace (except pure control-flow cases).

    ### Why are the changes needed?
    Parsing in `PermissiveMode` is slow due to heavy exception construction (stacktrace filling + string template substitution in `SparkRuntimeException`).

    ### Does this PR introduce _any_ user-facing change?
    No, since `FailureSafeParser` unwraps `BadRecordException` and correctly rethrows user-facing exceptions in `FailFastMode`.

    ### How was this patch tested?
    - `testOnly org.apache.spark.sql.catalyst.csv.UnivocityParserSuite`
    - Manually run csv benchmark
    - Manually checked correct and malformed csv in spark-shell (org.apache.spark.SparkException is thrown with the stacktrace)

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #46500 from vladimirg-db/vladimirg-db/use-special-lighweight-exception-for-control-flow-between-univocity-parser-and-failure-safe-parser.

    Authored-by: Vladimir Golubev
    Signed-off-by: Wenchen Fan
---
 .../spark/sql/catalyst/csv/UnivocityParser.scala |  5 +++--
 .../sql/catalyst/util/BadRecordException.scala   | 22 +++---
 .../sql/catalyst/util/FailureSafeParser.scala    | 11 +--
 3 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
index a5158d8a22c6..4d95097e1681 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
@@ -316,7 +316,7 @@ class UnivocityParser(
       throw BadRecordException(
         () => getCurrentInput,
         () => Array.empty,
-        QueryExecutionErrors.malformedCSVRecordError(""))
+        LazyBadRecordCauseWrapper(() => QueryExecutionErrors.malformedCSVRecordError("")))
     }
     val currentInput = getCurrentInput
@@ -326,7 +326,8 @@
       // However, we still have chance to parse some of the tokens. It continues to parses the
       // tokens normally and sets null when `ArrayIndexOutOfBoundsException` occurs for missing
       // tokens.
-      Some(QueryExecutionErrors.malformedCSVRecordError(currentInput.toString))
+      Some(LazyBadRecordCauseWrapper(
+        () => QueryExecutionErrors.malformedCSVRecordError(currentInput.toString)))
     } else None
     // When the length of the returned tokens is identical to the length of the parsed schema,
     // we just need to:
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala
index 65a56c1064e4..654b0b8c73e5 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala
@@ -67,16 +67,32 @@ case class PartialResultArrayException(
   extends Exception(cause)
 
 /**
- * Exception thrown when the underlying parser meet a bad record and can't parse it.
+ * Exception thrown when the underlying parser met a bad record and can't parse it.
+ * The stacktrace is not collected for better preformance, and thus, this exception should
+ * not be used in a user-facing context.
  * @param record a function to return the record that cause the parser to fail
 * @param partialResults a function that returns an row
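The commit's core idea is that on the JVM, most of the cost of `throw` is capturing the stack trace at construction time; an exception used purely for control flow can skip that. The Scala patch does this for `BadRecordException` (plus lazily building the cause); a minimal Java illustration of the underlying JVM mechanism, using the protected `Throwable` constructor that disables stack-trace capture:

```java
public class StacklessDemo {
    // A control-flow exception that never fills in a stack trace.
    // The four-argument Throwable constructor lets subclasses disable both
    // suppression and the writable stack trace.
    static final class BadRecord extends RuntimeException {
        BadRecord(String msg) {
            super(msg, /*cause*/ null,
                  /*enableSuppression*/ false,
                  /*writableStackTrace*/ false);
        }
    }

    public static void main(String[] args) {
        try {
            throw new BadRecord("malformed CSV record");
        } catch (BadRecord e) {
            // No trace was captured, so the array is empty:
            System.out.println(e.getStackTrace().length);
            System.out.println(e.getMessage());
        }
    }
}
```

This is why such exceptions must stay internal, exactly as the patched scaladoc warns: if one escapes to the user, there is no stack trace to debug with. (An equivalent Scala idiom is overriding `fillInStackTrace()` to return `this`.)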
(spark) branch master updated: [SPARK-47793][TEST][FOLLOWUP] Fix flaky test for Python data source exactly once
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5beaf85cd5ef [SPARK-47793][TEST][FOLLOWUP] Fix flaky test for Python data source exactly once
5beaf85cd5ef is described below

commit 5beaf85cd5ef2b84a67ebce712e8d73d1e7d41ff
Author: Chaoqin Li
AuthorDate: Fri May 10 08:24:42 2024 -0700

    [SPARK-47793][TEST][FOLLOWUP] Fix flaky test for Python data source exactly once

    ### What changes were proposed in this pull request?
    Fix the flakiness in the Python streaming source exactly-once test. The last executed batch may not be recorded in query progress, which causes the expected rows not to match. This fix takes the uncompleted batch into account and relaxes the condition.

    ### Why are the changes needed?
    Fix flaky test.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Test change.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #46481 from chaoqin-li1123/fix_python_ds_test.

    Authored-by: Chaoqin Li
    Signed-off-by: Dongjoon Hyun
---
 .../execution/python/PythonStreamingDataSourceSuite.scala | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala
index 97e6467c3eaf..d1f7c597b308 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala
@@ -299,7 +299,7 @@ class PythonStreamingDataSourceSuite extends PythonDataSourceSuiteBase {
     val checkpointDir = new File(path, "checkpoint")
     val outputDir = new File(path, "output")
     val df = spark.readStream.format(dataSourceName).load()
-    var lastBatch = 0
+    var lastBatchId = 0
     // Restart streaming query multiple times to verify exactly once guarantee.
     for (i <- 1 to 5) {
@@ -323,11 +323,15 @@
       }
       q.stop()
       q.awaitTermination()
-      lastBatch = q.lastProgress.batchId.toInt
+      lastBatchId = q.lastProgress.batchId.toInt
     }
-    assert(lastBatch > 20)
+    assert(lastBatchId > 20)
+    val rowCount = spark.read.format("json").load(outputDir.getAbsolutePath).count()
+    // There may be one uncommitted batch that is not recorded in query progress.
+    // The number of batch can be lastBatchId + 1 or lastBatchId + 2.
+    assert(rowCount == 2 * (lastBatchId + 1) || rowCount == 2 * (lastBatchId + 2))
     checkAnswer(spark.read.format("json").load(outputDir.getAbsolutePath),
-      (0 to 2 * lastBatch + 1).map(Row(_)))
+      (0 until rowCount.toInt).map(Row(_)))
   }
 }

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
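The relaxed assertion rests on a small piece of arithmetic: each micro-batch writes two rows, and after `q.stop()` one extra batch may have committed to the sink without ever appearing in the query progress. So for a last *reported* batch id `N`, the sink can legitimately hold rows for `N + 1` or `N + 2` completed batches. A sketch of that acceptance check in Java (method and class names are made up for the example):

```java
public class BatchCountCheck {
    // Two rows per micro-batch; one trailing batch may be committed to the
    // sink but missing from the reported progress, so two totals are valid.
    static boolean plausibleRowCount(int lastBatchId, long rowCount) {
        return rowCount == 2L * (lastBatchId + 1)
            || rowCount == 2L * (lastBatchId + 2);
    }

    public static void main(String[] args) {
        System.out.println(plausibleRowCount(20, 42)); // 2 * (20 + 1) -> true
        System.out.println(plausibleRowCount(20, 44)); // 2 * (20 + 2) -> true
        System.out.println(plausibleRowCount(20, 46)); // too many rows -> false
    }
}
```

Deriving the expected answer from the *observed* row count (as the patched `checkAnswer` does) rather than from the reported batch id is what removes the race from the test.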
(spark) branch master updated: [SPARK-47441][YARN] Do not add log link for unmanaged AM in Spark UI
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c5b6ec734bd0 [SPARK-47441][YARN] Do not add log link for unmanaged AM in Spark UI c5b6ec734bd0 is described below commit c5b6ec734bd0c47551b59f9de13c6323b80974b2 Author: Yuming Wang AuthorDate: Fri May 10 08:22:03 2024 -0700 [SPARK-47441][YARN] Do not add log link for unmanaged AM in Spark UI ### What changes were proposed in this pull request? This PR makes it do not add log link for unmanaged AM in Spark UI. ### Why are the changes needed? Avoid start driver error messages: ``` 24/03/18 04:58:25,022 ERROR [spark-listener-group-appStatus] scheduler.AsyncEventQueue:97 : Listener AppStatusListener threw an exception java.lang.NumberFormatException: For input string: "null" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:67) ~[?:?] at java.lang.Integer.parseInt(Integer.java:668) ~[?:?] at java.lang.Integer.parseInt(Integer.java:786) ~[?:?] at scala.collection.immutable.StringLike.toInt(StringLike.scala:310) ~[scala-library-2.12.18.jar:?] at scala.collection.immutable.StringLike.toInt$(StringLike.scala:310) ~[scala-library-2.12.18.jar:?] at scala.collection.immutable.StringOps.toInt(StringOps.scala:33) ~[scala-library-2.12.18.jar:?] 
at org.apache.spark.util.Utils$.parseHostPort(Utils.scala:1105) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.ProcessSummaryWrapper.(storeTypes.scala:609) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.LiveMiscellaneousProcess.doUpdate(LiveEntity.scala:1045) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1233) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.AppStatusListener.onMiscellaneousProcessAdded(AppStatusListener.scala:1445) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.AppStatusListener.onOtherEvent(AppStatusListener.scala:113) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) ~[spark-core_2.12-3.5.1.jar:3.5.1] at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) ~[scala-library-2.12.18.jar:?] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) ~[scala-library-2.12.18.jar:?] 
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1356) [spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) [spark-core_2.12-3.5.1.jar:3.5.1] ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual testing: ```shell bin/spark-sql --master yarn --conf spark.yarn.unmanagedAM.enabled=true ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45565 from wangyum/SPARK-47441. Authored-by: Yuming Wang Signed-off-by: Dongjoon Hyun --- .../main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git
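The failure in the stack trace above is a plain parse error: for an unmanaged AM the node HTTP address is absent, so the log URL carries the literal string "null" where a port number is expected. A rough Python analogue of `Utils.parseHostPort` (illustrative only; the real helper is Scala and handles more edge cases):

```python
def parse_host_port(host_port: str):
    # Split on the last ':' and parse the remainder as a port, mirroring
    # the Scala helper's toInt call. With an unmanaged AM the port field
    # is the string "null", so int() raises a ValueError, just as the
    # JVM threw NumberFormatException in the listener above.
    host, sep, port = host_port.rpartition(":")
    if not sep:
        return host_port, -1  # no port present
    return host, int(port)

parse_host_port("node1.example.com:8042")  # → ('node1.example.com', 8042)
```

Skipping the log link entirely for the unmanaged AM avoids ever handing such a malformed host:port string to the status listener.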
(spark) branch master updated: [SPARK-48235][SQL] Directly pass join instead of all arguments to getBroadcastBuildSide and getShuffleHashJoinBuildSide
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 73bb619d45b2 [SPARK-48235][SQL] Directly pass join instead of all arguments to getBroadcastBuildSide and getShuffleHashJoinBuildSide 73bb619d45b2 is described below commit 73bb619d45b2d0699ca4a9d251eea57c359f275b Author: fred-db AuthorDate: Fri May 10 07:45:28 2024 -0700 [SPARK-48235][SQL] Directly pass join instead of all arguments to getBroadcastBuildSide and getShuffleHashJoinBuildSide ### What changes were proposed in this pull request? * Refactor getBroadcastBuildSide and getShuffleHashJoinBuildSide to pass the join as an argument instead of all member variables of the join separately. ### Why are the changes needed? * Makes the code easier to read. ### Does this PR introduce _any_ user-facing change? * no ### How was this patch tested? * Existing UTs ### Was this patch authored or co-authored using generative AI tooling? * No Closes #46525 from fred-db/parameter-change.
Authored-by: fred-db Signed-off-by: Dongjoon Hyun --- .../spark/sql/catalyst/optimizer/joins.scala | 56 +--- .../optimizer/JoinSelectionHelperSuite.scala | 59 +- .../spark/sql/execution/SparkStrategies.scala | 6 +-- 3 files changed, 40 insertions(+), 81 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala index 2b4ee033b088..5571178832db 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala @@ -289,58 +289,52 @@ case object BuildLeft extends BuildSide trait JoinSelectionHelper { def getBroadcastBuildSide( - left: LogicalPlan, - right: LogicalPlan, - joinType: JoinType, - hint: JoinHint, + join: Join, hintOnly: Boolean, conf: SQLConf): Option[BuildSide] = { val buildLeft = if (hintOnly) { - hintToBroadcastLeft(hint) + hintToBroadcastLeft(join.hint) } else { - canBroadcastBySize(left, conf) && !hintToNotBroadcastLeft(hint) + canBroadcastBySize(join.left, conf) && !hintToNotBroadcastLeft(join.hint) } val buildRight = if (hintOnly) { - hintToBroadcastRight(hint) + hintToBroadcastRight(join.hint) } else { - canBroadcastBySize(right, conf) && !hintToNotBroadcastRight(hint) + canBroadcastBySize(join.right, conf) && !hintToNotBroadcastRight(join.hint) } getBuildSide( - canBuildBroadcastLeft(joinType) && buildLeft, - canBuildBroadcastRight(joinType) && buildRight, - left, - right + canBuildBroadcastLeft(join.joinType) && buildLeft, + canBuildBroadcastRight(join.joinType) && buildRight, + join.left, + join.right ) } def getShuffleHashJoinBuildSide( - left: LogicalPlan, - right: LogicalPlan, - joinType: JoinType, - hint: JoinHint, + join: Join, hintOnly: Boolean, conf: SQLConf): Option[BuildSide] = { val buildLeft = if (hintOnly) { - hintToShuffleHashJoinLeft(hint) + hintToShuffleHashJoinLeft(join.hint) } else { - 
hintToPreferShuffleHashJoinLeft(hint) || -(!conf.preferSortMergeJoin && canBuildLocalHashMapBySize(left, conf) && - muchSmaller(left, right, conf)) || + hintToPreferShuffleHashJoinLeft(join.hint) || +(!conf.preferSortMergeJoin && canBuildLocalHashMapBySize(join.left, conf) && + muchSmaller(join.left, join.right, conf)) || forceApplyShuffledHashJoin(conf) } val buildRight = if (hintOnly) { - hintToShuffleHashJoinRight(hint) + hintToShuffleHashJoinRight(join.hint) } else { - hintToPreferShuffleHashJoinRight(hint) || -(!conf.preferSortMergeJoin && canBuildLocalHashMapBySize(right, conf) && - muchSmaller(right, left, conf)) || + hintToPreferShuffleHashJoinRight(join.hint) || +(!conf.preferSortMergeJoin && canBuildLocalHashMapBySize(join.right, conf) && + muchSmaller(join.right, join.left, conf)) || forceApplyShuffledHashJoin(conf) } getBuildSide( - canBuildShuffledHashJoinLeft(joinType) && buildLeft, - canBuildShuffledHashJoinRight(joinType) && buildRight, - left, - right + canBuildShuffledHashJoinLeft(join.joinType) && buildLeft, + canBuildShuffledHashJoinRight(join.joinType) && buildRight, + join.left, + join.right ) } @@ -401,10 +395,8 @@ trait JoinSelectionHelper { } def canPlanAsBroadcastHashJoin(join: Join, conf: SQLConf):
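The refactor in the diff above is the classic "introduce parameter object" move: instead of threading four fields of the join through every helper, the helpers take the join node itself and read its members. A language-neutral sketch in Python (names hypothetical, not the Spark API):

```python
from dataclasses import dataclass

@dataclass
class Join:
    left: str
    right: str
    join_type: str
    hint: str

# Before: callers must destructure the node at every call site, and the
# signature grows whenever a helper needs another field.
def build_side_v1(left, right, join_type, hint):
    return f"{join_type}:{left}->{right} ({hint})"

# After: pass the node; the signature stays stable as helpers evolve.
def build_side_v2(join: Join):
    return f"{join.join_type}:{join.left}->{join.right} ({join.hint})"

j = Join("a", "b", "inner", "broadcast")
assert build_side_v1(j.left, j.right, j.join_type, j.hint) == build_side_v2(j)
```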
(spark) branch master updated: [SPARK-48146][SQL] Fix aggregate function in With expression child assertion
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7ef0440ef221 [SPARK-48146][SQL] Fix aggregate function in With expression child assertion 7ef0440ef221 is described below commit 7ef0440ef22161a6160f7b9000c70b26c84eecf7 Author: Kelvin Jiang AuthorDate: Fri May 10 22:39:15 2024 +0800 [SPARK-48146][SQL] Fix aggregate function in With expression child assertion ### What changes were proposed in this pull request? In https://github.com/apache/spark/pull/46034, there was a complicated edge case where common expression references in aggregate functions in the child of a `With` expression could become dangling. An assertion was added to prevent that case from happening, but the assertion wasn't fully accurate, as a query like: ``` select id between max(if(id between 1 and 2, 2, 1)) over () and id from range(10) ``` would fail the assertion. This PR fixes the assertion to be more accurate. ### Why are the changes needed? This addresses a regression in https://github.com/apache/spark/pull/46034. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46443 from kelvinjian-db/SPARK-48146-agg.
Authored-by: Kelvin Jiang Signed-off-by: Wenchen Fan --- .../spark/sql/catalyst/expressions/With.scala | 26 + .../optimizer/RewriteWithExpressionSuite.scala | 27 +- 2 files changed, 48 insertions(+), 5 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala index 14deedd9c70f..29794b33641c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala @@ -17,7 +17,8 @@ package org.apache.spark.sql.catalyst.expressions -import org.apache.spark.sql.catalyst.trees.TreePattern.{AGGREGATE_EXPRESSION, COMMON_EXPR_REF, TreePattern, WITH_EXPRESSION} +import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression +import org.apache.spark.sql.catalyst.trees.TreePattern.{COMMON_EXPR_REF, TreePattern, WITH_EXPRESSION} import org.apache.spark.sql.types.DataType /** @@ -27,9 +28,11 @@ import org.apache.spark.sql.types.DataType */ case class With(child: Expression, defs: Seq[CommonExpressionDef]) extends Expression with Unevaluable { - // We do not allow With to be created with an AggregateExpression in the child, as this would - // create a dangling CommonExpressionRef after rewriting it in RewriteWithExpression. - assert(!child.containsPattern(AGGREGATE_EXPRESSION)) + // We do not allow creating a With expression with an AggregateExpression that contains a + // reference to a common expression defined in that scope (note that it can contain another With + // expression with a common expression ref of the inner With). This is to prevent the creation of + // a dangling CommonExpressionRef after rewriting it in RewriteWithExpression. 
+ assert(!With.childContainsUnsupportedAggExpr(this)) override val nodePatterns: Seq[TreePattern] = Seq(WITH_EXPRESSION) override def dataType: DataType = child.dataType @@ -92,6 +95,21 @@ object With { val commonExprRefs = commonExprDefs.map(new CommonExpressionRef(_)) With(replaced(commonExprRefs), commonExprDefs) } + + private[sql] def childContainsUnsupportedAggExpr(withExpr: With): Boolean = { +lazy val commonExprIds = withExpr.defs.map(_.id).toSet +withExpr.child.exists { + case agg: AggregateExpression => +// Check that the aggregate expression does not contain a reference to a common expression +// in the outer With expression (it is ok if it contains a reference to a common expression +// for a nested With expression). +agg.exists { + case r: CommonExpressionRef => commonExprIds.contains(r.id) + case _ => false +} + case _ => false +} + } } case class CommonExpressionId(id: Long = CommonExpressionId.newId, canonicalized: Boolean = false) { diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala index d482b18d9331..8f023fa4156b 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala +++
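The refined assertion's logic can be restated over a toy expression tree: an aggregate under the `With` child may not reference a common-expression id defined by *this* `With`, while references owned by a nested `With` (which carry different ids) are fine. A minimal Python restatement (dicts standing in for Catalyst expressions; not the actual Spark code):

```python
def exists(node, pred):
    # Depth-first search over the tree, like TreeNode.exists in Catalyst.
    return pred(node) or any(exists(c, pred) for c in node.get("children", ()))

def child_contains_unsupported_agg(def_ids, child):
    # Reject only aggregates that reference a common-expression id defined
    # by the enclosing With; ids minted by a nested With are not in def_ids.
    def bad_agg(n):
        return n.get("kind") == "agg" and exists(
            n, lambda m: m.get("kind") == "ref" and m.get("id") in def_ids)
    return exists(child, bad_agg)

# A ref to id 1 inside an aggregate is rejected when this With defines id 1...
agg = {"kind": "agg", "children": [{"kind": "ref", "id": 1}]}
assert child_contains_unsupported_agg({1}, agg)
# ...but a ref minted by a nested With (id 2 here) is allowed.
assert not child_contains_unsupported_agg(
    {1}, {"kind": "agg", "children": [{"kind": "ref", "id": 2}]})
```

The window-function query in the commit message passes the new check because its aggregate references no common expression of the enclosing `With`.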
(spark) branch master updated: [SPARK-48228][PYTHON][CONNECT][FOLLOWUP] Also apply `_validate_pandas_udf` in MapInXXX
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 259760a5c5e2 [SPARK-48228][PYTHON][CONNECT][FOLLOWUP] Also apply `_validate_pandas_udf` in MapInXXX 259760a5c5e2 is described below commit 259760a5c5e26e33b2ee46282aeb63e4ea701020 Author: Ruifeng Zheng AuthorDate: Fri May 10 18:44:53 2024 +0800 [SPARK-48228][PYTHON][CONNECT][FOLLOWUP] Also apply `_validate_pandas_udf` in MapInXXX ### What changes were proposed in this pull request? Also apply `_validate_pandas_udf` in MapInXXX ### Why are the changes needed? to make sure validation in `pandas_udf` is also applied in MapInXXX ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #46524 from zhengruifeng/missing_check_map_in_xxx. Authored-by: Ruifeng Zheng Signed-off-by: Ruifeng Zheng --- python/pyspark/sql/connect/dataframe.py | 2 ++ 1 file changed, 2 insertions(+) diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py index 3c9415adec2d..ccaaa15f3190 100644 --- a/python/pyspark/sql/connect/dataframe.py +++ b/python/pyspark/sql/connect/dataframe.py @@ -83,6 +83,7 @@ from pyspark.sql.connect.expressions import ( ) from pyspark.sql.connect.functions import builtin as F from pyspark.sql.pandas.types import from_arrow_schema +from pyspark.sql.pandas.functions import _validate_pandas_udf # type: ignore[attr-defined] if TYPE_CHECKING: @@ -1997,6 +1998,7 @@ class DataFrame(ParentDataFrame): ) -> ParentDataFrame: from pyspark.sql.connect.udf import UserDefinedFunction +_validate_pandas_udf(func, evalType) udf_obj = UserDefinedFunction( func, returnType=schema, - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: 
commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48232][PYTHON][TESTS] Fix 'pyspark.sql.tests.connect.test_connect_session' in Python 3.12 build
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 256a23883d90 [SPARK-48232][PYTHON][TESTS] Fix 'pyspark.sql.tests.connect.test_connect_session' in Python 3.12 build 256a23883d90 is described below commit 256a23883d901c78cf82b4c52e3373322309b8d1 Author: Hyukjin Kwon AuthorDate: Fri May 10 17:12:37 2024 +0900 [SPARK-48232][PYTHON][TESTS] Fix 'pyspark.sql.tests.connect.test_connect_session' in Python 3.12 build ### What changes were proposed in this pull request? This PR avoids importing `scipy.sparse` directly, which hangs nondeterministically, specifically with Python 3.12. ### Why are the changes needed? To fix the build with Python 3.12 https://github.com/apache/spark/actions/runs/9022174253/job/24804919747 I was able to reproduce this locally, though it is somewhat nondeterministic. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested locally. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46522 from HyukjinKwon/SPARK-48232. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- python/pyspark/testing/utils.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python/pyspark/testing/utils.py b/python/pyspark/testing/utils.py index fe25136864ee..8a7aa405e4ac 100644 --- a/python/pyspark/testing/utils.py +++ b/python/pyspark/testing/utils.py @@ -38,7 +38,7 @@ from itertools import zip_longest have_scipy = False have_numpy = False try: -import scipy.sparse # noqa: F401 +import scipy # noqa: F401 have_scipy = True except ImportError:
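The fix is the standard guarded-optional-import pattern, with the import moved from the submodule to the top-level package. The resulting shape of the helper, adapted from the diff above:

```python
have_scipy = False
try:
    # Import only the top-level package. The previous
    # 'import scipy.sparse' loaded the submodule eagerly at test-collection
    # time and could hang under the Python 3.12 CI environment.
    import scipy  # noqa: F401
    have_scipy = True
except ImportError:
    # scipy is an optional test dependency; tests needing it are skipped.
    pass
```

Tests that actually need `scipy.sparse` can still import it lazily inside the test body, after the `have_scipy` gate has passed.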
(spark) branch master updated: [SPARK-48230][BUILD] Remove unused `jodd-core`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d8151186d794 [SPARK-48230][BUILD] Remove unused `jodd-core` d8151186d794 is described below commit d8151186d79459fbde27a01bd97328e73548c55a Author: Cheng Pan AuthorDate: Fri May 10 01:09:01 2024 -0700 [SPARK-48230][BUILD] Remove unused `jodd-core` ### What changes were proposed in this pull request? Remove a jar that has CVE https://github.com/advisories/GHSA-jrg3-qq99-35g7 ### Why are the changes needed? Previously, `jodd-core` came from Hive transitive deps, while https://github.com/apache/hive/pull/5151 (Hive 2.3.10) cut it out, so we can remove it from Spark now. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46520 from pan3793/SPARK-48230. 
Authored-by: Cheng Pan Signed-off-by: Dongjoon Hyun --- LICENSE-binary| 1 - dev/deps/spark-deps-hadoop-3-hive-2.3 | 1 - licenses-binary/LICENSE-jodd.txt | 24 pom.xml | 6 -- sql/hive/pom.xml | 4 5 files changed, 36 deletions(-) diff --git a/LICENSE-binary b/LICENSE-binary index 40271c9924bc..034215f0ab15 100644 --- a/LICENSE-binary +++ b/LICENSE-binary @@ -436,7 +436,6 @@ com.esotericsoftware:reflectasm org.codehaus.janino:commons-compiler org.codehaus.janino:janino jline:jline -org.jodd:jodd-core com.github.wendykierp:JTransforms pl.edu.icm:JLargeArrays diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index 392bacd73277..29997815e5bc 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -143,7 +143,6 @@ jline/2.14.6//jline-2.14.6.jar jline/3.24.1//jline-3.24.1.jar jna/5.13.0//jna-5.13.0.jar joda-time/2.12.7//joda-time-2.12.7.jar -jodd-core/3.5.2//jodd-core-3.5.2.jar jpam/1.1//jpam-1.1.jar json/1.8//json-1.8.jar json4s-ast_2.13/4.0.7//json4s-ast_2.13-4.0.7.jar diff --git a/licenses-binary/LICENSE-jodd.txt b/licenses-binary/LICENSE-jodd.txt deleted file mode 100644 index cc6b458adb38.. --- a/licenses-binary/LICENSE-jodd.txt +++ /dev/null @@ -1,24 +0,0 @@ -Copyright (c) 2003-present, Jodd Team (https://jodd.org) -All rights reserved. - -Redistribution and use in source and binary forms, with or without -modification, are permitted provided that the following conditions are met: - -1. Redistributions of source code must retain the above copyright notice, -this list of conditions and the following disclaimer. - -2. Redistributions in binary form must reproduce the above copyright -notice, this list of conditions and the following disclaimer in the -documentation and/or other materials provided with the distribution. 
- -THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" -AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE -LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -POSSIBILITY OF SUCH DAMAGE. \ No newline at end of file diff --git a/pom.xml b/pom.xml index 56a34cedde51..a98efe8aed1e 100644 --- a/pom.xml +++ b/pom.xml @@ -201,7 +201,6 @@ 3.1.9 3.0.12 2.12.7 -3.5.2 3.0.0 2.2.11 0.16.0 @@ -2783,11 +2782,6 @@ joda-time ${joda.version} - -org.jodd -jodd-core -${jodd.version} - org.datanucleus datanucleus-core diff --git a/sql/hive/pom.xml b/sql/hive/pom.xml index 3895d9dc5a63..5e9fc256e7e6 100644 --- a/sql/hive/pom.xml +++ b/sql/hive/pom.xml @@ -152,10 +152,6 @@ joda-time joda-time - - org.jodd - jodd-core - com.google.code.findbugs jsr305 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (33cac4436e59 -> 2df494fd4e4e)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 33cac4436e59 [SPARK-47847][CORE] Deprecate `spark.network.remoteReadNioBufferConversion` add 2df494fd4e4e [SPARK-48158][SQL] Add collation support for XML expressions No new revisions were added by this update. Summary of changes: .../sql/catalyst/expressions/xmlExpressions.scala | 9 +- .../spark/sql/CollationSQLExpressionsSuite.scala | 124 + 2 files changed, 129 insertions(+), 4 deletions(-)