svn commit: r69098 - /dev/spark/v4.0.0-preview1-rc1-bin/
Author: wenchen
Date: Sat May 11 04:28:26 2024
New Revision: 69098

Log:
Apache Spark v4.0.0-preview1-rc1

Added:
    dev/spark/v4.0.0-preview1-rc1-bin/
    dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz   (with props)
    dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.asc
    dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.sha512
    dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz   (with props)
    dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc
    dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512
    dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.sha512
    dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz   (with props)
    dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.asc
    dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.sha512
    dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz   (with props)
    dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.asc
    dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.sha512
    dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz   (with props)
    dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz.asc
    dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz.sha512

Added: dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz
==============================================================================
Binary file - no diff available.
Propchange: dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.asc
==============================================================================
--- dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.asc (added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.asc Sat May 11 04:28:26 2024
@@ -0,0 +1,17 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQJHBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmY+8UQTHHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/WkV1D/44BoMRwBQPQybc9ldlemMhKNQ/1OLB
+mUwhLpeUryOpUjO8AXa60YBajHqg9hivRxAUiuoaBSn7HjWY+3+nwkbcA7ZyMaV2
+Hgvfu4orB2kYXx4JgiE+dd2Zbuq+HFTv32dDUe+FyiHvhFw/bL0TIYUNJfKNcBtq
+KZDl9K5wemNjmpUSQAfEh3/vkikv5xOGxV+yEohgpB3t5Wg3hTETISXLfx/mHDu5
+GPjdCZ1omcqxZsV16CFZHV/uzK5aEDXfPdo2OO5V94xyQL0EQaMnzzMUdHkxPJ3p
+747tTf/q5rXHOb7S67MtNoBZ8myR23mQGJTwlV6E8CJWcbH7R6SEHekG9kIPGd3i
+UHoBAmroi+KfAdRej2Nqvz7SfeDeAmFw2kBRIm42FYWIqalAqbKU9LlXSpjyvYkO
+82df+5mwOzJf5VSU9D3krmjqWMFdjlLbDI1O1hLMNHyZkCYzPf+pmFhABsfGMXZH
+D8vURqF5aL9BmEuwi1SF0zSa9bI0otQj0DBvCbZnUeULSHB+P/eFqHoXjtNX2ArB
+43zmyaDywfqPXoMItvb+sGGUvatbLTCjjl6yfwgZEKOHs5noCygmL1WoLVQV+UYe
+UXb/hOJrP4FdUARpnMmz6R0NYSgQ7RZ7lOjQqs3VB7W1ashh0EWDD1hbeqMpvdx/
++fBbOLMrdzxifw==
+=2il7
+-----END PGP SIGNATURE-----

Added: dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.sha512
==============================================================================
--- dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.sha512 (added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/SparkR_4.0.0-preview1.tar.gz.sha512 Sat May 11 04:28:26 2024
@@ -0,0 +1 @@
+60c0f5348da36d3399b596648e104202b2e9925a5b52694bf83cec9be1b4e78db6b4aa7f2f9257bca74dd514dca176ab6b51ab4c0abad2b31fb3fc5b5c14 SparkR_4.0.0-preview1.tar.gz

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz
==============================================================================
Binary file - no diff available.
Propchange: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc
==============================================================================
--- dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc (added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc Sat May 11 04:28:26 2024
@@ -0,0 +1,17 @@
+-----BEGIN PGP SIGNATURE-----
+
+iQJHBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmY+8UYTHHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/WsnjD/4m0Dyb8ZcxS/JScvFxl3eg7KRWi8d8
+bGHs/pHZxdwS/HUkBRtv0w6HXJV6ZtQW1CPtbZ0VKOqElUfGPS/VaxE91I7c2Vmb
++/P2/buVX6fBlF+vIUPECyVgblnhBeZKbBb5Wcz3xpL1Jfj/6qi3o9uLnFFfy55S
+N6FWIJ5xrjl9mlo6+s4qqL/06u982NaEyUsu51eNgapTQcNUAjFKme13WC3W7n0S
+i6ixtW1oXmfY74CzSfn6KNC+5QvxKwJznS7ZxrG3g/chcaR8rApUZ526v4XL7LP0
+BDNeqCI+blAjVYFUzBIkvZp8SR/BbJv2HSySq5hbf0S6l0O+iuj8tZ/oa8Z0hCNf
+lXUw2ORG7RJKUZePdC+F+vYrmISyDRiWb4ddSUAjkzXy8KEWw6y55VULCq4vHbDc
+1Zwmf2izaujavcSJMjBnMhoZZ1PBlxgVQwHYu0Pi3qLCxyIn4oTd1wW7h6u5IGMr
++1LjMaGCrKbWSafp+cXGtzfJGjzPjCdIN2HqX6l53Vli4jn8I8yGJZs7hp+SZ281
svn commit: r69097 - /dev/spark/v4.0.0-preview1-rc1-bin/
Author: wenchen
Date: Sat May 11 03:59:33 2024
New Revision: 69097

Log:
prepare for re-uploading

Removed:
    dev/spark/v4.0.0-preview1-rc1-bin/

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
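Each binary artifact staged above ships with a detached `.asc` PGP signature and a `.sha512` checksum file, which release voters are expected to verify. A minimal, hedged sketch of the checksum side in Java (the payload here is a stand-in for a downloaded tarball; in a real check you would stream the file and compare against the first field of the matching `.sha512` file):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class Sha512Check {
    // Hex-encode a digest, matching the lowercase-hex format of the .sha512 files.
    static String hex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in bytes; a real verification would read the downloaded tarball.
        byte[] payload = "example artifact bytes".getBytes(StandardCharsets.UTF_8);
        MessageDigest md = MessageDigest.getInstance("SHA-512");
        String actual = hex(md.digest(payload));
        // A SHA-512 digest is 64 bytes, i.e. 128 hex characters.
        System.out.println(actual.length());
    }
}
```

Comparing this hex string against the recorded one (and checking the `.asc` signature with GPG) is what catches a corrupted or tampered download.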
(spark) branch master updated: [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f699f556d8a0 [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script
f699f556d8a0 is described below

commit f699f556d8a09bb755e9c8558661a36fbdb42e73
Author: panbingkun
AuthorDate: Fri May 10 19:54:29 2024 -0700

    [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script

    ### What changes were proposed in this pull request?
    The pr aims to delete the dir `dev/pr-deps` after executing `test-dependencies.sh`.

    ### Why are the changes needed?
    We'd better clean the `temporary files` generated at the end.

    Before:
    ```
    sh dev/test-dependencies.sh
    ```
    https://github.com/apache/spark/assets/15246973/39a56983-774c-4c2d-897d-26a7d0999456

    After:
    ```
    sh dev/test-dependencies.sh
    ```
    https://github.com/apache/spark/assets/15246973/f7e76e22-63cf-4411-99d0-5e844f8d5a7a

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Manually test.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #46531 from panbingkun/minor_test-dependencies.

    Authored-by: panbingkun
    Signed-off-by: Dongjoon Hyun
---
 dev/test-dependencies.sh | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/dev/test-dependencies.sh b/dev/test-dependencies.sh
index 048c59f4cec9..e645a66165a2 100755
--- a/dev/test-dependencies.sh
+++ b/dev/test-dependencies.sh
@@ -140,4 +140,8 @@ for HADOOP_HIVE_PROFILE in "${HADOOP_HIVE_PROFILES[@]}"; do
   fi
 done
 
+if [[ -d "$FWDIR/dev/pr-deps" ]]; then
+  rm -rf "$FWDIR/dev/pr-deps"
+fi
+
 exit 0

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
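The patch above applies a guard-then-delete pattern (`[[ -d ... ]]` before `rm -rf`) for a directory the script generates. The same idea sketched in Java, as an illustration only (class and directory names are made up for the example; the try/finally mirrors running cleanup even when the "script body" fails):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class CleanupTempDir {
    // Recursive delete, the Java analogue of `rm -rf "$FWDIR/dev/pr-deps"`.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) return;              // mirrors the `[[ -d ... ]]` guard
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder())   // delete children before parents
                .forEach(p -> p.toFile().delete());
        }
    }

    public static void main(String[] args) throws IOException {
        Path work = Files.createTempDirectory("pr-deps");
        try {
            // Stand-in for the dependency manifests the real script writes.
            Files.writeString(work.resolve("deps.txt"), "generated output");
        } finally {
            deleteRecursively(work);                 // always runs, even on failure
        }
        System.out.println(Files.exists(work));      // false: directory is gone
    }
}
```

In shell, the equivalent hardening would be a `trap ... EXIT`, which also covers early exits; the committed patch deletes only on the normal path.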
(spark) branch branch-3.4 updated: [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new 1e0fc1ef96aa [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script
1e0fc1ef96aa is described below

commit 1e0fc1ef96aa6f541134224f1ba626f234442e74
Author: panbingkun
AuthorDate: Fri May 10 19:54:29 2024 -0700

    [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script

    ### What changes were proposed in this pull request?
    The pr aims to delete the dir `dev/pr-deps` after executing `test-dependencies.sh`.

    ### Why are the changes needed?
    We'd better clean the `temporary files` generated at the end.

    Before:
    ```
    sh dev/test-dependencies.sh
    ```
    https://github.com/apache/spark/assets/15246973/39a56983-774c-4c2d-897d-26a7d0999456

    After:
    ```
    sh dev/test-dependencies.sh
    ```
    https://github.com/apache/spark/assets/15246973/f7e76e22-63cf-4411-99d0-5e844f8d5a7a

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Manually test.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #46531 from panbingkun/minor_test-dependencies.

    Authored-by: panbingkun
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit f699f556d8a09bb755e9c8558661a36fbdb42e73)
    Signed-off-by: Dongjoon Hyun
---
 dev/test-dependencies.sh | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/dev/test-dependencies.sh b/dev/test-dependencies.sh
index 2268a262d5f8..2907ef27189c 100755
--- a/dev/test-dependencies.sh
+++ b/dev/test-dependencies.sh
@@ -144,4 +144,8 @@ for HADOOP_HIVE_PROFILE in "${HADOOP_HIVE_PROFILES[@]}"; do
   fi
 done
 
+if [[ -d "$FWDIR/dev/pr-deps" ]]; then
+  rm -rf "$FWDIR/dev/pr-deps"
+fi
+
 exit 0

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new e9a1b4254419 [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script
e9a1b4254419 is described below

commit e9a1b4254419c751e612cd5e5c56f111b41399e7
Author: panbingkun
AuthorDate: Fri May 10 19:54:29 2024 -0700

    [SPARK-48237][BUILD] Clean up `dev/pr-deps` at the end of `test-dependencies.sh` script

    ### What changes were proposed in this pull request?
    The pr aims to delete the dir `dev/pr-deps` after executing `test-dependencies.sh`.

    ### Why are the changes needed?
    We'd better clean the `temporary files` generated at the end.

    Before:
    ```
    sh dev/test-dependencies.sh
    ```
    https://github.com/apache/spark/assets/15246973/39a56983-774c-4c2d-897d-26a7d0999456

    After:
    ```
    sh dev/test-dependencies.sh
    ```
    https://github.com/apache/spark/assets/15246973/f7e76e22-63cf-4411-99d0-5e844f8d5a7a

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Manually test.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #46531 from panbingkun/minor_test-dependencies.

    Authored-by: panbingkun
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit f699f556d8a09bb755e9c8558661a36fbdb42e73)
    Signed-off-by: Dongjoon Hyun
---
 dev/test-dependencies.sh | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/dev/test-dependencies.sh b/dev/test-dependencies.sh
index d7967ac3afa9..36cc7a4f994d 100755
--- a/dev/test-dependencies.sh
+++ b/dev/test-dependencies.sh
@@ -140,4 +140,8 @@ for HADOOP_HIVE_PROFILE in "${HADOOP_HIVE_PROFILES[@]}"; do
   fi
 done
 
+if [[ -d "$FWDIR/dev/pr-deps" ]]; then
+  rm -rf "$FWDIR/dev/pr-deps"
+fi
+
 exit 0

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48205][SQL][FOLLOWUP] Add missing tags for the dataSource API
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new d82458f15539 [SPARK-48205][SQL][FOLLOWUP] Add missing tags for the dataSource API
d82458f15539 is described below

commit d82458f15539eef8df320345a7c2382ca4d5be8a
Author: allisonwang-db
AuthorDate: Fri May 10 16:31:47 2024 -0700

    [SPARK-48205][SQL][FOLLOWUP] Add missing tags for the dataSource API

    ### What changes were proposed in this pull request?
    This is a follow-up PR for https://github.com/apache/spark/pull/46487 to add missing tags for the `dataSource` API.

    ### Why are the changes needed?
    To address comments from a previous PR.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    Existing test

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #46530 from allisonwang-db/spark-48205-followup.

    Authored-by: allisonwang-db
    Signed-off-by: Dongjoon Hyun
---
 sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
index d5de74455dce..466e4cf81318 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
@@ -233,7 +233,11 @@ class SparkSession private(
 
   /**
    * A collection of methods for registering user-defined data sources.
+   *
+   * @since 4.0.0
    */
+  @Experimental
+  @Unstable
   def dataSource: DataSourceRegistration = sessionState.dataSourceRegistration
 
   /**

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48236][BUILD] Add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5b3b8a90638c [SPARK-48236][BUILD] Add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars
5b3b8a90638c is described below

commit 5b3b8a90638c49fc7ddcace69a85989c1053f1ab
Author: Dongjoon Hyun
AuthorDate: Fri May 10 15:48:08 2024 -0700

    [SPARK-48236][BUILD] Add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars

    ### What changes were proposed in this pull request?
    This PR aims to add `commons-lang:commons-lang:2.6` back to support legacy Hive UDF jars. This is a partial revert of SPARK-47018.

    ### Why are the changes needed?
    Recently, we dropped `commons-lang:commons-lang` during the Hive upgrade.
    - #46468

    However, only Apache Hive 2.3.10 and 4.0.0 dropped it. In other words, Hive 2.0.0 ~ 2.3.9 and Hive 3.0.0 ~ 3.1.3 still require it. As a result, all existing UDF jars built against those versions still require `commons-lang:commons-lang`.
    - https://github.com/apache/hive/pull/4892

    For example, Apache Hive 3.1.3 code:
    - https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFTrim.java#L21
    ```
    import org.apache.commons.lang.StringUtils;
    ```
    - https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFTrim.java#L42
    ```
    return StringUtils.strip(val, " ");
    ```

    As a result, Maven CIs are broken.
    - https://github.com/apache/spark/actions/runs/9032639456/job/24825599546 (Maven / Java 17)
    - https://github.com/apache/spark/actions/runs/9033374547/job/24835284769 (Maven / Java 21)

    The root cause is that the existing test UDF jar `hive-test-udfs.jar` was built from old Hive (before 2.3.10) libraries which require `commons-lang:commons-lang:2.6`.

    ```
    HiveUDFDynamicLoadSuite:
    - Spark should be able to run Hive UDF using jar regardless of current thread context classloader (UDF
    20:21:25.129 WARN org.apache.spark.SparkContext: The JAR file:///home/runner/work/spark/spark/sql/hive/src/test/noclasspath/hive-test-udfs.jar at spark://localhost:33327/jars/hive-test-udfs.jar has been added already. Overwriting of added jar is not supported in the current version.
    *** RUN ABORTED ***
      A needed class was not found. This could be due to an error in your runpath. Missing class: org/apache/commons/lang/StringUtils
      java.lang.NoClassDefFoundError: org/apache/commons/lang/StringUtils
      at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
      at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
      at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
      at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
      at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
      at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
      at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
      at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:184)
      at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:164)
      at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:185)
      ...
    Cause: java.lang.ClassNotFoundException: org.apache.commons.lang.StringUtils
      at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
      at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:593)
      at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:526)
      at org.apache.hadoop.hive.contrib.udf.example.GenericUDFTrim2.performOp(GenericUDFTrim2.java:43)
      at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBaseTrim.evaluate(GenericUDFBaseTrim.java:75)
      at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:170)
      at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
      at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
      at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
      at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
      ...
    ```

    ### Does this PR introduce _any_ user-facing change?
    To support the existing customer
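The failure mode above is a class missing at link time: the UDF jar references `org.apache.commons.lang.StringUtils`, and the JVM throws `NoClassDefFoundError` the first time a method that uses it is executed. A hedged sketch of probing for such a legacy dependency up front with `Class.forName` (the probe itself is illustrative, not Spark's approach — the committed fix simply ships the jar again):

```java
public class ClassProbe {
    // Returns true if the named class can be resolved by the given loader.
    // `initialize = false` avoids running static initializers during the probe.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ClassProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Always present on any JVM:
        System.out.println(isPresent("java.lang.String"));
        // Present only when commons-lang 2.x is on the classpath, which is
        // exactly what the broken Maven CI run above was missing:
        System.out.println(isPresent("org.apache.commons.lang.StringUtils"));
    }
}
```

Such a probe lets a plugin host fail fast with a clear message ("add commons-lang 2.6 to the classpath") instead of aborting mid-query as in the log above.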
(spark) branch master updated: Revert "[SPARK-48230][BUILD] Remove unused `jodd-core`"
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 726ef8aa66ea Revert "[SPARK-48230][BUILD] Remove unused `jodd-core`"
726ef8aa66ea is described below

commit 726ef8aa66ea6e56b739f3b16f99e457a0febb81
Author: Dongjoon Hyun
AuthorDate: Fri May 10 15:34:12 2024 -0700

    Revert "[SPARK-48230][BUILD] Remove unused `jodd-core`"

    This reverts commit d8151186d79459fbde27a01bd97328e73548c55a.
---
 LICENSE-binary                        |  1 +
 dev/deps/spark-deps-hadoop-3-hive-2.3 |  1 +
 licenses-binary/LICENSE-jodd.txt      | 24 ++++++++++++++++++++
 pom.xml                               |  6 ++++++
 sql/hive/pom.xml                      |  4 ++++
 5 files changed, 36 insertions(+)

diff --git a/LICENSE-binary b/LICENSE-binary
index 034215f0ab15..40271c9924bc 100644
--- a/LICENSE-binary
+++ b/LICENSE-binary
@@ -436,6 +436,7 @@ com.esotericsoftware:reflectasm
 org.codehaus.janino:commons-compiler
 org.codehaus.janino:janino
 jline:jline
+org.jodd:jodd-core
 com.github.wendykierp:JTransforms
 pl.edu.icm:JLargeArrays

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 29997815e5bc..392bacd73277 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -143,6 +143,7 @@ jline/2.14.6//jline-2.14.6.jar
 jline/3.24.1//jline-3.24.1.jar
 jna/5.13.0//jna-5.13.0.jar
 joda-time/2.12.7//joda-time-2.12.7.jar
+jodd-core/3.5.2//jodd-core-3.5.2.jar
 jpam/1.1//jpam-1.1.jar
 json/1.8//json-1.8.jar
 json4s-ast_2.13/4.0.7//json4s-ast_2.13-4.0.7.jar

diff --git a/licenses-binary/LICENSE-jodd.txt b/licenses-binary/LICENSE-jodd.txt
new file mode 100644
index ..cc6b458adb38
--- /dev/null
+++ b/licenses-binary/LICENSE-jodd.txt
@@ -0,0 +1,24 @@
+Copyright (c) 2003-present, Jodd Team (https://jodd.org)
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice,
+this list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright
+notice, this list of conditions and the following disclaimer in the
+documentation and/or other materials provided with the distribution.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGE.
\ No newline at end of file

diff --git a/pom.xml b/pom.xml
index a98efe8aed1e..56a34cedde51 100644
--- a/pom.xml
+++ b/pom.xml
@@ -201,6 +201,7 @@
 3.1.9
 3.0.12
 2.12.7
+3.5.2
 3.0.0
 2.2.11
 0.16.0
@@ -2782,6 +2783,11 @@
 joda-time
 ${joda.version}
+
+org.jodd
+jodd-core
+${jodd.version}
+
 org.datanucleus
 datanucleus-core

diff --git a/sql/hive/pom.xml b/sql/hive/pom.xml
index 5e9fc256e7e6..3895d9dc5a63 100644
--- a/sql/hive/pom.xml
+++ b/sql/hive/pom.xml
@@ -152,6 +152,10 @@
   joda-time
   joda-time
+
+  org.jodd
+  jodd-core
+
 com.google.code.findbugs
 jsr305

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (a6632ffa16f6 -> 2225aa1dab0f)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from a6632ffa16f6 [SPARK-48143][SQL] Use lightweight exceptions for control-flow between UnivocityParser and FailureSafeParser
     add 2225aa1dab0f [SPARK-48144][SQL] Fix `canPlanAsBroadcastHashJoin` to respect shuffle join hints

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/optimizer/joins.scala   | 38 ++
 .../spark/sql/execution/SparkStrategies.scala  | 17 --
 .../scala/org/apache/spark/sql/JoinSuite.scala | 26 +--
 3 files changed, 55 insertions(+), 26 deletions(-)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r69092 - in /dev/spark/v4.0.0-preview1-rc1-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/R/articles/ _site/api/R/articles/sparkr-vignettes_files/ _site/api/R/articles/sparkr-vignettes_
Author: wenchen
Date: Fri May 10 16:44:08 2024
New Revision: 69092

Log:
Apache Spark v4.0.0-preview1-rc1 docs

[This commit notification would consist of 4810 parts,
which exceeds the limit of 50 ones, so it was shortened to the summary.]

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48143][SQL] Use lightweight exceptions for control-flow between UnivocityParser and FailureSafeParser
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new a6632ffa16f6 [SPARK-48143][SQL] Use lightweight exceptions for control-flow between UnivocityParser and FailureSafeParser
a6632ffa16f6 is described below

commit a6632ffa16f6907eba96e745920d571924bf4b63
Author: Vladimir Golubev
AuthorDate: Sat May 11 00:37:54 2024 +0800

    [SPARK-48143][SQL] Use lightweight exceptions for control-flow between UnivocityParser and FailureSafeParser

    ### What changes were proposed in this pull request?
    A new lightweight exception for control-flow between UnivocityParser and FailureSafeParser to speed up malformed CSV parsing. This is a different way to implement these reverted changes: https://github.com/apache/spark/pull/46478

    The previous implementation was more invasive - removing `cause` from `BadRecordException` could break upper code, which unwraps errors and checks the types of the causes. This implementation only touches `FailureSafeParser` and `UnivocityParser`, since in the codebase they are always used together, unlike `JacksonParser` and `StaxXmlParser`. Removing the stacktrace from `BadRecordException` is safe, since the cause itself has an adequate stacktrace (except pure control-flow cases).

    ### Why are the changes needed?
    Parsing in `PermissiveMode` is slow due to heavy exception construction (stacktrace filling + string template substitution in `SparkRuntimeException`).

    ### Does this PR introduce _any_ user-facing change?
    No, since `FailureSafeParser` unwraps `BadRecordException` and correctly rethrows user-facing exceptions in `FailFastMode`.

    ### How was this patch tested?
    - `testOnly org.apache.spark.sql.catalyst.csv.UnivocityParserSuite`
    - Manually run csv benchmark
    - Manually checked correct and malformed csv in spark-shell (org.apache.spark.SparkException is thrown with the stacktrace)

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #46500 from vladimirg-db/vladimirg-db/use-special-lighweight-exception-for-control-flow-between-univocity-parser-and-failure-safe-parser.

    Authored-by: Vladimir Golubev
    Signed-off-by: Wenchen Fan
---
 .../spark/sql/catalyst/csv/UnivocityParser.scala |  5 +++--
 .../sql/catalyst/util/BadRecordException.scala   | 22 +++---
 .../sql/catalyst/util/FailureSafeParser.scala    | 11 +--
 3 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
index a5158d8a22c6..4d95097e1681 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
@@ -316,7 +316,7 @@ class UnivocityParser(
       throw BadRecordException(
         () => getCurrentInput,
         () => Array.empty,
-        QueryExecutionErrors.malformedCSVRecordError(""))
+        LazyBadRecordCauseWrapper(() => QueryExecutionErrors.malformedCSVRecordError("")))
     }
     val currentInput = getCurrentInput
@@ -326,7 +326,8 @@
       // However, we still have chance to parse some of the tokens. It continues to parses the
       // tokens normally and sets null when `ArrayIndexOutOfBoundsException` occurs for missing
       // tokens.
-      Some(QueryExecutionErrors.malformedCSVRecordError(currentInput.toString))
+      Some(LazyBadRecordCauseWrapper(
+        () => QueryExecutionErrors.malformedCSVRecordError(currentInput.toString)))
     } else None
     // When the length of the returned tokens is identical to the length of the parsed schema,
     // we just need to:
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala
index 65a56c1064e4..654b0b8c73e5 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala
@@ -67,16 +67,32 @@ case class PartialResultArrayException(
   extends Exception(cause)
 
 /**
- * Exception thrown when the underlying parser meet a bad record and can't parse it.
+ * Exception thrown when the underlying parser met a bad record and can't parse it.
+ * The stacktrace is not collected for better preformance, and thus, this exception should
+ * not be used in a user-facing context.
  * @param record a function to return the record that cause the parser to fail
 * @param partialResults a function that returns an row
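The commit's core idea is that on the JVM, most of the cost of `throw` is capturing the stack trace at construction time; an exception used purely for control flow can skip that. The Scala patch does this for `BadRecordException` (plus lazily building the cause); a minimal Java illustration of the underlying JVM mechanism, using the protected `Throwable` constructor that disables stack-trace capture:

```java
public class StacklessDemo {
    // A control-flow exception that never fills in a stack trace.
    // The four-argument Throwable constructor lets subclasses disable both
    // suppression and the writable stack trace.
    static final class BadRecord extends RuntimeException {
        BadRecord(String msg) {
            super(msg, /*cause*/ null,
                  /*enableSuppression*/ false,
                  /*writableStackTrace*/ false);
        }
    }

    public static void main(String[] args) {
        try {
            throw new BadRecord("malformed CSV record");
        } catch (BadRecord e) {
            // No trace was captured, so the array is empty:
            System.out.println(e.getStackTrace().length);
            System.out.println(e.getMessage());
        }
    }
}
```

This is why such exceptions must stay internal, exactly as the patched scaladoc warns: if one escapes to the user, there is no stack trace to debug with. (An equivalent Scala idiom is overriding `fillInStackTrace()` to return `this`.)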
(spark) branch master updated: [SPARK-47793][TEST][FOLLOWUP] Fix flaky test for Python data source exactly once
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5beaf85cd5ef [SPARK-47793][TEST][FOLLOWUP] Fix flaky test for Python data source exactly once
5beaf85cd5ef is described below

commit 5beaf85cd5ef2b84a67ebce712e8d73d1e7d41ff
Author: Chaoqin Li
AuthorDate: Fri May 10 08:24:42 2024 -0700

    [SPARK-47793][TEST][FOLLOWUP] Fix flaky test for Python data source exactly once

    ### What changes were proposed in this pull request?
    Fix the flakiness in the Python streaming source exactly-once test. The last executed batch may not be recorded in query progress, which causes the expected rows not to match. This fix takes the uncompleted batch into account and relaxes the condition.

    ### Why are the changes needed?
    Fix flaky test.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Test change.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #46481 from chaoqin-li1123/fix_python_ds_test.

    Authored-by: Chaoqin Li
    Signed-off-by: Dongjoon Hyun
---
 .../execution/python/PythonStreamingDataSourceSuite.scala | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala
index 97e6467c3eaf..d1f7c597b308 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonStreamingDataSourceSuite.scala
@@ -299,7 +299,7 @@ class PythonStreamingDataSourceSuite extends PythonDataSourceSuiteBase {
     val checkpointDir = new File(path, "checkpoint")
     val outputDir = new File(path, "output")
     val df = spark.readStream.format(dataSourceName).load()
-    var lastBatch = 0
+    var lastBatchId = 0
     // Restart streaming query multiple times to verify exactly once guarantee.
     for (i <- 1 to 5) {
@@ -323,11 +323,15 @@
       }
       q.stop()
       q.awaitTermination()
-      lastBatch = q.lastProgress.batchId.toInt
+      lastBatchId = q.lastProgress.batchId.toInt
     }
-    assert(lastBatch > 20)
+    assert(lastBatchId > 20)
+    val rowCount = spark.read.format("json").load(outputDir.getAbsolutePath).count()
+    // There may be one uncommitted batch that is not recorded in query progress.
+    // The number of batch can be lastBatchId + 1 or lastBatchId + 2.
+    assert(rowCount == 2 * (lastBatchId + 1) || rowCount == 2 * (lastBatchId + 2))
     checkAnswer(spark.read.format("json").load(outputDir.getAbsolutePath),
-      (0 to 2 * lastBatch + 1).map(Row(_)))
+      (0 until rowCount.toInt).map(Row(_)))
   }
 }

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
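The relaxed assertion rests on a small piece of arithmetic: each micro-batch writes two rows, and after `q.stop()` one extra batch may have committed to the sink without ever appearing in the query progress. So for a last *reported* batch id `N`, the sink can legitimately hold rows for `N + 1` or `N + 2` completed batches. A sketch of that acceptance check in Java (method and class names are made up for the example):

```java
public class BatchCountCheck {
    // Two rows per micro-batch; one trailing batch may be committed to the
    // sink but missing from the reported progress, so two totals are valid.
    static boolean plausibleRowCount(int lastBatchId, long rowCount) {
        return rowCount == 2L * (lastBatchId + 1)
            || rowCount == 2L * (lastBatchId + 2);
    }

    public static void main(String[] args) {
        System.out.println(plausibleRowCount(20, 42)); // 2 * (20 + 1) -> true
        System.out.println(plausibleRowCount(20, 44)); // 2 * (20 + 2) -> true
        System.out.println(plausibleRowCount(20, 46)); // too many rows -> false
    }
}
```

Deriving the expected answer from the *observed* row count (as the patched `checkAnswer` does) rather than from the reported batch id is what removes the race from the test.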
(spark) branch master updated: [SPARK-47441][YARN] Do not add log link for unmanaged AM in Spark UI
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c5b6ec734bd0 [SPARK-47441][YARN] Do not add log link for unmanaged AM in Spark UI c5b6ec734bd0 is described below commit c5b6ec734bd0c47551b59f9de13c6323b80974b2 Author: Yuming Wang AuthorDate: Fri May 10 08:22:03 2024 -0700 [SPARK-47441][YARN] Do not add log link for unmanaged AM in Spark UI ### What changes were proposed in this pull request? This PR makes it do not add log link for unmanaged AM in Spark UI. ### Why are the changes needed? Avoid start driver error messages: ``` 24/03/18 04:58:25,022 ERROR [spark-listener-group-appStatus] scheduler.AsyncEventQueue:97 : Listener AppStatusListener threw an exception java.lang.NumberFormatException: For input string: "null" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:67) ~[?:?] at java.lang.Integer.parseInt(Integer.java:668) ~[?:?] at java.lang.Integer.parseInt(Integer.java:786) ~[?:?] at scala.collection.immutable.StringLike.toInt(StringLike.scala:310) ~[scala-library-2.12.18.jar:?] at scala.collection.immutable.StringLike.toInt$(StringLike.scala:310) ~[scala-library-2.12.18.jar:?] at scala.collection.immutable.StringOps.toInt(StringOps.scala:33) ~[scala-library-2.12.18.jar:?] 
at org.apache.spark.util.Utils$.parseHostPort(Utils.scala:1105) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.ProcessSummaryWrapper.(storeTypes.scala:609) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.LiveMiscellaneousProcess.doUpdate(LiveEntity.scala:1045) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1233) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.AppStatusListener.onMiscellaneousProcessAdded(AppStatusListener.scala:1445) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.status.AppStatusListener.onOtherEvent(AppStatusListener.scala:113) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) ~[spark-core_2.12-3.5.1.jar:3.5.1] at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) ~[scala-library-2.12.18.jar:?] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) ~[scala-library-2.12.18.jar:?] 
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) ~[spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1356) [spark-core_2.12-3.5.1.jar:3.5.1] at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) [spark-core_2.12-3.5.1.jar:3.5.1] ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual testing: ```shell bin/spark-sql --master yarn --conf spark.yarn.unmanagedAM.enabled=true ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45565 from wangyum/SPARK-47441. Authored-by: Yuming Wang Signed-off-by: Dongjoon Hyun --- .../main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git
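The failure in the stack trace above is a plain parse error: for an unmanaged AM the node HTTP address is absent, so the log URL carries the literal string "null" where a port number is expected. A rough Python analogue of `Utils.parseHostPort` (illustrative only; the real helper is Scala and handles more edge cases):

```python
def parse_host_port(host_port: str):
    # Split on the last ':' and parse the remainder as a port, mirroring
    # the Scala helper's toInt call. With an unmanaged AM the port field
    # is the string "null", so int() raises a ValueError, just as the
    # JVM threw NumberFormatException in the listener above.
    host, sep, port = host_port.rpartition(":")
    if not sep:
        return host_port, -1  # no port present
    return host, int(port)

parse_host_port("node1.example.com:8042")  # → ('node1.example.com', 8042)
```

Skipping the log link entirely for the unmanaged AM avoids ever handing such a malformed host:port string to the status listener.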
(spark) branch master updated: [SPARK-48235][SQL] Directly pass join instead of all arguments to getBroadcastBuildSide and getShuffleHashJoinBuildSide
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 73bb619d45b2 [SPARK-48235][SQL] Directly pass join instead of all arguments to getBroadcastBuildSide and getShuffleHashJoinBuildSide 73bb619d45b2 is described below commit 73bb619d45b2d0699ca4a9d251eea57c359f275b Author: fred-db AuthorDate: Fri May 10 07:45:28 2024 -0700 [SPARK-48235][SQL] Directly pass join instead of all arguments to getBroadcastBuildSide and getShuffleHashJoinBuildSide ### What changes were proposed in this pull request? * Refactor getBroadcastBuildSide and getShuffleHashJoinBuildSide to pass the join as an argument instead of all member variables of the join separately. ### Why are the changes needed? * Makes the code easier to read. ### Does this PR introduce _any_ user-facing change? * no ### How was this patch tested? * Existing UTs ### Was this patch authored or co-authored using generative AI tooling? * No Closes #46525 from fred-db/parameter-change.
Authored-by: fred-db Signed-off-by: Dongjoon Hyun --- .../spark/sql/catalyst/optimizer/joins.scala | 56 +--- .../optimizer/JoinSelectionHelperSuite.scala | 59 +- .../spark/sql/execution/SparkStrategies.scala | 6 +-- 3 files changed, 40 insertions(+), 81 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala index 2b4ee033b088..5571178832db 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala @@ -289,58 +289,52 @@ case object BuildLeft extends BuildSide trait JoinSelectionHelper { def getBroadcastBuildSide( - left: LogicalPlan, - right: LogicalPlan, - joinType: JoinType, - hint: JoinHint, + join: Join, hintOnly: Boolean, conf: SQLConf): Option[BuildSide] = { val buildLeft = if (hintOnly) { - hintToBroadcastLeft(hint) + hintToBroadcastLeft(join.hint) } else { - canBroadcastBySize(left, conf) && !hintToNotBroadcastLeft(hint) + canBroadcastBySize(join.left, conf) && !hintToNotBroadcastLeft(join.hint) } val buildRight = if (hintOnly) { - hintToBroadcastRight(hint) + hintToBroadcastRight(join.hint) } else { - canBroadcastBySize(right, conf) && !hintToNotBroadcastRight(hint) + canBroadcastBySize(join.right, conf) && !hintToNotBroadcastRight(join.hint) } getBuildSide( - canBuildBroadcastLeft(joinType) && buildLeft, - canBuildBroadcastRight(joinType) && buildRight, - left, - right + canBuildBroadcastLeft(join.joinType) && buildLeft, + canBuildBroadcastRight(join.joinType) && buildRight, + join.left, + join.right ) } def getShuffleHashJoinBuildSide( - left: LogicalPlan, - right: LogicalPlan, - joinType: JoinType, - hint: JoinHint, + join: Join, hintOnly: Boolean, conf: SQLConf): Option[BuildSide] = { val buildLeft = if (hintOnly) { - hintToShuffleHashJoinLeft(hint) + hintToShuffleHashJoinLeft(join.hint) } else { - 
hintToPreferShuffleHashJoinLeft(hint) || -(!conf.preferSortMergeJoin && canBuildLocalHashMapBySize(left, conf) && - muchSmaller(left, right, conf)) || + hintToPreferShuffleHashJoinLeft(join.hint) || +(!conf.preferSortMergeJoin && canBuildLocalHashMapBySize(join.left, conf) && + muchSmaller(join.left, join.right, conf)) || forceApplyShuffledHashJoin(conf) } val buildRight = if (hintOnly) { - hintToShuffleHashJoinRight(hint) + hintToShuffleHashJoinRight(join.hint) } else { - hintToPreferShuffleHashJoinRight(hint) || -(!conf.preferSortMergeJoin && canBuildLocalHashMapBySize(right, conf) && - muchSmaller(right, left, conf)) || + hintToPreferShuffleHashJoinRight(join.hint) || +(!conf.preferSortMergeJoin && canBuildLocalHashMapBySize(join.right, conf) && + muchSmaller(join.right, join.left, conf)) || forceApplyShuffledHashJoin(conf) } getBuildSide( - canBuildShuffledHashJoinLeft(joinType) && buildLeft, - canBuildShuffledHashJoinRight(joinType) && buildRight, - left, - right + canBuildShuffledHashJoinLeft(join.joinType) && buildLeft, + canBuildShuffledHashJoinRight(join.joinType) && buildRight, + join.left, + join.right ) } @@ -401,10 +395,8 @@ trait JoinSelectionHelper { } def canPlanAsBroadcastHashJoin(join: Join, conf: SQLConf):
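The refactor in the diff above is the classic "introduce parameter object" move: instead of threading four fields of the join through every helper, the helpers take the join node itself and read its members. A language-neutral sketch in Python (names hypothetical, not the Spark API):

```python
from dataclasses import dataclass

@dataclass
class Join:
    left: str
    right: str
    join_type: str
    hint: str

# Before: callers must destructure the node at every call site, and the
# signature grows whenever a helper needs another field.
def build_side_v1(left, right, join_type, hint):
    return f"{join_type}:{left}->{right} ({hint})"

# After: pass the node; the signature stays stable as helpers evolve.
def build_side_v2(join: Join):
    return f"{join.join_type}:{join.left}->{join.right} ({join.hint})"

j = Join("a", "b", "inner", "broadcast")
assert build_side_v1(j.left, j.right, j.join_type, j.hint) == build_side_v2(j)
```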
(spark) branch master updated: [SPARK-48146][SQL] Fix aggregate function in With expression child assertion
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7ef0440ef221 [SPARK-48146][SQL] Fix aggregate function in With expression child assertion 7ef0440ef221 is described below commit 7ef0440ef22161a6160f7b9000c70b26c84eecf7 Author: Kelvin Jiang AuthorDate: Fri May 10 22:39:15 2024 +0800 [SPARK-48146][SQL] Fix aggregate function in With expression child assertion ### What changes were proposed in this pull request? In https://github.com/apache/spark/pull/46034, there was a complicated edge case where common expression references in aggregate functions in the child of a `With` expression could become dangling. An assertion was added to prevent that case from happening, but the assertion wasn't fully accurate, as a query like: ``` select id between max(if(id between 1 and 2, 2, 1)) over () and id from range(10) ``` would fail the assertion. This PR fixes the assertion to be more accurate. ### Why are the changes needed? This addresses a regression in https://github.com/apache/spark/pull/46034. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46443 from kelvinjian-db/SPARK-48146-agg.
Authored-by: Kelvin Jiang Signed-off-by: Wenchen Fan --- .../spark/sql/catalyst/expressions/With.scala | 26 + .../optimizer/RewriteWithExpressionSuite.scala | 27 +- 2 files changed, 48 insertions(+), 5 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala index 14deedd9c70f..29794b33641c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/With.scala @@ -17,7 +17,8 @@ package org.apache.spark.sql.catalyst.expressions -import org.apache.spark.sql.catalyst.trees.TreePattern.{AGGREGATE_EXPRESSION, COMMON_EXPR_REF, TreePattern, WITH_EXPRESSION} +import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression +import org.apache.spark.sql.catalyst.trees.TreePattern.{COMMON_EXPR_REF, TreePattern, WITH_EXPRESSION} import org.apache.spark.sql.types.DataType /** @@ -27,9 +28,11 @@ import org.apache.spark.sql.types.DataType */ case class With(child: Expression, defs: Seq[CommonExpressionDef]) extends Expression with Unevaluable { - // We do not allow With to be created with an AggregateExpression in the child, as this would - // create a dangling CommonExpressionRef after rewriting it in RewriteWithExpression. - assert(!child.containsPattern(AGGREGATE_EXPRESSION)) + // We do not allow creating a With expression with an AggregateExpression that contains a + // reference to a common expression defined in that scope (note that it can contain another With + // expression with a common expression ref of the inner With). This is to prevent the creation of + // a dangling CommonExpressionRef after rewriting it in RewriteWithExpression. 
+ assert(!With.childContainsUnsupportedAggExpr(this)) override val nodePatterns: Seq[TreePattern] = Seq(WITH_EXPRESSION) override def dataType: DataType = child.dataType @@ -92,6 +95,21 @@ object With { val commonExprRefs = commonExprDefs.map(new CommonExpressionRef(_)) With(replaced(commonExprRefs), commonExprDefs) } + + private[sql] def childContainsUnsupportedAggExpr(withExpr: With): Boolean = { +lazy val commonExprIds = withExpr.defs.map(_.id).toSet +withExpr.child.exists { + case agg: AggregateExpression => +// Check that the aggregate expression does not contain a reference to a common expression +// in the outer With expression (it is ok if it contains a reference to a common expression +// for a nested With expression). +agg.exists { + case r: CommonExpressionRef => commonExprIds.contains(r.id) + case _ => false +} + case _ => false +} + } } case class CommonExpressionId(id: Long = CommonExpressionId.newId, canonicalized: Boolean = false) { diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala index d482b18d9331..8f023fa4156b 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala +++
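The refined assertion's logic can be restated over a toy expression tree: an aggregate under the `With` child may not reference a common-expression id defined by *this* `With`, while references owned by a nested `With` (which carry different ids) are fine. A minimal Python restatement (dicts standing in for Catalyst expressions; not the actual Spark code):

```python
def exists(node, pred):
    # Depth-first search over the tree, like TreeNode.exists in Catalyst.
    return pred(node) or any(exists(c, pred) for c in node.get("children", ()))

def child_contains_unsupported_agg(def_ids, child):
    # Reject only aggregates that reference a common-expression id defined
    # by the enclosing With; ids minted by a nested With are not in def_ids.
    def bad_agg(n):
        return n.get("kind") == "agg" and exists(
            n, lambda m: m.get("kind") == "ref" and m.get("id") in def_ids)
    return exists(child, bad_agg)

# A ref to id 1 inside an aggregate is rejected when this With defines id 1...
agg = {"kind": "agg", "children": [{"kind": "ref", "id": 1}]}
assert child_contains_unsupported_agg({1}, agg)
# ...but a ref minted by a nested With (id 2 here) is allowed.
assert not child_contains_unsupported_agg(
    {1}, {"kind": "agg", "children": [{"kind": "ref", "id": 2}]})
```

The window-function query in the commit message passes the new check because its aggregate references no common expression of the enclosing `With`.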
(spark) branch master updated: [SPARK-48228][PYTHON][CONNECT][FOLLOWUP] Also apply `_validate_pandas_udf` in MapInXXX
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 259760a5c5e2 [SPARK-48228][PYTHON][CONNECT][FOLLOWUP] Also apply `_validate_pandas_udf` in MapInXXX 259760a5c5e2 is described below commit 259760a5c5e26e33b2ee46282aeb63e4ea701020 Author: Ruifeng Zheng AuthorDate: Fri May 10 18:44:53 2024 +0800 [SPARK-48228][PYTHON][CONNECT][FOLLOWUP] Also apply `_validate_pandas_udf` in MapInXXX ### What changes were proposed in this pull request? Also apply `_validate_pandas_udf` in MapInXXX ### Why are the changes needed? to make sure validation in `pandas_udf` is also applied in MapInXXX ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #46524 from zhengruifeng/missing_check_map_in_xxx. Authored-by: Ruifeng Zheng Signed-off-by: Ruifeng Zheng --- python/pyspark/sql/connect/dataframe.py | 2 ++ 1 file changed, 2 insertions(+) diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py index 3c9415adec2d..ccaaa15f3190 100644 --- a/python/pyspark/sql/connect/dataframe.py +++ b/python/pyspark/sql/connect/dataframe.py @@ -83,6 +83,7 @@ from pyspark.sql.connect.expressions import ( ) from pyspark.sql.connect.functions import builtin as F from pyspark.sql.pandas.types import from_arrow_schema +from pyspark.sql.pandas.functions import _validate_pandas_udf # type: ignore[attr-defined] if TYPE_CHECKING: @@ -1997,6 +1998,7 @@ class DataFrame(ParentDataFrame): ) -> ParentDataFrame: from pyspark.sql.connect.udf import UserDefinedFunction +_validate_pandas_udf(func, evalType) udf_obj = UserDefinedFunction( func, returnType=schema, - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: 
commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48232][PYTHON][TESTS] Fix 'pyspark.sql.tests.connect.test_connect_session' in Python 3.12 build
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 256a23883d90 [SPARK-48232][PYTHON][TESTS] Fix 'pyspark.sql.tests.connect.test_connect_session' in Python 3.12 build 256a23883d90 is described below commit 256a23883d901c78cf82b4c52e3373322309b8d1 Author: Hyukjin Kwon AuthorDate: Fri May 10 17:12:37 2024 +0900 [SPARK-48232][PYTHON][TESTS] Fix 'pyspark.sql.tests.connect.test_connect_session' in Python 3.12 build ### What changes were proposed in this pull request? This PR avoids importing `scipy.sparse` directly, which hangs nondeterministically, specifically with Python 3.12. ### Why are the changes needed? To fix the build with Python 3.12 https://github.com/apache/spark/actions/runs/9022174253/job/24804919747 I was able to reproduce this locally, though it is somewhat nondeterministic. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested locally. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46522 from HyukjinKwon/SPARK-48232. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- python/pyspark/testing/utils.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python/pyspark/testing/utils.py b/python/pyspark/testing/utils.py index fe25136864ee..8a7aa405e4ac 100644 --- a/python/pyspark/testing/utils.py +++ b/python/pyspark/testing/utils.py @@ -38,7 +38,7 @@ from itertools import zip_longest have_scipy = False have_numpy = False try: -import scipy.sparse # noqa: F401 +import scipy # noqa: F401 have_scipy = True except ImportError:
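The fix is the standard guarded-optional-import pattern, with the import moved from the submodule to the top-level package. The resulting shape of the helper, adapted from the diff above:

```python
have_scipy = False
try:
    # Import only the top-level package. The previous
    # 'import scipy.sparse' loaded the submodule eagerly at test-collection
    # time and could hang under the Python 3.12 CI environment.
    import scipy  # noqa: F401
    have_scipy = True
except ImportError:
    # scipy is an optional test dependency; tests needing it are skipped.
    pass
```

Tests that actually need `scipy.sparse` can still import it lazily inside the test body, after the `have_scipy` gate has passed.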
(spark) branch master updated: [SPARK-48230][BUILD] Remove unused `jodd-core`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d8151186d794 [SPARK-48230][BUILD] Remove unused `jodd-core` d8151186d794 is described below commit d8151186d79459fbde27a01bd97328e73548c55a Author: Cheng Pan AuthorDate: Fri May 10 01:09:01 2024 -0700 [SPARK-48230][BUILD] Remove unused `jodd-core` ### What changes were proposed in this pull request? Remove a jar that has CVE https://github.com/advisories/GHSA-jrg3-qq99-35g7 ### Why are the changes needed? Previously, `jodd-core` came from Hive transitive deps, while https://github.com/apache/hive/pull/5151 (Hive 2.3.10) cut it out, so we can remove it from Spark now. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46520 from pan3793/SPARK-48230. 
Authored-by: Cheng Pan Signed-off-by: Dongjoon Hyun --- LICENSE-binary| 1 - dev/deps/spark-deps-hadoop-3-hive-2.3 | 1 - licenses-binary/LICENSE-jodd.txt | 24 pom.xml | 6 -- sql/hive/pom.xml | 4 5 files changed, 36 deletions(-) diff --git a/LICENSE-binary b/LICENSE-binary index 40271c9924bc..034215f0ab15 100644 --- a/LICENSE-binary +++ b/LICENSE-binary @@ -436,7 +436,6 @@ com.esotericsoftware:reflectasm org.codehaus.janino:commons-compiler org.codehaus.janino:janino jline:jline -org.jodd:jodd-core com.github.wendykierp:JTransforms pl.edu.icm:JLargeArrays diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index 392bacd73277..29997815e5bc 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -143,7 +143,6 @@ jline/2.14.6//jline-2.14.6.jar jline/3.24.1//jline-3.24.1.jar jna/5.13.0//jna-5.13.0.jar joda-time/2.12.7//joda-time-2.12.7.jar -jodd-core/3.5.2//jodd-core-3.5.2.jar jpam/1.1//jpam-1.1.jar json/1.8//json-1.8.jar json4s-ast_2.13/4.0.7//json4s-ast_2.13-4.0.7.jar diff --git a/licenses-binary/LICENSE-jodd.txt b/licenses-binary/LICENSE-jodd.txt deleted file mode 100644 index cc6b458adb38.. --- a/licenses-binary/LICENSE-jodd.txt +++ /dev/null @@ -1,24 +0,0 @@ -Copyright (c) 2003-present, Jodd Team (https://jodd.org) -All rights reserved. - -Redistribution and use in source and binary forms, with or without -modification, are permitted provided that the following conditions are met: - -1. Redistributions of source code must retain the above copyright notice, -this list of conditions and the following disclaimer. - -2. Redistributions in binary form must reproduce the above copyright -notice, this list of conditions and the following disclaimer in the -documentation and/or other materials provided with the distribution. 
- -THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" -AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE -LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR -CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF -SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS -INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN -CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) -ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -POSSIBILITY OF SUCH DAMAGE. \ No newline at end of file diff --git a/pom.xml b/pom.xml index 56a34cedde51..a98efe8aed1e 100644 --- a/pom.xml +++ b/pom.xml @@ -201,7 +201,6 @@ 3.1.9 3.0.12 2.12.7 -3.5.2 3.0.0 2.2.11 0.16.0 @@ -2783,11 +2782,6 @@ joda-time ${joda.version} - -org.jodd -jodd-core -${jodd.version} - org.datanucleus datanucleus-core diff --git a/sql/hive/pom.xml b/sql/hive/pom.xml index 3895d9dc5a63..5e9fc256e7e6 100644 --- a/sql/hive/pom.xml +++ b/sql/hive/pom.xml @@ -152,10 +152,6 @@ joda-time joda-time - - org.jodd - jodd-core - com.google.code.findbugs jsr305 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (33cac4436e59 -> 2df494fd4e4e)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 33cac4436e59 [SPARK-47847][CORE] Deprecate `spark.network.remoteReadNioBufferConversion` add 2df494fd4e4e [SPARK-48158][SQL] Add collation support for XML expressions No new revisions were added by this update. Summary of changes: .../sql/catalyst/expressions/xmlExpressions.scala | 9 +- .../spark/sql/CollationSQLExpressionsSuite.scala | 124 + 2 files changed, 129 insertions(+), 4 deletions(-)