(spark) branch branch-3.5 updated: [SPARK-47847][CORE] Deprecate `spark.network.remoteReadNioBufferConversion`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new c048653435f9 [SPARK-47847][CORE] Deprecate `spark.network.remoteReadNioBufferConversion`
c048653435f9 is described below

commit c048653435f9b7c832f79d38a504a145a17654c0
Author: Cheng Pan
AuthorDate: Thu May 9 22:55:07 2024 -0700

    [SPARK-47847][CORE] Deprecate `spark.network.remoteReadNioBufferConversion`

    ### What changes were proposed in this pull request?

    `spark.network.remoteReadNioBufferConversion` was introduced in
    https://github.com/apache/spark/commit/2c82745686f4456c4d5c84040a431dcb5b6cb60b
    as an escape hatch to disable [SPARK-24307](https://issues.apache.org/jira/browse/SPARK-24307)
    for safety. Throughout the whole Spark 3 period there have been no negative
    reports, which shows that [SPARK-24307](https://issues.apache.org/jira/browse/SPARK-24307)
    is solid enough. I propose to mark this configuration deprecated in 3.5.2
    and remove it in 4.1.0 or later.

    ### Why are the changes needed?

    Code cleanup.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Pass GA.

    ### Was this patch authored or co-authored using generative AI tooling?

    No

    Closes #46047 from pan3793/SPARK-47847.

    Authored-by: Cheng Pan
    Signed-off-by: Dongjoon Hyun
    (cherry picked from commit 33cac4436e593c9c501c5ff0eedf923d3a21899c)
    Signed-off-by: Dongjoon Hyun
---
 core/src/main/scala/org/apache/spark/SparkConf.scala | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/core/src/main/scala/org/apache/spark/SparkConf.scala b/core/src/main/scala/org/apache/spark/SparkConf.scala
index 813a14acd19e..f49e9e357c84 100644
--- a/core/src/main/scala/org/apache/spark/SparkConf.scala
+++ b/core/src/main/scala/org/apache/spark/SparkConf.scala
@@ -638,7 +638,9 @@ private[spark] object SparkConf extends Logging {
       DeprecatedConfig("spark.blacklist.killBlacklistedExecutors", "3.1.0",
         "Please use spark.excludeOnFailure.killExcludedExecutors"),
       DeprecatedConfig("spark.yarn.blacklist.executor.launch.blacklisting.enabled", "3.1.0",
-        "Please use spark.yarn.executor.launch.excludeOnFailure.enabled")
+        "Please use spark.yarn.executor.launch.excludeOnFailure.enabled"),
+      DeprecatedConfig("spark.network.remoteReadNioBufferConversion", "3.5.2",
+        "Please open a JIRA ticket to report it if you need to use this configuration.")
     )

     Map(configs.map { cfg => (cfg.key -> cfg) } : _*)
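For context, each `DeprecatedConfig` entry above feeds a key-indexed map that `SparkConf` consults when an application sets an old key, so deprecated settings keep working but log a warning. A minimal sketch of that pattern (the registry object and method names here are simplified illustrations, not Spark's actual internals):

```
// Sketch of a deprecation registry keyed by config name (illustrative names).
case class DeprecatedConfig(key: String, version: String, deprecationMessage: String)

object ConfRegistry {
  private val deprecated: Map[String, DeprecatedConfig] = Seq(
    DeprecatedConfig("spark.network.remoteReadNioBufferConversion", "3.5.2",
      "Please open a JIRA ticket to report it if you need to use this configuration.")
  ).map(cfg => cfg.key -> cfg).toMap

  // Log instead of failing, so existing jobs keep running after an upgrade.
  def warnIfDeprecated(key: String): Unit =
    deprecated.get(key).foreach { cfg =>
      println(s"WARN: ${cfg.key} is deprecated as of Spark ${cfg.version}. ${cfg.deprecationMessage}")
    }
}

ConfRegistry.warnIfDeprecated("spark.network.remoteReadNioBufferConversion")
```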
(spark) branch master updated (8ccc8b92be50 -> 33cac4436e59)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 8ccc8b92be50 [SPARK-48201][DOCS][PYTHON] Make some corrections in the docstring of pyspark DataStreamReader methods
     add 33cac4436e59 [SPARK-47847][CORE] Deprecate `spark.network.remoteReadNioBufferConversion`

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/SparkConf.scala | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
(spark) branch master updated: [SPARK-48201][DOCS][PYTHON] Make some corrections in the docstring of pyspark DataStreamReader methods
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 8ccc8b92be50 [SPARK-48201][DOCS][PYTHON] Make some corrections in the docstring of pyspark DataStreamReader methods
8ccc8b92be50 is described below

commit 8ccc8b92be50b1d5ef932873403e62e28c478781
Author: Chloe He
AuthorDate: Thu May 9 22:07:04 2024 -0700

    [SPARK-48201][DOCS][PYTHON] Make some corrections in the docstring of pyspark DataStreamReader methods

    ### What changes were proposed in this pull request?

    The docstrings of the pyspark DataStreamReader methods `csv()` and `text()`
    say that the `path` parameter can be a list, but passing a list actually
    raises an error.

    ### Why are the changes needed?

    The documentation is wrong.

    ### Does this PR introduce _any_ user-facing change?

    Yes. It fixes the documentation.

    ### How was this patch tested?

    N/A

    ### Was this patch authored or co-authored using generative AI tooling?

    No

    Closes #46416 from chloeh13q/fix/streamread-docstring.

    Authored-by: Chloe He
    Signed-off-by: Dongjoon Hyun
---
 python/pyspark/sql/streaming/readwriter.py | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/python/pyspark/sql/streaming/readwriter.py b/python/pyspark/sql/streaming/readwriter.py
index c2b75dd8f167..b202a499e8b0 100644
--- a/python/pyspark/sql/streaming/readwriter.py
+++ b/python/pyspark/sql/streaming/readwriter.py
@@ -553,8 +553,8 @@ class DataStreamReader(OptionUtils):

         Parameters
         ----------
-        path : str or list
-            string, or list of strings, for input path(s).
+        path : str
+            string for input path.

         Other Parameters
         ----------------
@@ -641,8 +641,8 @@ class DataStreamReader(OptionUtils):

         Parameters
         ----------
-        path : str or list
-            string, or list of strings, for input path(s).
+        path : str
+            string for input path.
         schema : :class:`pyspark.sql.types.StructType` or str, optional
             an optional :class:`pyspark.sql.types.StructType` for the input schema
             or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``).
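The fix is documentation-only, and the single-path contract it documents matches the Scala API as well. A minimal Scala sketch of the documented usage (the input path is illustrative):

```
// The streaming text source takes exactly one path string; a list of paths
// is not accepted.
val stream = spark.readStream.text("/tmp/streaming-input")
stream.isStreaming // true
```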
(spark) branch master updated: [SPARK-48228][PYTHON][CONNECT] Implement the missing function validation in ApplyInXXX
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 9bb15db85e53 [SPARK-48228][PYTHON][CONNECT] Implement the missing function validation in ApplyInXXX
9bb15db85e53 is described below

commit 9bb15db85e53b69b9c0ba112cd1dd93d8213eea4
Author: Ruifeng Zheng
AuthorDate: Thu May 9 22:01:13 2024 -0700

    [SPARK-48228][PYTHON][CONNECT] Implement the missing function validation in ApplyInXXX

    ### What changes were proposed in this pull request?

    Implement the missing function validation in ApplyInXXX.

    https://github.com/apache/spark/pull/46397 fixed this issue for
    `Cogrouped.ApplyInPandas`; this PR fixes the remaining methods.

    ### Why are the changes needed?

    For a better error message:

    ```
    In [12]: df1 = spark.range(11)

    In [13]: df2 = df1.groupby("id").applyInPandas(lambda: 1, StructType([StructField("d", DoubleType())]))

    In [14]: df2.show()
    ```

    Before this PR, an invalid function caused weird execution errors:

    ```
    24/05/10 11:37:36 ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 36)
    org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1834, in main
        process()
      File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1826, in process
        serializer.dump_stream(out_iter, outfile)
      File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 531, in dump_stream
        return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
      File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 104, in dump_stream
        for batch in iterator:
      File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 524, in init_stream_yield_batches
        for series in iterator:
      File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1610, in mapper
        return f(keys, vals)
      File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 488, in <lambda>
        return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
      File "/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 483, in wrapped
        result, return_type, _assign_cols_by_name, truncate_return_schema=False
    UnboundLocalError: cannot access local variable 'result' where it is not associated with a value

        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:523)
        at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:117)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:479)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
        at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:896)
        ...
    ```

    After this PR, the error happens before execution, which is consistent with
    Spark Classic, and is much clearer:

    ```
    PySparkValueError: [INVALID_PANDAS_UDF] Invalid function: pandas_udf with function type GROUPED_MAP or the function in groupby.applyInPandas must take either one argument (data) or two arguments (key, data).
    ```

    ### Does this PR introduce _any_ user-facing change?

    Yes, error messages change.

    ### How was this patch tested?

    Added tests.

    ### Was this patch authored or co-authored using generative AI tooling?

    No

    Closes #46519 from zhengruifeng/missing_check_in_group.

    Authored-by: Ruifeng Zheng
    Signed-off-by: Dongjoon Hyun
---
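The gist of the change is to validate the user function's arity when the plan is built rather than deep inside the Python worker. A minimal Scala sketch of that fail-fast pattern (`validateUserFunction` is an illustrative helper, not the actual implementation):

```
// Reject an ill-formed user function at plan-construction time with an
// actionable message, instead of failing mid-execution with an obscure trace.
def validateUserFunction(paramCount: Int): Unit = {
  val allowed = Set(1, 2) // (data) or (key, data)
  require(
    allowed.contains(paramCount),
    s"Invalid function: must take either one argument (data) or " +
      s"two arguments (key, data), but it takes $paramCount")
}

validateUserFunction(2) // ok
validateUserFunction(0) // IllegalArgumentException with a clear message
```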
(spark) branch master updated: [SPARK-48224][SQL] Disallow map keys from being of variant type
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new b371e7dd8800 [SPARK-48224][SQL] Disallow map keys from being of variant type
b371e7dd8800 is described below

commit b371e7dd88009195740f8f5b591447441ea43d0b
Author: Harsh Motwani
AuthorDate: Thu May 9 21:47:05 2024 -0700

    [SPARK-48224][SQL] Disallow map keys from being of variant type

    ### What changes were proposed in this pull request?

    This PR disallows map keys from being of variant type. Therefore, SQL
    statements like `select map(parse_json('{"a": 1}'), 1)`, which worked
    earlier, will now throw an exception.

    ### Why are the changes needed?

    Allowing variant as the key type of a map can result in undefined behavior,
    as this has not been tested.

    ### Does this PR introduce _any_ user-facing change?

    Yes, users could previously use variants as keys in maps. This PR disallows
    that.

    ### How was this patch tested?

    Unit tests

    ### Was this patch authored or co-authored using generative AI tooling?

    No

    Closes #46516 from harshmotw-db/map_variant_key.

    Authored-by: Harsh Motwani
    Signed-off-by: Dongjoon Hyun
---
 .../apache/spark/sql/catalyst/util/TypeUtils.scala |  2 +-
 .../catalyst/expressions/ComplexTypeSuite.scala    | 34 +-
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala
index d2c708b380cf..a0d578c66e73 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala
@@ -58,7 +58,7 @@ object TypeUtils extends QueryErrorsBase {
   }

   def checkForMapKeyType(keyType: DataType): TypeCheckResult = {
-    if (keyType.existsRecursively(_.isInstanceOf[MapType])) {
+    if (keyType.existsRecursively(dt => dt.isInstanceOf[MapType] || dt.isInstanceOf[VariantType])) {
       DataTypeMismatch(
         errorSubClass = "INVALID_MAP_KEY_TYPE",
         messageParameters = Map(

diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala
index 5f135e46a377..497b335289b1 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala
@@ -28,7 +28,7 @@ import org.apache.spark.sql.catalyst.util._
 import org.apache.spark.sql.catalyst.util.TypeUtils.ordinalNumber
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
-import org.apache.spark.unsafe.types.UTF8String
+import org.apache.spark.unsafe.types.{UTF8String, VariantVal}

 class ComplexTypeSuite extends SparkFunSuite with ExpressionEvalHelper {

@@ -359,6 +359,38 @@ class ComplexTypeSuite extends SparkFunSuite with ExpressionEvalHelper {
       )
     }

+    // map key can't be variant
+    val map6 = CreateMap(Seq(
+      Literal.create(new VariantVal(Array[Byte](), Array[Byte]())),
+      Literal.create(1)
+    ))
+    map6.checkInputDataTypes() match {
+      case TypeCheckResult.TypeCheckSuccess => fail("should not allow variant as a part of map key")
+      case TypeCheckResult.DataTypeMismatch(errorSubClass, messageParameters) =>
+        assert(errorSubClass == "INVALID_MAP_KEY_TYPE")
+        assert(messageParameters === Map("keyType" -> "\"VARIANT\""))
+    }
+
+    // map key can't contain variant
+    val map7 = CreateMap(
+      Seq(
+        CreateStruct(
+          Seq(Literal.create(1), Literal.create(new VariantVal(Array[Byte](), Array[Byte]())))
+        ),
+        Literal.create(1)
+      )
+    )
+    map7.checkInputDataTypes() match {
+      case TypeCheckResult.TypeCheckSuccess => fail("should not allow variant as a part of map key")
+      case TypeCheckResult.DataTypeMismatch(errorSubClass, messageParameters) =>
+        assert(errorSubClass == "INVALID_MAP_KEY_TYPE")
+        assert(
+          messageParameters === Map(
+            "keyType" -> "\"STRUCT\""
+          )
+        )
+    }
+
   test("MapFromArrays") {
     val intSeq = Seq(5, 10, 15, 20, 25)
     val longSeq = intSeq.map(_.toLong)
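The user-facing effect, sketched for spark-shell (assumes a build that contains this change):

```
// A variant map key is now rejected at analysis time instead of producing
// undefined behavior at runtime.
spark.sql("""SELECT map(parse_json('{"a": 1}'), 1)""").show()
// => AnalysisException with error subclass INVALID_MAP_KEY_TYPE
//    (DATATYPE_MISMATCH), raised before the query executes.
```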
(spark) branch master updated: [SPARK-47018][BUILD][SQL] Bump built-in Hive to 2.3.10
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 2d609bfd37ae [SPARK-47018][BUILD][SQL] Bump built-in Hive to 2.3.10
2d609bfd37ae is described below

commit 2d609bfd37ae9a0877fb72d1ba0479bb04a2dad6
Author: Cheng Pan
AuthorDate: Thu May 9 21:31:50 2024 -0700

    [SPARK-47018][BUILD][SQL] Bump built-in Hive to 2.3.10

    ### What changes were proposed in this pull request?

    This PR aims to bump Spark's built-in Hive from 2.3.9 to Hive 2.3.10, with
    two additional changes:

    - due to breaking API changes in Thrift, `libthrift` is upgraded from `0.12` to `0.16`.
    - remove version management of `commons-lang:2.6`; it comes from Hive's
      transitive dependencies, and Hive 2.3.10 drops it in https://github.com/apache/hive/pull/4892

    This is the first part of https://github.com/apache/spark/pull/45372

    ### Why are the changes needed?

    Bump Hive to the latest version of the 2.3 line, prepare for upgrading
    Guava, and drop vulnerable dependencies like Jackson 1.x / Jodd.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Pass GA. (wait for sunchao to complete the 2.3.10 release to make jars
    visible on Maven Central)

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #45372
    Closes #46468 from pan3793/SPARK-47018.

    Lead-authored-by: Cheng Pan
    Co-authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 connector/kafka-0-10-assembly/pom.xml              |  5
 connector/kinesis-asl-assembly/pom.xml             |  5
 dev/deps/spark-deps-hadoop-3-hive-2.3              | 27 +-
 docs/building-spark.md                             |  4 +--
 docs/sql-data-sources-hive-tables.md               |  8 +++---
 docs/sql-migration-guide.md                        |  2 +-
 pom.xml                                            | 31 +-
 .../hive/service/auth/KerberosSaslHelper.java      |  5 ++--
 .../apache/hive/service/auth/PlainSaslHelper.java  |  3 ++-
 .../hive/service/auth/TSetIpAddressProcessor.java  |  5 ++--
 .../service/cli/thrift/ThriftBinaryCLIService.java |  6 -
 .../hive/service/cli/thrift/ThriftCLIService.java  | 10 +++
 .../org/apache/spark/sql/hive/HiveUtils.scala      |  2 +-
 .../org/apache/spark/sql/hive/client/package.scala |  5 ++--
 .../hive/HiveExternalCatalogVersionsSuite.scala    |  1 -
 .../spark/sql/hive/HiveSparkSubmitSuite.scala      | 10 +++
 .../spark/sql/hive/execution/HiveQuerySuite.scala  |  6 ++---
 17 files changed, 61 insertions(+), 74 deletions(-)

diff --git a/connector/kafka-0-10-assembly/pom.xml b/connector/kafka-0-10-assembly/pom.xml
index b2fcbdf8eca7..bd311b3a9804 100644
--- a/connector/kafka-0-10-assembly/pom.xml
+++ b/connector/kafka-0-10-assembly/pom.xml
@@ -54,11 +54,6 @@
       <artifactId>commons-codec</artifactId>
       <scope>provided</scope>
     </dependency>
-    <dependency>
-      <groupId>commons-lang</groupId>
-      <artifactId>commons-lang</artifactId>
-      <scope>provided</scope>
-    </dependency>
     <dependency>
       <groupId>com.google.protobuf</groupId>
       <artifactId>protobuf-java</artifactId>

diff --git a/connector/kinesis-asl-assembly/pom.xml b/connector/kinesis-asl-assembly/pom.xml
index 577ec2153083..0e93526fce72 100644
--- a/connector/kinesis-asl-assembly/pom.xml
+++ b/connector/kinesis-asl-assembly/pom.xml
@@ -54,11 +54,6 @@
       <artifactId>jackson-databind</artifactId>
      <scope>provided</scope>
     </dependency>
-    <dependency>
-      <groupId>commons-lang</groupId>
-      <artifactId>commons-lang</artifactId>
-      <scope>provided</scope>
-    </dependency>
     <dependency>
       <groupId>org.glassfish.jersey.core</groupId>
       <artifactId>jersey-client</artifactId>

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 73d41e9eeb33..392bacd73277 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -46,7 +46,6 @@ commons-compress/1.26.1//commons-compress-1.26.1.jar
 commons-crypto/1.1.0//commons-crypto-1.1.0.jar
 commons-dbcp/1.4//commons-dbcp-1.4.jar
 commons-io/2.16.1//commons-io-2.16.1.jar
-commons-lang/2.6//commons-lang-2.6.jar
 commons-lang3/3.14.0//commons-lang3-3.14.0.jar
 commons-math3/3.6.1//commons-math3-3.6.1.jar
 commons-pool/1.5.4//commons-pool-1.5.4.jar
@@ -81,19 +80,19 @@ hadoop-cloud-storage/3.4.0//hadoop-cloud-storage-3.4.0.jar
 hadoop-huaweicloud/3.4.0//hadoop-huaweicloud-3.4.0.jar
 hadoop-shaded-guava/1.2.0//hadoop-shaded-guava-1.2.0.jar
 hadoop-yarn-server-web-proxy/3.4.0//hadoop-yarn-server-web-proxy-3.4.0.jar
-hive-beeline/2.3.9//hive-beeline-2.3.9.jar
-hive-cli/2.3.9//hive-cli-2.3.9.jar
-hive-common/2.3.9//hive-common-2.3.9.jar
-hive-exec/2.3.9/core/hive-exec-2.3.9-core.jar
-hive-jdbc/2.3.9//hive-jdbc-2.3.9.jar
-hive-llap-common/2.3.9//hive-llap-common-2.3.9.jar
-hive-metastore/2.3.9//hive-metastore-2.3.9.jar
-hive-serde/2.3.9//hive-serde-2.3.9.jar
+hive-beeline/2.3.10//hive-beeline-2.3.10.jar
+hive-cli/2.3.10//hive-cli-2.3.10.jar
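To check which built-in Hive a given build carries, one hedged sketch (the conf key is real; the value shown assumes a build with this bump):

```
// The Hive metastore version conf defaults to the built-in Hive client version.
spark.conf.get("spark.sql.hive.metastore.version") // "2.3.10" after this change
```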
(spark) branch master updated: [MINOR][BUILD] Remove duplicate configuration of maven-compiler-plugin
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 1138b2a68b54 [MINOR][BUILD] Remove duplicate configuration of maven-compiler-plugin
1138b2a68b54 is described below

commit 1138b2a68b5408e6d079bdbce8026323694628e5
Author: zml1206
AuthorDate: Thu May 9 20:51:32 2024 -0700

    [MINOR][BUILD] Remove duplicate configuration of maven-compiler-plugin

    ### What changes were proposed in this pull request?

    `<release>${java.version}</release>` and `<source>${java.version}</source>`
    (https://github.com/apache/spark/pull/46024/files#diff-9c5fb3d1b7e3b0f54bc5c4182965c4fe1f9023d449017cece3005d3f90e8e4d8R117)
    are equivalent, duplicate configuration, so remove `<source>${java.version}</source>`.
    https://maven.apache.org/plugins/maven-compiler-plugin/examples/set-compiler-release.html

    ### Why are the changes needed?

    Simplify the code and facilitate subsequent configuration iterations.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Pass the CIs.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #46024 from zml1206/remove_duplicate_configuration.

    Authored-by: zml1206
    Signed-off-by: Dongjoon Hyun
---
 pom.xml | 1 -
 1 file changed, 1 deletion(-)

diff --git a/pom.xml b/pom.xml
index c3ff5d101c22..678455e6e248 100644
--- a/pom.xml
+++ b/pom.xml
@@ -3127,7 +3127,6 @@
         <artifactId>maven-compiler-plugin</artifactId>
         <version>3.13.0</version>
-        <source>${java.version}</source>
         <skipMain>true</skipMain>
         <skip>true</skip>
(spark) branch master updated: [SPARK-47834][SQL][CONNECT] Mark deprecated functions with `@deprecated` in `SQLImplicits`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 32b2827b964b [SPARK-47834][SQL][CONNECT] Mark deprecated functions with `@deprecated` in `SQLImplicits`
32b2827b964b is described below

commit 32b2827b964bd4a4accb60b47ddd6929f41d4a89
Author: YangJie
AuthorDate: Thu May 9 20:47:34 2024 -0700

    [SPARK-47834][SQL][CONNECT] Mark deprecated functions with `@deprecated` in `SQLImplicits`

    ### What changes were proposed in this pull request?

    In the `sql` module, some functions in `SQLImplicits` have been marked as
    deprecated in their Scaladoc comments since SPARK-19089. This PR adds the
    `@deprecated` annotation to them. Since SPARK-19089 landed in Spark 2.2.0,
    the `since` field of `@deprecated` is filled in as `2.2.0`. The same
    annotations are also synchronized to the corresponding functions in
    `SQLImplicits` in the `connect` module.

    ### Why are the changes needed?

    Mark deprecated functions with `@deprecated` in `SQLImplicits`.

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    Pass Github Actions

    ### Was this patch authored or co-authored using generative AI tooling?

    No

    Closes #46029 from LuciferYang/deprecated-SQLImplicits.

    Lead-authored-by: YangJie
    Co-authored-by: yangjie01
    Signed-off-by: Dongjoon Hyun
---
 .../jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala  | 9 +++++++++
 sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala | 9 +++++++++
 2 files changed, 18 insertions(+)

diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
index 6c626fd716d5..7799d395d5c6 100644
--- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
+++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
@@ -149,6 +149,7 @@ abstract class SQLImplicits private[sql] (session: SparkSession) extends LowPrio
    * @deprecated
    *   use [[newSequenceEncoder]]
    */
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newIntSeqEncoder: Encoder[Seq[Int]] = newSeqEncoder(PrimitiveIntEncoder)

   /**
@@ -156,6 +157,7 @@ abstract class SQLImplicits private[sql] (session: SparkSession) extends LowPrio
    * @deprecated
    *   use [[newSequenceEncoder]]
    */
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newLongSeqEncoder: Encoder[Seq[Long]] = newSeqEncoder(PrimitiveLongEncoder)

   /**
@@ -163,6 +165,7 @@ abstract class SQLImplicits private[sql] (session: SparkSession) extends LowPrio
    * @deprecated
    *   use [[newSequenceEncoder]]
    */
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newDoubleSeqEncoder: Encoder[Seq[Double]] = newSeqEncoder(PrimitiveDoubleEncoder)

   /**
@@ -170,6 +173,7 @@ abstract class SQLImplicits private[sql] (session: SparkSession) extends LowPrio
    * @deprecated
    *   use [[newSequenceEncoder]]
    */
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newFloatSeqEncoder: Encoder[Seq[Float]] = newSeqEncoder(PrimitiveFloatEncoder)

   /**
@@ -177,6 +181,7 @@ abstract class SQLImplicits private[sql] (session: SparkSession) extends LowPrio
    * @deprecated
    *   use [[newSequenceEncoder]]
    */
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newByteSeqEncoder: Encoder[Seq[Byte]] = newSeqEncoder(PrimitiveByteEncoder)

   /**
@@ -184,6 +189,7 @@ abstract class SQLImplicits private[sql] (session: SparkSession) extends LowPrio
    * @deprecated
    *   use [[newSequenceEncoder]]
    */
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newShortSeqEncoder: Encoder[Seq[Short]] = newSeqEncoder(PrimitiveShortEncoder)

   /**
@@ -191,6 +197,7 @@ abstract class SQLImplicits private[sql] (session: SparkSession) extends LowPrio
    * @deprecated
    *   use [[newSequenceEncoder]]
    */
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newBooleanSeqEncoder: Encoder[Seq[Boolean]] = newSeqEncoder(PrimitiveBooleanEncoder)

   /**
@@ -198,6 +205,7 @@ abstract class SQLImplicits private[sql] (session: SparkSession) extends LowPrio
    * @deprecated
    *   use [[newSequenceEncoder]]
    */
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newStringSeqEncoder: Encoder[Seq[String]] = newSeqEncoder(StringEncoder)

   /**
@@ -205,6 +213,7 @@ abstract class SQLImplicits private[sql] (session: SparkSession) extends LowPrio
    * @deprecated
    *   use [[newSequenceEncoder]]
    */
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   def newProductSeqEncoder[A <:
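How the annotation surfaces to users, in a minimal compilable sketch (the object and members here are stand-ins, not the real encoders):

```
object Encoders {
  @deprecated("Use newSequenceEncoder instead", "2.2.0")
  val newIntSeqEncoder: String = "legacy"

  val newSequenceEncoder: String = "preferred"
}

// scalac now warns at the call site:
//   value newIntSeqEncoder in object Encoders is deprecated (since 2.2.0):
//   Use newSequenceEncoder instead
val e = Encoders.newIntSeqEncoder
```

Unlike the Scaladoc-only `@deprecated` tag, the annotation makes the compiler emit the warning, so downstream builds actually see it.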
(spark) branch master updated: [SPARK-48176][SQL] Adjust name of FIELD_ALREADY_EXISTS error condition
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new a41d0ae79b43 [SPARK-48176][SQL] Adjust name of FIELD_ALREADY_EXISTS error condition
a41d0ae79b43 is described below

commit a41d0ae79b432e2757379fc56a0ad2755f02e871
Author: Nicholas Chammas
AuthorDate: Fri May 10 12:23:34 2024 +0900

    [SPARK-48176][SQL] Adjust name of FIELD_ALREADY_EXISTS error condition

    ### What changes were proposed in this pull request?

    Rename `FIELDS_ALREADY_EXISTS` to `FIELD_ALREADY_EXISTS`.

    ### Why are the changes needed?

    Though it's not meant to be a proper English sentence,
    `FIELDS_ALREADY_EXISTS` is grammatically incorrect. It should be either
    "fields already exist" or "field already exists". I opted for the latter.

    ### Does this PR introduce _any_ user-facing change?

    Yes, it changes the name of an error condition.

    ### How was this patch tested?

    CI only.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #46510 from nchammas/SPARK-48176-field-exists-error.

    Authored-by: Nicholas Chammas
    Signed-off-by: Hyukjin Kwon
---
 common/utils/src/main/resources/error/error-conditions.json          | 2 +-
 .../src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala     | 4 ++--
 .../scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala | 2 +-
 .../test/scala/org/apache/spark/sql/connector/AlterTableTests.scala  | 4 ++--
 .../apache/spark/sql/connector/V2CommandsCaseSensitivitySuite.scala  | 4 ++--
 .../sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala    | 4 ++--
 6 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-conditions.json b/common/utils/src/main/resources/error/error-conditions.json
index 8a64c4c590e8..7c9886c749b9 100644
--- a/common/utils/src/main/resources/error/error-conditions.json
+++ b/common/utils/src/main/resources/error/error-conditions.json
@@ -1339,7 +1339,7 @@
     ],
     "sqlState" : "54001"
   },
-  "FIELDS_ALREADY_EXISTS" : {
+  "FIELD_ALREADY_EXISTS" : {
     "message" : [
       "Cannot <op> column, because <fieldNames> already exists in <struct>."
     ],

diff --git a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala
index c80fbfc748dd..b60107f90283 100644
--- a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala
+++ b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala
@@ -107,7 +107,7 @@ private[v2] trait V2JDBCTest extends SharedSparkSession with DockerIntegrationFu
       exception = intercept[AnalysisException] {
         sql(s"ALTER TABLE $catalogName.alt_table ADD COLUMNS (C3 DOUBLE)")
       },
-      errorClass = "FIELDS_ALREADY_EXISTS",
+      errorClass = "FIELD_ALREADY_EXISTS",
       parameters = Map(
         "op" -> "add",
         "fieldNames" -> "`C3`",
@@ -179,7 +179,7 @@ private[v2] trait V2JDBCTest extends SharedSparkSession with DockerIntegrationFu
       exception = intercept[AnalysisException] {
         sql(s"ALTER TABLE $catalogName.alt_table RENAME COLUMN ID1 TO ID2")
       },
-      errorClass = "FIELDS_ALREADY_EXISTS",
+      errorClass = "FIELD_ALREADY_EXISTS",
       parameters = Map(
         "op" -> "rename",
         "fieldNames" -> "`ID2`",

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index e55f23b6aa86..e18f4d1b36e1 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -1403,7 +1403,7 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
         if (struct.findNestedField(
             fieldNames, includeCollections = true, alter.conf.resolver).isDefined) {
           alter.failAnalysis(
-            errorClass = "FIELDS_ALREADY_EXISTS",
+            errorClass = "FIELD_ALREADY_EXISTS",
             messageParameters = Map(
               "op" -> op,
               "fieldNames" -> toSQLId(fieldNames),

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala b/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala
index 996d7acb1148..28605958c71d 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala
@@ -466,7 +466,7 @@
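How the renamed condition surfaces to a caller, sketched in the style of the tests above (the table name and catalog are hypothetical):

```
// Adding a column that already exists now reports FIELD_ALREADY_EXISTS
// (singular) as the error condition name.
val e = intercept[AnalysisException] {
  sql(s"ALTER TABLE $catalogName.alt_table ADD COLUMNS (C3 DOUBLE)")
}
assert(e.getErrorClass == "FIELD_ALREADY_EXISTS")
```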
(spark) branch master updated: [SPARK-48222][INFRA][DOCS] Sync Ruby Bundler to 2.4.22 and refresh Gem lock file
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 9a2818820f11 [SPARK-48222][INFRA][DOCS] Sync Ruby Bundler to 2.4.22 and refresh Gem lock file
9a2818820f11 is described below

commit 9a2818820f11f9bdcc042f4ab80850918911c68c
Author: Nicholas Chammas
AuthorDate: Fri May 10 09:58:16 2024 +0800

    [SPARK-48222][INFRA][DOCS] Sync Ruby Bundler to 2.4.22 and refresh Gem lock file

    ### What changes were proposed in this pull request?

    Sync the version of Bundler that we are using across various scripts and
    documentation. Also refresh the Gem lock file.

    ### Why are the changes needed?

    We are seeing inconsistent build behavior, likely due to the inconsistent
    Bundler versions.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    CI + the preview release process.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #46512 from nchammas/bundler-sync.

    Authored-by: Nicholas Chammas
    Signed-off-by: Wenchen Fan
---
 .github/workflows/build_and_test.yml   |  3 +++
 dev/create-release/spark-rm/Dockerfile |  2 +-
 docs/Gemfile.lock                      | 16 ++++++++--------
 docs/README.md                         |  2 +-
 4 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 4a11823aee60..881fb8cb0674 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -872,6 +872,9 @@ jobs:
         python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421
     - name: Install dependencies for documentation generation
       run: |
+        # Keep the version of Bundler here in sync with the following locations:
+        # - dev/create-release/spark-rm/Dockerfile
+        # - docs/README.md
         gem install bundler -v 2.4.22
         cd docs
         bundle install

diff --git a/dev/create-release/spark-rm/Dockerfile b/dev/create-release/spark-rm/Dockerfile
index 8d5ca38ba88e..13f4112ca03d 100644
--- a/dev/create-release/spark-rm/Dockerfile
+++ b/dev/create-release/spark-rm/Dockerfile
@@ -38,7 +38,7 @@ ENV DEBCONF_NONINTERACTIVE_SEEN true
 ARG APT_INSTALL="apt-get install --no-install-recommends -y"

 ARG PIP_PKGS="sphinx==4.5.0 mkdocs==1.1.2 numpy==1.20.3 pydata_sphinx_theme==0.13.3 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 jinja2==3.1.2 twine==3.4.1 sphinx-plotly-directive==0.1.3 sphinx-copybutton==0.5.2 pandas==2.0.3 pyarrow==10.0.1 plotly==5.4.0 markupsafe==2.0.1 docutils<0.17 grpcio==1.62.0 protobuf==4.21.6 grpcio-status==1.62.0 googleapis-common-protos==1.56.4"
-ARG GEM_PKGS="bundler:2.3.8"
+ARG GEM_PKGS="bundler:2.4.22"

 # Install extra needed repos and refresh.
 # - CRAN repo

diff --git a/docs/Gemfile.lock b/docs/Gemfile.lock
index 4e38f18703f3..e137f0f039b9 100644
--- a/docs/Gemfile.lock
+++ b/docs/Gemfile.lock
@@ -4,16 +4,16 @@ GEM
     addressable (2.8.6)
       public_suffix (>= 2.0.2, < 6.0)
     colorator (1.1.0)
-    concurrent-ruby (1.2.2)
+    concurrent-ruby (1.2.3)
     em-websocket (0.5.3)
       eventmachine (>= 0.12.9)
       http_parser.rb (~> 0)
     eventmachine (1.2.7)
     ffi (1.16.3)
     forwardable-extended (2.6.0)
-    google-protobuf (3.25.2)
+    google-protobuf (3.25.3)
     http_parser.rb (0.8.0)
-    i18n (1.14.1)
+    i18n (1.14.5)
       concurrent-ruby (~> 1.0)
     jekyll (4.3.3)
       addressable (~> 2.4)
@@ -42,22 +42,22 @@ GEM
     kramdown-parser-gfm (1.1.0)
       kramdown (~> 2.0)
     liquid (4.0.4)
-    listen (3.8.0)
+    listen (3.9.0)
       rb-fsevent (~> 0.10, >= 0.10.3)
       rb-inotify (~> 0.9, >= 0.9.10)
     mercenary (0.4.0)
     pathutil (0.16.2)
       forwardable-extended (~> 2.6)
-    public_suffix (5.0.4)
-    rake (13.1.0)
+    public_suffix (5.0.5)
+    rake (13.2.1)
     rb-fsevent (0.11.2)
     rb-inotify (0.10.1)
       ffi (~> 1.0)
     rexml (3.2.6)
     rouge (3.30.0)
     safe_yaml (1.0.5)
-    sass-embedded (1.69.7)
-      google-protobuf (~> 3.25)
+    sass-embedded (1.63.6)
+      google-protobuf (~> 3.23)
       rake (>= 13.0.0)
     terminal-table (3.0.2)
       unicode-display_width (>= 1.1.1, < 3)

diff --git a/docs/README.md b/docs/README.md
index 414c8dbd8303..363f1c207636 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -36,7 +36,7 @@ You need to have [Ruby 3][ruby] and [Python 3][python] installed. Make sure the
 [python]: https://www.python.org/downloads/

 ```sh
-$ gem install bundler
+$ gem install bundler -v 2.4.22
 ```

 After this all the required Ruby dependencies can be installed from the `docs/` directory via Bundler:
(spark) branch master updated: [SPARK-48227][PYTHON][DOC] Document the requirement of seed in protos
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 012d19d8e9b2 [SPARK-48227][PYTHON][DOC] Document the requirement of seed in protos
012d19d8e9b2 is described below

commit 012d19d8e9b28f7ce266753bcfff4a76c9510245
Author: Ruifeng Zheng
AuthorDate: Thu May 9 16:58:44 2024 -0700

    [SPARK-48227][PYTHON][DOC] Document the requirement of seed in protos

    ### What changes were proposed in this pull request?

    Document the requirement of seed in protos.

    ### Why are the changes needed?

    The seed should be set at the client side; document it to avoid cases like
    https://github.com/apache/spark/pull/46456

    ### Does this PR introduce _any_ user-facing change?

    no

    ### How was this patch tested?

    ci

    ### Was this patch authored or co-authored using generative AI tooling?

    no

    Closes #46518 from zhengruifeng/doc_random.

    Authored-by: Ruifeng Zheng
    Signed-off-by: Dongjoon Hyun
---
 .../common/src/main/protobuf/spark/connect/relations.proto |  8 ++--
 python/pyspark/sql/connect/plan.py                         | 10 --
 python/pyspark/sql/connect/proto/relations_pb2.pyi         | 10 --
 3 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/connector/connect/common/src/main/protobuf/spark/connect/relations.proto b/connector/connect/common/src/main/protobuf/spark/connect/relations.proto
index 3882b2e85396..0b3c9d4253e8 100644
--- a/connector/connect/common/src/main/protobuf/spark/connect/relations.proto
+++ b/connector/connect/common/src/main/protobuf/spark/connect/relations.proto
@@ -467,7 +467,9 @@ message Sample {
   // (Optional) Whether to sample with replacement.
   optional bool with_replacement = 4;

-  // (Optional) The random seed.
+  // (Required) The random seed.
+  // This filed is required to avoid generate mutable dataframes (see SPARK-48184 for details),
+  // however, still keep it 'optional' here for backward compatibility.
   optional int64 seed = 5;

   // (Required) Explicitly sort the underlying plan to make the ordering deterministic or cache it.
@@ -687,7 +689,9 @@ message StatSampleBy {
   // If a stratum is not specified, we treat its fraction as zero.
   repeated Fraction fractions = 3;

-  // (Optional) The random seed.
+  // (Required) The random seed.
+  // This filed is required to avoid generate mutable dataframes (see SPARK-48184 for details),
+  // however, still keep it 'optional' here for backward compatibility.
   optional int64 seed = 5;

   message Fraction {

diff --git a/python/pyspark/sql/connect/plan.py b/python/pyspark/sql/connect/plan.py
index 4ac4946745f5..3d3303fb15c5 100644
--- a/python/pyspark/sql/connect/plan.py
+++ b/python/pyspark/sql/connect/plan.py
@@ -717,7 +717,7 @@ class Sample(LogicalPlan):
         lower_bound: float,
         upper_bound: float,
         with_replacement: bool,
-        seed: Optional[int],
+        seed: int,
         deterministic_order: bool = False,
     ) -> None:
         super().__init__(child)
@@ -734,8 +734,7 @@ class Sample(LogicalPlan):
         plan.sample.lower_bound = self.lower_bound
         plan.sample.upper_bound = self.upper_bound
         plan.sample.with_replacement = self.with_replacement
-        if self.seed is not None:
-            plan.sample.seed = self.seed
+        plan.sample.seed = self.seed
         plan.sample.deterministic_order = self.deterministic_order
         return plan
@@ -1526,7 +1525,7 @@ class StatSampleBy(LogicalPlan):
         child: Optional["LogicalPlan"],
         col: Column,
         fractions: Sequence[Tuple[Column, float]],
-        seed: Optional[int],
+        seed: int,
     ) -> None:
         super().__init__(child)
@@ -1554,8 +1553,7 @@ class StatSampleBy(LogicalPlan):
                 fraction.stratum.CopyFrom(k.to_plan(session).literal)
                 fraction.fraction = float(v)
                 plan.sample_by.fractions.append(fraction)
-        if self._seed is not None:
-            plan.sample_by.seed = self._seed
+        plan.sample_by.seed = self._seed
         return plan

diff --git a/python/pyspark/sql/connect/proto/relations_pb2.pyi b/python/pyspark/sql/connect/proto/relations_pb2.pyi
index 5dfb47da67a9..9b6f4b43544f 100644
--- a/python/pyspark/sql/connect/proto/relations_pb2.pyi
+++ b/python/pyspark/sql/connect/proto/relations_pb2.pyi
@@ -1865,7 +1865,10 @@ class Sample(google.protobuf.message.Message):
     with_replacement: builtins.bool
     """(Optional) Whether to sample with replacement."""
     seed: builtins.int
-    """(Optional) The random seed."""
+    """(Required) The random seed.
+    This filed is required to avoid generate mutable dataframes (see SPARK-48184 for details),
+    however, still keep it 'optional' here for
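Why a fixed client-side seed matters: without one, each re-execution of the same plan (for example a `show()` followed by a `count()`) could sample different rows. A sketch with the classic Dataset API:

```
val df = spark.range(100)
// The explicit seed makes the sample stable across re-executions of the plan.
val sampled = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)
sampled.count() // same result every time for the same seed
```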
(spark) branch master updated: [SPARK-48180][SQL] Improve error when UDTF call with TABLE arg forgets parentheses around multiple PARTITION/ORDER BY exprs
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 71f0eda71bc1 [SPARK-48180][SQL] Improve error when UDTF call with TABLE arg forgets parentheses around multiple PARTITION/ORDER BY exprs
71f0eda71bc1 is described below

commit 71f0eda71bc169a5245f4412ec0957728025a66c
Author: Daniel Tenedorio
AuthorDate: Fri May 10 08:44:07 2024 +0900

    [SPARK-48180][SQL] Improve error when UDTF call with TABLE arg forgets parentheses around multiple PARTITION/ORDER BY exprs

    ### What changes were proposed in this pull request?

    This PR improves the error message when a table-valued function call has a
    TABLE argument with a PARTITION BY or ORDER BY clause with more than one
    associated expression, but forgets the parentheses around them. For example:

    ```
    SELECT * FROM testUDTF(
      TABLE(SELECT 1 AS device_id, 2 AS data_ds)
      WITH SINGLE PARTITION
      ORDER BY device_id, data_ds)
    ```

    This query previously returned an obscure, unrelated error:

    ```
    [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_TABLE_ARGUMENT] Unsupported subquery expression: Table arguments are used in a function where they are not supported:
    'UnresolvedTableValuedFunction [tvf], [table-argument#338 [], 'data_ds], false
    +- Project [1 AS device_id#336, 2 AS data_ds#337]
       +- OneRowRelation
    ```

    Now it returns a reasonable error:

    ```
    The table function call includes a table argument with an invalid partitioning/ordering specification: the ORDER BY clause included multiple expressions without parentheses surrounding them; please add parentheses around these expressions and then retry the query again. (line 4, pos 2)

    == SQL ==
    SELECT * FROM testUDTF(
      TABLE(SELECT 1 AS device_id, 2 AS data_ds)
      WITH SINGLE PARTITION
    --^^^
      ORDER BY device_id, data_ds)
    ```

    ### Why are the changes needed?

    Here we improve error messages for common SQL syntax mistakes.

    ### Does this PR introduce _any_ user-facing change?

    Yes, see above.

    ### How was this patch tested?

    This PR adds test coverage.

    ### Was this patch authored or co-authored using generative AI tooling?

    No

    Closes #46451 from dtenedor/udtf-analyzer-bug.

    Authored-by: Daniel Tenedorio
    Signed-off-by: Hyukjin Kwon
---
 .../spark/sql/catalyst/parser/SqlBaseParser.g4 |  2 ++
 .../spark/sql/catalyst/parser/AstBuilder.scala | 14 ++++++++
 .../sql/catalyst/parser/PlanParserSuite.scala  | 19 +++-
 .../sql/execution/python/PythonUDTFSuite.scala | 26 ++
 4 files changed, 56 insertions(+), 5 deletions(-)

diff --git a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4 b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
index 653224c5475f..249f55fa40ac 100644
--- a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
+++ b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
@@ -838,9 +838,11 @@ tableArgumentPartitioning
     : ((WITH SINGLE PARTITION)
         | ((PARTITION | DISTRIBUTE) BY
             (((LEFT_PAREN partition+=expression (COMMA partition+=expression)* RIGHT_PAREN))
+              | (expression (COMMA invalidMultiPartitionExpression=expression)+)
              | partition+=expression)))
       ((ORDER | SORT) BY
         (((LEFT_PAREN sortItem (COMMA sortItem)* RIGHT_PAREN)
+          | (sortItem (COMMA invalidMultiSortItem=sortItem)+)
          | sortItem)))?
     ;

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
index 7d2355b2f08d..326f1e7684b9 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
@@ -1638,6 +1638,20 @@ class AstBuilder extends DataTypeAstBuilder with SQLConfHelper with Logging {
         }
         partitionByExpressions = p.partition.asScala.map(expression).toSeq
         orderByExpressions = p.sortItem.asScala.map(visitSortItem).toSeq
+        def invalidPartitionOrOrderingExpression(clause: String): String = {
+          "The table function call includes a table argument with an invalid " +
+            s"partitioning/ordering specification: the $clause clause included multiple " +
+            "expressions without parentheses surrounding them; please add parentheses around " +
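The workaround the new message points at, sketched as a `spark.sql` call (assumes a UDTF named `testUDTF` is registered):

```
// Parentheses group the multiple ORDER BY expressions of the TABLE argument.
spark.sql("""
  SELECT * FROM testUDTF(
    TABLE(SELECT 1 AS device_id, 2 AS data_ds)
    WITH SINGLE PARTITION
    ORDER BY (device_id, data_ds))
""")
```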
(spark) branch master updated (b47d7853d92f -> e704b9e56b0c)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from b47d7853d92f [SPARK-48148][CORE] JSON objects should not be modified when read as STRING
     add e704b9e56b0c [SPARK-48226][BUILD] Add `spark-ganglia-lgpl` to `lint-java` & `spark-ganglia-lgpl` and `jvm-profiler` to `sbt-checkstyle`

No new revisions were added by this update.

Summary of changes:
 dev/lint-java      | 2 +-
 dev/sbt-checkstyle | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
(spark) branch master updated: [SPARK-48148][CORE] JSON objects should not be modified when read as STRING
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new b47d7853d92f [SPARK-48148][CORE] JSON objects should not be modified when read as STRING
b47d7853d92f is described below

commit b47d7853d92f733791513094af04fc18ec947246
Author: Eric Maynard
AuthorDate: Fri May 10 08:41:24 2024 +0900

    [SPARK-48148][CORE] JSON objects should not be modified when read as STRING

    ### What changes were proposed in this pull request?

    Currently, when reading a JSON like this:
    ```
    {"a": {"b": -999.995}}
    ```

    With the schema:
    ```
    a STRING
    ```

    Spark will yield a result like this:
    ```
    {"b": -1000.0}
    ```

    Other changes, such as changes to the input string's whitespace, may also
    occur. In some cases, we apply scientific notation to an input
    floating-point number when reading it as STRING. This applies to reading
    JSON files (as with `spark.read.json`) as well as the SQL expression
    `from_json`.

    ### Why are the changes needed?

    Correctness issues may occur if a field is read as a STRING and then later
    parsed (e.g. with `from_json`) after the contents have been modified.

    ### Does this PR introduce _any_ user-facing change?

    Yes, when reading non-string fields from a JSON object using the STRING
    type, we will now extract the field exactly as it appears.

    ### How was this patch tested?

    Added a test in `JsonSuite.scala`

    ### Was this patch authored or co-authored using generative AI tooling?

    No

    Closes #46408 from eric-maynard/SPARK-48148.

    Lead-authored-by: Eric Maynard
    Co-authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
---
 .../spark/sql/catalyst/json/JacksonParser.scala    | 63 +++---
 .../org/apache/spark/sql/internal/SQLConf.scala    |  9
 .../sql/execution/datasources/json/JsonSuite.scala | 58
 3 files changed, 122 insertions(+), 8 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
index 5e75ff6f6e1a..b2c302fbbbe3 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
@@ -24,6 +24,7 @@ import scala.collection.mutable.ArrayBuffer
 import scala.util.control.NonFatal

 import com.fasterxml.jackson.core._
+import org.apache.hadoop.fs.PositionedReadable

 import org.apache.spark.SparkUpgradeException
 import org.apache.spark.internal.Logging
@@ -275,19 +276,63 @@ class JacksonParser(
       }
     }

-    case _: StringType =>
-      (parser: JsonParser) => parseJsonToken[UTF8String](parser, dataType) {
+    case _: StringType => (parser: JsonParser) => {
+      // This must be enabled if we will retrieve the bytes directly from the raw content:
+      val includeSourceInLocation = JsonParser.Feature.INCLUDE_SOURCE_IN_LOCATION
+      val originalMask = if (includeSourceInLocation.enabledIn(parser.getFeatureMask)) {
+        1
+      } else {
+        0
+      }
+      parser.overrideStdFeatures(includeSourceInLocation.getMask, includeSourceInLocation.getMask)
+      val result = parseJsonToken[UTF8String](parser, dataType) {
         case VALUE_STRING =>
           UTF8String.fromString(parser.getText)
-        case _ =>
+        case other =>
           // Note that it always tries to convert the data as string without the case of failure.
-          val writer = new ByteArrayOutputStream()
-          Utils.tryWithResource(factory.createGenerator(writer, JsonEncoding.UTF8)) {
-            generator => generator.copyCurrentStructure(parser)
+          val startLocation = parser.currentTokenLocation()
+          def skipAhead(): Unit = {
+            other match {
+              case START_OBJECT =>
+                parser.skipChildren()
+              case START_ARRAY =>
+                parser.skipChildren()
+              case _ =>
+                // Do nothing in this case; we've already read the token
+            }
           }
-          UTF8String.fromBytes(writer.toByteArray)
-      }
+
+          // PositionedReadable
+          startLocation.contentReference().getRawContent match {
+            case byteArray: Array[Byte] if exactStringParsing =>
+              skipAhead()
+              val endLocation = parser.currentLocation.getByteOffset
+
+              UTF8String.fromBytes(
+                byteArray,
+                startLocation.getByteOffset.toInt,
+                endLocation.toInt - (startLocation.getByteOffset.toInt))
+            case positionedReadable: PositionedReadable if
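The behavioral difference, sketched for spark-shell (assumes the exact-string-parsing flag this patch adds to `SQLConf` is enabled):

```
import spark.implicits._

val df = spark.read.schema("a STRING").json(Seq("""{"a": {"b": -999.995}}""").toDS())
df.show(truncate = false)
// before: {"b": -1000.0}   (value re-serialized after parsing, so it may be
//                           rounded or switched to scientific notation)
// after:  {"b": -999.995}  (raw bytes of the field extracted verbatim)
```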
(spark) branch branch-3.5 updated (da4c808be7d6 -> dc4911725baa)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

    from da4c808be7d6 [SPARK-48197][SQL][TESTS][FOLLOWUP][3.5] Regenerate golden files
     add dc4911725baa [SPARK-48089][SS][CONNECT] Fix 3.5 <> 4.0 StreamingQueryListener compatibility test

No new revisions were added by this update.

Summary of changes:
 .../connect/streaming/test_parity_listener.py | 89 +++---
 1 file changed, 44 insertions(+), 45 deletions(-)
(spark) branch branch-3.5 updated: [SPARK-48197][SQL][TESTS][FOLLOWUP][3.5] Regenerate golden files
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new da4c808be7d6 [SPARK-48197][SQL][TESTS][FOLLOWUP][3.5] Regenerate golden files
da4c808be7d6 is described below

commit da4c808be7d66dc61fdcb3b41254eef77298a72c
Author: Dongjoon Hyun
AuthorDate: Thu May 9 14:46:01 2024 -0700

    [SPARK-48197][SQL][TESTS][FOLLOWUP][3.5] Regenerate golden files

    ### What changes were proposed in this pull request?

    This PR is a follow-up to regenerate golden files for branch-3.5.

    - #46475

    ### Why are the changes needed?

    To recover the branch-3.5 CI.

    - https://github.com/apache/spark/actions/runs/9011670853/job/24786397001

    ```
    [info] *** 4 TESTS FAILED ***
    [error] Failed: Total 3036, Failed 4, Errors 0, Passed 3032, Ignored 3
    [error] Failed tests:
    [error]   org.apache.spark.sql.SQLQueryTestSuite
    ```

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Pass the CIs.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #46514 from dongjoon-hyun/SPARK-48197.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 .../sql-tests/analyzer-results/ansi/higher-order-functions.sql.out       | 1 -
 .../resources/sql-tests/analyzer-results/higher-order-functions.sql.out  | 1 -
 .../test/resources/sql-tests/results/ansi/higher-order-functions.sql.out | 1 -
 .../src/test/resources/sql-tests/results/higher-order-functions.sql.out  | 1 -
 4 files changed, 4 deletions(-)

diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/higher-order-functions.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/higher-order-functions.sql.out
index 3fafb9858e5a..8fe6e7097e67 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/higher-order-functions.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/higher-order-functions.sql.out
@@ -40,7 +40,6 @@ select ceil(x -> x) as v
 org.apache.spark.sql.AnalysisException
 {
   "errorClass" : "INVALID_LAMBDA_FUNCTION_CALL.NON_HIGHER_ORDER_FUNCTION",
-  "sqlState" : "42K0D",
   "messageParameters" : {
     "class" : "org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$"
   },

diff --git a/sql/core/src/test/resources/sql-tests/analyzer-results/higher-order-functions.sql.out b/sql/core/src/test/resources/sql-tests/analyzer-results/higher-order-functions.sql.out
index d9e88ac618aa..d85101986078 100644
--- a/sql/core/src/test/resources/sql-tests/analyzer-results/higher-order-functions.sql.out
+++ b/sql/core/src/test/resources/sql-tests/analyzer-results/higher-order-functions.sql.out
@@ -40,7 +40,6 @@ select ceil(x -> x) as v
 org.apache.spark.sql.AnalysisException
 {
   "errorClass" : "INVALID_LAMBDA_FUNCTION_CALL.NON_HIGHER_ORDER_FUNCTION",
-  "sqlState" : "42K0D",
   "messageParameters" : {
     "class" : "org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$"
   },

diff --git a/sql/core/src/test/resources/sql-tests/results/ansi/higher-order-functions.sql.out b/sql/core/src/test/resources/sql-tests/results/ansi/higher-order-functions.sql.out
index eb9c454109f0..dceb370c8388 100644
--- a/sql/core/src/test/resources/sql-tests/results/ansi/higher-order-functions.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/ansi/higher-order-functions.sql.out
@@ -40,7 +40,6 @@ struct<>
 org.apache.spark.sql.AnalysisException
 {
   "errorClass" : "INVALID_LAMBDA_FUNCTION_CALL.NON_HIGHER_ORDER_FUNCTION",
-  "sqlState" : "42K0D",
   "messageParameters" : {
     "class" : "org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$"
   },

diff --git a/sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out b/sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out
index eb9c454109f0..dceb370c8388 100644
--- a/sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out
@@ -40,7 +40,6 @@ struct<>
 org.apache.spark.sql.AnalysisException
 {
   "errorClass" : "INVALID_LAMBDA_FUNCTION_CALL.NON_HIGHER_ORDER_FUNCTION",
-  "sqlState" : "42K0D",
   "messageParameters" : {
     "class" : "org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$"
   },
svn commit: r69065 - /dev/spark/v4.0.0-preview1-rc1-bin/
Author: wenchen Date: Thu May 9 16:31:11 2024 New Revision: 69065 Log: Apache Spark v4.0.0-preview1-rc1 Added: dev/spark/v4.0.0-preview1-rc1-bin/ dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz (with props) dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512 dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz (with props) dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.asc dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.sha512 dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz (with props) dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.asc dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.sha512 dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz (with props) dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.asc dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.sha512 dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz (with props) dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz.asc dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz.sha512 Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz == Binary file - no diff available. Propchange: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz -- svn:mime-type = application/octet-stream Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc == --- dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc (added) +++ dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc Thu May 9 16:31:11 2024 @@ -0,0 +1,17 @@ +-BEGIN PGP SIGNATURE- + +iQJHBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmY8+e4THHdlbmNoZW5A +YXBhY2hlLm9yZwAKCRBNZiCEPNh/Wv78D/9aNsBANuVpIjYr+XkWYaimRLJ5IT0Z +qKehjJBuMBDaBMMN3iWconDHBiASQT0FTYGDBeYI72fLFSMKBna5+Lu22+KD/K6h +V8SZxPSQsAHQABYq9ha++XXyo1Vo+msPQ0pQAblmTrSpsvSWZmC8spzb5GbKYvK5 +kxr4Qt1XnHeGNJNToqGlbl/Hc2Etg5PkPBxMPBWMh7kLknMEscMNUf87JqCIa8LG +hMid/0lrrevEm8gkuu0ol9Vgz4P+dreKE9eCfmWOXCod04y8tJnVPs83wUOZfmKV +dHkELaMVwz3fa40QP77gK38K5i22aUgYk6dvhB+OgtatZ5tk0Dxp3AI2OObngEUm +4cGmQLwcses53vApwkExq427gS8td4sTE2G1D4+hSdEcm8Fj69w4Ado/DlIAHZob +KLV15qtNOyaIapT4GxBqoeqsw7tnRmxiP8K8UxFcPV/vZC1yQKIIULigPjttZKoW ++REE2N7ZyPvbvgItwjAL8hpCeYEkd7RDa7ofHAv6icC1qSsJZ9gxFM4rJvriI4g2 +tnYEvZduGpBunhlwVb0R3kAF5XoLIZQ5qm6kyWAzioc0gxzYVc3Rd+bXjm+vmopt +bXHOM6N2lLQwqnWlHsyjGVFugrkkRXZbQbIV6FynXpKaz5YtkUhUMkofz7mOYhBi ++1Z8nZ04B6YLbw== +=85FX +-END PGP SIGNATURE- Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512 == --- dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512 (added) +++ dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512 Thu May 9 16:31:11 2024 @@ -0,0 +1 @@ +2509cf6473495b0cd5c132d87f5e1c33593fa7375ca01bcab1483093cea92bdb6ad7afc7c72095376b28fc5acdc71bb323935d17513f33ee5276c6991ff668d1 pyspark-4.0.0.dev1.tar.gz Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz == Binary file - no diff available. 
Propchange: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz -- svn:mime-type = application/octet-stream Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.asc == --- dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.asc (added) +++ dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.asc Thu May 9 16:31:11 2024 @@ -0,0 +1,17 @@ +-BEGIN PGP SIGNATURE- + +iQJGBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmY8+fATHHdlbmNoZW5A +YXBhY2hlLm9yZwAKCRBNZiCEPNh/WoCMD/iZjkaGTUqt3jkIjWIUzpQo+kLn8//m +f+hwUtAguXvbMJXwBOz/Q/f+KvGk0tutsbd6rmBB6cHjH4GoZPp1x6iBitFAO47r +kHy/0xYkb70SPQCWIGQQpRv3g0uxTmpqL9H4YcIvexkV2wXG5VSwGvbSI4596n7l +x7M3rRmFzrxhcNIYLQdhNuat0mwuJFWe6R7Zk7UYFFishn9dNt8EOYx8vsGAuMP8 +Uy3+7oZQOAGqdQGSL7Ev4Pqve7MrrPgGXaixGukXibi707NCURnHTDcenPfoEEiQ +Hj83I3G+JrRhtsue/103a/GnHheUgwE8oEkefnUX7qC5tSn4T8lI2KpDBv9AL1pm +Bv0eXf5X5xEM4wvO7DCgbeEDPLg72jjt9X8zjAYx05HddvTuPjeKEL+Ga6G0ueTz +HRXHrgd1EFZ1znPZhWiSTmeqZTXdrb6wKTYt8Y6mk1oEGL3b0qE2LNkSED+4l40u +41MlV3pmZyjRGYZl29XZKf4isKYyjec7UbJSM5ok4zCRF0p8Gvj0EihGS4X6rYpW +9XxwjViKMIp7DCEcWjWpO6pJ8Ygb2Snh1UTFFgtzSVAoMqUgHnBHejJ4RA4ncHu6
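Each staged artifact above is accompanied by a detached PGP signature (`.asc`) and a SHA-512 checksum (`.sha512`). A minimal Scala sketch of the checksum side of verification, assuming the tarball has been downloaded to the working directory; the expected digest is the value published above for `pyspark-4.0.0.dev1.tar.gz`:

```scala
import java.nio.file.{Files, Paths}
import java.security.MessageDigest

// Hash the downloaded artifact and compare with the published .sha512 value.
// Reading the whole file into memory is fine for a sketch; stream it for very large files.
val artifact = Paths.get("pyspark-4.0.0.dev1.tar.gz") // assumed local download path
val expected = "2509cf6473495b0cd5c132d87f5e1c33593fa7375ca01bcab1483093cea92bdb" +
  "6ad7afc7c72095376b28fc5acdc71bb323935d17513f33ee5276c6991ff668d1"
val actual = MessageDigest.getInstance("SHA-512")
  .digest(Files.readAllBytes(artifact))
  .map("%02x".format(_)).mkString
assert(actual == expected, "SHA-512 mismatch: do not use this artifact")
```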
(spark) branch master updated: [SPARK-48216][TESTS] Remove overrides DockerJDBCIntegrationSuite.connectionTimeout to make related tests configurable
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e1fb1d7e063a [SPARK-48216][TESTS] Remove overrides DockerJDBCIntegrationSuite.connectionTimeout to make related tests configurable e1fb1d7e063a is described below commit e1fb1d7e063af7e8eb6e992c800902aff6e19e15 Author: Kent Yao AuthorDate: Thu May 9 08:37:07 2024 -0700 [SPARK-48216][TESTS] Remove overrides DockerJDBCIntegrationSuite.connectionTimeout to make related tests configurable ### What changes were proposed in this pull request? This PR removes the overrides of DockerJDBCIntegrationSuite.connectionTimeout so that the related tests become configurable. ### Why are the changes needed? The database Docker containers sometimes need more time to bootstrap, so the timeout shall be configurable to avoid failures like: ```scala [info] org.apache.spark.sql.jdbc.DB2IntegrationSuite *** ABORTED *** (3 minutes, 11 seconds) [info] The code passed to eventually never returned normally. Attempted 96 times over 3.00399815763 minutes. Last failure message: [jcc][t4][2030][11211][4.33.31] A communication error occurred during operations on the connection's underlying socket, socket input stream, [info] or socket output stream. Error location: Reply.fill() - insufficient data (-1). Message: Insufficient data. ERRORCODE=-4499, SQLSTATE=08001. (DockerJDBCIntegrationSuite.scala:215) [info] org.scalatest.exceptions.TestFailedDueToTimeoutException: [info] at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219) [info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226) [info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313) [info] at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passing GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46505 from yaooqinn/SPARK-48216. 
Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala| 4 .../test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala | 3 --- .../test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala | 4 .../test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala | 3 --- .../org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala| 4 .../scala/org/apache/spark/sql/jdbc/v2/MySQLIntegrationSuite.scala| 4 .../scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala | 4 7 files changed, 26 deletions(-) diff --git a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala index aca174cce194..4ece4d2088f4 100644 --- a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala +++ b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala @@ -21,8 +21,6 @@ import java.math.BigDecimal import java.sql.{Connection, Date, Timestamp} import java.util.Properties -import org.scalatest.time.SpanSugar._ - import org.apache.spark.sql.{Row, SaveMode} import org.apache.spark.sql.catalyst.util.DateTimeTestUtils._ import org.apache.spark.sql.internal.SQLConf @@ -41,8 +39,6 @@ import org.apache.spark.tags.DockerTest class DB2IntegrationSuite extends DockerJDBCIntegrationSuite { override val db = new DB2DatabaseOnDocker - override val connectionTimeout = timeout(3.minutes) - override def dataPreparation(conn: Connection): Unit = { conn.prepareStatement("CREATE TABLE tbl (x INTEGER, y VARCHAR(8))").executeUpdate() conn.prepareStatement("INSERT INTO tbl VALUES (42,'fred')").executeUpdate() diff --git a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala index abb683c06495..4899de2b2a14 100644 --- a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala +++ b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala @@ -24,7 +24,6 @@ import javax.security.auth.login.Configuration import com.github.dockerjava.api.model.{AccessMode, Bind, ContainerConfig, HostConfig, Volume} import org.apache.hadoop.security.{SecurityUtil, UserGroupInformation} import
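With the per-suite `override val connectionTimeout = timeout(3.minutes)` removed, a single environment-tunable value in the base suite can drive all Docker JDBC suites. A minimal sketch of that shape, using a hypothetical system property name (the actual knob lives in `DockerJDBCIntegrationSuite`):

```scala
import org.scalatest.concurrent.Eventually
import org.scalatest.time.SpanSugar._

trait ConfigurableDockerTimeout extends Eventually {
  // Hypothetical property name, for illustration only: one tunable value
  // replaces the hard-coded per-suite overrides removed by this change.
  private val timeoutMinutes =
    sys.props.getOrElse("test.docker.connectionTimeoutMinutes", "5").toInt

  val connectionTimeout = timeout(timeoutMinutes.minutes)
}
```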
(spark) branch master updated: [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions (for UTF8_BINARY & LCASE)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 21333f8c1fc0 [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions (for UTF8_BINARY & LCASE) 21333f8c1fc0 is described below commit 21333f8c1fc01756e6708ad6ccf21f585fcb881d Author: David Milicevic AuthorDate: Thu May 9 23:05:20 2024 +0800 [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions (for UTF8_BINARY & LCASE) Recreating [original PR](https://github.com/apache/spark/pull/45749) because code has been reorganized in [this PR](https://github.com/apache/spark/pull/45978). ### What changes were proposed in this pull request? This PR adds support for collations to the StringTrim family of functions/expressions, specifically: - `StringTrim` - `StringTrimBoth` - `StringTrimLeft` - `StringTrimRight` Changes: - `CollationSupport.java` - Add new `StringTrim`, `StringTrimLeft` and `StringTrimRight` classes with corresponding logic. - `CollationAwareUTF8String` - Add new `trim`, `trimLeft` and `trimRight` methods that actually implement the trim logic. - `UTF8String.java` - Expose some of the methods publicly. - `stringExpressions.scala` - Change input types. - Change eval and code gen logic. - `CollationTypeCasts.scala` - Add `StringTrim*` expressions to `CollationTypeCasts` rules. ### Why are the changes needed? We are incrementally adding collation support to the built-in string functions in Spark. ### Does this PR introduce _any_ user-facing change? Yes: - Users should now be able to use non-default collations in string trim functions. ### How was this patch tested? Already existing tests + new unit/e2e tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46206 from davidm-db/string-trim-functions. Authored-by: David Milicevic Signed-off-by: Wenchen Fan --- .../catalyst/util/CollationAwareUTF8String.java| 470 ++ .../spark/sql/catalyst/util/CollationSupport.java | 534 - .../org/apache/spark/unsafe/types/UTF8String.java | 2 +- .../spark/unsafe/types/CollationSupportSuite.java | 193 .../sql/catalyst/analysis/CollationTypeCasts.scala | 2 +- .../catalyst/expressions/stringExpressions.scala | 53 +- .../sql/CollationStringExpressionsSuite.scala | 161 ++- 7 files changed, 1054 insertions(+), 361 deletions(-) diff --git a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java new file mode 100644 index ..ee0d611d7e65 --- /dev/null +++ b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java @@ -0,0 +1,470 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.util; + +import com.ibm.icu.lang.UCharacter; +import com.ibm.icu.text.BreakIterator; +import com.ibm.icu.text.StringSearch; +import com.ibm.icu.util.ULocale; + +import org.apache.spark.unsafe.UTF8StringBuilder; +import org.apache.spark.unsafe.types.UTF8String; + +import static org.apache.spark.unsafe.Platform.BYTE_ARRAY_OFFSET; +import static org.apache.spark.unsafe.Platform.copyMemory; + +import java.util.HashMap; +import java.util.Map; + +/** + * Utility class for collation-aware UTF8String operations. + */ +public class CollationAwareUTF8String { + public static UTF8String replace(final UTF8String src, final UTF8String search, + final UTF8String replace, final int collationId) { +// This collation aware implementation is based on existing implementation on UTF8String +if (src.numBytes() == 0 || search.numBytes() == 0) { + return src; +} + +StringSearch stringSearch = CollationFactory.getStringSearch(src, search, collationId); + +// Find the
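For intuition about the lowercase-collation semantics the new `trim*` methods implement: under the LCASE collation a trim character matches case-insensitively. A minimal sketch on plain `String`s, assuming one code point per `Char` for brevity; the real implementation operates directly on `UTF8String` bytes:

```scala
// LCASE-style left trim: drop leading characters whose lowercase form is in
// the lowercased trim set, regardless of the case they appear in.
def trimLeftLcase(src: String, trimChars: String): String = {
  val trimSet = trimChars.toLowerCase.toSet
  src.dropWhile(c => trimSet.contains(c.toLower))
}

trimLeftLcase("xXabcXx", "x") // "abcXx": both 'x' and 'X' are trimmed on the left
```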
(spark) branch master updated: [SPARK-47803][FOLLOWUP] Check nulls when casting nested type to variant
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3fd38d4c07f6 [SPARK-47803][FOLLOWUP] Check nulls when casting nested type to variant 3fd38d4c07f6 is described below commit 3fd38d4c07f6c998ec8bb234796f83a6aecfc0d2 Author: Chenhao Li AuthorDate: Thu May 9 22:45:10 2024 +0800 [SPARK-47803][FOLLOWUP] Check nulls when casting nested type to variant ### What changes were proposed in this pull request? It adds null checks when accessing a nested element while casting a nested type to variant. This is necessary because the `get` API doesn't guarantee returning null when the slot is null. For example, `ColumnarArray.get` may return the default value of a primitive type if the slot is null. ### Why are the changes needed? It is a bug fix that is necessary for the cast-to-variant expression to work correctly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Two new unit tests. One directly uses `ColumnarArray` as the input of the cast. The other creates a real-world situation where `ColumnarArray` is the input of the cast (scan). Both of them would fail without the code change in this PR. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46486 from chenhao-db/fix_cast_nested_to_variant. Authored-by: Chenhao Li Signed-off-by: Wenchen Fan --- .../variant/VariantExpressionEvalUtils.scala | 9 -- .../apache/spark/sql/VariantEndToEndSuite.scala| 33 -- 2 files changed, 37 insertions(+), 5 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala index eb235eb854e0..f7f7097173bb 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala @@ -103,7 +103,8 @@ object VariantExpressionEvalUtils { val offsets = new java.util.ArrayList[java.lang.Integer](data.numElements()) for (i <- 0 until data.numElements()) { offsets.add(builder.getWritePos - start) - buildVariant(builder, data.get(i, elementType), elementType) + val element = if (data.isNullAt(i)) null else data.get(i, elementType) + buildVariant(builder, element, elementType) } builder.finishWritingArray(start, offsets) case MapType(StringType, valueType, _) => @@ -116,7 +117,8 @@ object VariantExpressionEvalUtils { val key = keys.getUTF8String(i).toString val id = builder.addKey(key) fields.add(new VariantBuilder.FieldEntry(key, id, builder.getWritePos - start)) - buildVariant(builder, values.get(i, valueType), valueType) + val value = if (values.isNullAt(i)) null else values.get(i, valueType) + buildVariant(builder, value, valueType) } builder.finishWritingObject(start, fields) case StructType(structFields) => @@ -127,7 +129,8 @@ object VariantExpressionEvalUtils { val key = structFields(i).name val id = builder.addKey(key) fields.add(new VariantBuilder.FieldEntry(key, id, builder.getWritePos - start)) - buildVariant(builder, data.get(i, structFields(i).dataType), structFields(i).dataType) + val value = if (data.isNullAt(i)) null else data.get(i, structFields(i).dataType) + buildVariant(builder, value, structFields(i).dataType) } 
builder.finishWritingObject(start, fields) } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala index 3964bf3aedec..53be9d50d351 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala @@ -16,11 +16,13 @@ */ package org.apache.spark.sql -import org.apache.spark.sql.catalyst.expressions.{CreateArray, CreateNamedStruct, JsonToStructs, Literal, StructsToJson} +import org.apache.spark.sql.catalyst.expressions.{Cast, CreateArray, CreateNamedStruct, JsonToStructs, Literal, StructsToJson} import org.apache.spark.sql.catalyst.expressions.variant.ParseJson import org.apache.spark.sql.execution.WholeStageCodegenExec +import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector import org.apache.spark.sql.test.SharedSparkSession -import org.apache.spark.sql.types.VariantType
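Condensed, the guard the patch applies at each of the three call sites above is:

```scala
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.types.DataType

// Check nullability before calling `get`: columnar accessors such as
// ColumnarArray.get may return a primitive default (e.g. 0) for a null slot.
def safeElement(data: ArrayData, i: Int, elementType: DataType): Any =
  if (data.isNullAt(i)) null else data.get(i, elementType)
```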
(spark) branch master updated: [SPARK-48211][SQL] DB2: Read SMALLINT as ShortType
This is an automated email from the ASF dual-hosted git repository. yao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 207d675110e6 [SPARK-48211][SQL] DB2: Read SMALLINT as ShortType 207d675110e6 is described below commit 207d675110e6fa699a434e81296f6f050eb0304b Author: Kent Yao AuthorDate: Thu May 9 17:27:04 2024 +0800 [SPARK-48211][SQL] DB2: Read SMALLINT as ShortType ### What changes were proposed in this pull request? This PR supports reading SMALLINT from DB2 as ShortType. ### Why are the changes needed? - 15 bits is sufficient - we write ShortType as SMALLINT - we read SMALLINT from other built-in JDBC sources as ShortType ### Does this PR introduce _any_ user-facing change? Yes, we add a migration guide for this. ### How was this patch tested? Changed tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes #46497 from yaooqinn/SPARK-48211. Authored-by: Kent Yao Signed-off-by: Kent Yao --- .../spark/sql/jdbc/DB2IntegrationSuite.scala | 69 +- docs/sql-migration-guide.md| 1 + .../org/apache/spark/sql/internal/SQLConf.scala| 11 .../org/apache/spark/sql/jdbc/DB2Dialect.scala | 3 + 4 files changed, 56 insertions(+), 28 deletions(-) diff --git a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala index cedb33d491fb..aca174cce194 100644 --- a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala +++ b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala @@ -25,6 +25,7 @@ import org.scalatest.time.SpanSugar._ import org.apache.spark.sql.{Row, SaveMode} import org.apache.spark.sql.catalyst.util.DateTimeTestUtils._ +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.{BooleanType, ByteType, ShortType, StructType} import org.apache.spark.tags.DockerTest @@ -77,32 +78,44 @@ class DB2IntegrationSuite extends DockerJDBCIntegrationSuite { } test("Numeric types") { -val df = sqlContext.read.jdbc(jdbcUrl, "numbers", new Properties) -val rows = df.collect() -assert(rows.length == 1) -val types = rows(0).toSeq.map(x => x.getClass.toString) -assert(types.length == 10) -assert(types(0).equals("class java.lang.Integer")) -assert(types(1).equals("class java.lang.Integer")) -assert(types(2).equals("class java.lang.Long")) -assert(types(3).equals("class java.math.BigDecimal")) -assert(types(4).equals("class java.lang.Double")) -assert(types(5).equals("class java.lang.Double")) -assert(types(6).equals("class java.lang.Float")) -assert(types(7).equals("class java.math.BigDecimal")) -assert(types(8).equals("class java.math.BigDecimal")) -assert(types(9).equals("class java.math.BigDecimal")) -assert(rows(0).getInt(0) == 17) -assert(rows(0).getInt(1) == 7) -assert(rows(0).getLong(2) == 922337203685477580L) -val bd = new BigDecimal("123456745.567890123450") -assert(rows(0).getAs[BigDecimal](3).equals(bd)) -assert(rows(0).getDouble(4) == 42.75) -assert(rows(0).getDouble(5) == 5.4E-70) -assert(rows(0).getFloat(6) == 3.4028234663852886e+38) -assert(rows(0).getDecimal(7) == new BigDecimal("4.299900")) -assert(rows(0).getDecimal(8) == new BigDecimal(".00")) -assert(rows(0).getDecimal(9) == new BigDecimal("1234567891234567.123456789123456789")) +Seq(true, false).foreach { legacy => + 
withSQLConf(SQLConf.LEGACY_DB2_TIMESTAMP_MAPPING_ENABLED.key -> legacy.toString) { +val df = sqlContext.read.jdbc(jdbcUrl, "numbers", new Properties) +val rows = df.collect() +assert(rows.length == 1) +val types = rows(0).toSeq.map(x => x.getClass.toString) +assert(types.length == 10) +if (legacy) { + assert(types(0).equals("class java.lang.Integer")) +} else { + assert(types(0).equals("class java.lang.Short")) +} +assert(types(1).equals("class java.lang.Integer")) +assert(types(2).equals("class java.lang.Long")) +assert(types(3).equals("class java.math.BigDecimal")) +assert(types(4).equals("class java.lang.Double")) +assert(types(5).equals("class java.lang.Double")) +assert(types(6).equals("class java.lang.Float")) +assert(types(7).equals("class java.math.BigDecimal")) +assert(types(8).equals("class java.math.BigDecimal")) +assert(types(9).equals("class java.math.BigDecimal")) +if
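The dialect-side mapping has roughly this shape (a sketch, not the exact `DB2Dialect` code; the legacy-flag name is illustrative):

```scala
import java.sql.Types
import org.apache.spark.sql.types.{DataType, IntegerType, ShortType}

// Map DB2 SMALLINT to ShortType by default; keep the old IntegerType mapping
// behind a legacy flag so existing jobs can opt back into the old behavior.
def db2CatalystType(sqlType: Int, legacyNumberMapping: Boolean): Option[DataType] =
  sqlType match {
    case Types.SMALLINT =>
      if (legacyNumberMapping) Some(IntegerType) else Some(ShortType)
    case _ => None
  }
```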
(spark) branch master updated (ecca1bf6453e -> 9e62dbad6a75)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from ecca1bf6453e [SPARK-47365][PYTHON] Add toArrow() DataFrame method to PySpark add 9e62dbad6a75 [SPARK-48212][PYTHON][CONNECT][TESTS] Fully enable `PandasUDFParityTests.test_udf_wrong_arg` No new revisions were added by this update. Summary of changes: python/pyspark/sql/tests/connect/test_parity_pandas_udf.py | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
(spark) branch master updated: [SPARK-47365][PYTHON] Add toArrow() DataFrame method to PySpark
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ecca1bf6453e [SPARK-47365][PYTHON] Add toArrow() DataFrame method to PySpark ecca1bf6453e is described below commit ecca1bf6453e5e0042e1b56d4c35fb0b4d0f3121 Author: Ian Cook AuthorDate: Thu May 9 17:25:34 2024 +0900 [SPARK-47365][PYTHON] Add toArrow() DataFrame method to PySpark ### What changes were proposed in this pull request? - Add a PySpark DataFrame method `toArrow()` which returns the contents of the DataFrame as a [PyArrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html), for both local Spark and Spark Connect. - Add a new entry to the **Apache Arrow in PySpark** user guide page describing usage of the `toArrow()` method. - Add a new option to the method `_collect_as_arrow()` to provide more useful output when there are zero records returned. (This keeps the implementation of `toArrow()` simpler.) ### Why are the changes needed? In the Apache Arrow community, we hear from a lot of users who want to return the contents of a PySpark DataFrame as a PyArrow Table. Currently the only documented way to do this is to return the contents as a pandas DataFrame, then use PyArrow (`pa`) to convert that to a PyArrow Table. ```py pa.Table.from_pandas(df.toPandas()) ``` But going through pandas adds significant overhead, which is easily avoided since internally `toPandas()` already converts the contents of the Spark DataFrame to Arrow format as an intermediate step when `spark.sql.execution.arrow.pyspark.enabled` is `true`. Currently it is also possible to use the experimental `_collect_as_arrow()` method to return the contents of a PySpark DataFrame as a list of PyArrow RecordBatches. This PR adds a new non-experimental method `toArrow()` which returns the more user-friendly PyArrow Table object. This PR also adds a new argument `empty_list_if_zero_records` to the experimental method `_collect_as_arrow()` to control what the method returns in the case when the result data has zero rows. If set to `True` (the default), the existing behavior is preserved, and the method returns an empty Python list. If set to `False`, the method returns a length-one list containing an empty Arrow RecordBatch which includes the schema. This is used by `toArrow()` which requires the schema [...] For Spark Connect, there is already a `SparkSession.client.to_table()` method that returns a PyArrow table. This PR uses that to expose `toArrow()` for Spark Connect. ### Does this PR introduce _any_ user-facing change? - It adds a DataFrame method `toArrow()` to the PySpark SQL DataFrame API. - It adds a new argument `empty_list_if_zero_records` to the experimental DataFrame method `_collect_as_arrow()` with a default value which preserves the method's existing behavior. - It exposes `toArrow()` for Spark Connect, via the existing `SparkSession.client.to_table()` method. - It does not introduce any other user-facing changes. ### How was this patch tested? This adds a new test and a new helper function for the test in `pyspark/sql/tests/test_arrow.py`. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45481 from ianmcook/SPARK-47365. 
Lead-authored-by: Ian Cook Co-authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- examples/src/main/python/sql/arrow.py | 18 .../source/reference/pyspark.sql/dataframe.rst | 1 + python/docs/source/user_guide/sql/arrow_pandas.rst | 49 +++--- python/pyspark/sql/classic/dataframe.py| 4 ++ python/pyspark/sql/connect/dataframe.py| 4 ++ python/pyspark/sql/dataframe.py| 30 + python/pyspark/sql/pandas/conversion.py| 48 +++-- python/pyspark/sql/tests/test_arrow.py | 35 8 files changed, 169 insertions(+), 20 deletions(-) diff --git a/examples/src/main/python/sql/arrow.py b/examples/src/main/python/sql/arrow.py index 03daf18eadbf..48aee48d929c 100644 --- a/examples/src/main/python/sql/arrow.py +++ b/examples/src/main/python/sql/arrow.py @@ -33,6 +33,22 @@ require_minimum_pandas_version() require_minimum_pyarrow_version() +def dataframe_to_arrow_table_example(spark: SparkSession) -> None: +import pyarrow as pa # noqa: F401 +from pyspark.sql.functions import rand + +# Create a Spark DataFrame +df = spark.range(100).drop("id").withColumns({"0": rand(), "1": rand(), "2": rand()}) + +# Convert the Spark DataFrame to a PyArrow Table +table = df.select("*").toArrow() + +print(table.schema) +# 0: double not null +
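Usage, side by side with the pandas round-trip it replaces (a sketch; `df` stands for any PySpark DataFrame):

```python
import pyarrow as pa

# Old, documented route: materialize as pandas first, then convert. This pays
# the pandas conversion cost even though the data was already in Arrow format.
table_via_pandas = pa.Table.from_pandas(df.toPandas())

# New route: Arrow end to end, no pandas in the middle.
table = df.toArrow()

assert isinstance(table, pa.Table)
```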
(spark) branch master updated (34ee0d8414b2 -> 027327d94b34)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 34ee0d8414b2 [SPARK-47421][SQL] Add collation support for URL expressions add 027327d94b34 [SPARK-47986][CONNECT][PYTHON] Unable to create a new session when the default session is closed by the server No new revisions were added by this update. Summary of changes: python/pyspark/sql/connect/client/core.py | 1 + python/pyspark/sql/connect/session.py | 20 .../sql/tests/connect/test_connect_session.py | 16 - python/pyspark/sql/tests/connect/test_session.py | 28 ++ 4 files changed, 59 insertions(+), 6 deletions(-)
(spark) branch master updated (045ec6a166c8 -> 34ee0d8414b2)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 045ec6a166c8 [SPARK-48208][SS] Skip providing memory usage metrics from RocksDB if bounded memory usage is enabled add 34ee0d8414b2 [SPARK-47421][SQL] Add collation support for URL expressions No new revisions were added by this update. Summary of changes: .../explain-results/function_url_decode.explain| 2 +- .../explain-results/function_url_encode.explain| 2 +- .../sql/catalyst/expressions/urlExpressions.scala | 19 ++-- .../spark/sql/CollationSQLExpressionsSuite.scala | 103 + 4 files changed, 115 insertions(+), 11 deletions(-)
(spark) branch master updated: [SPARK-48208][SS] Skip providing memory usage metrics from RocksDB if bounded memory usage is enabled
This is an automated email from the ASF dual-hosted git repository. kabhwan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 045ec6a166c8 [SPARK-48208][SS] Skip providing memory usage metrics from RocksDB if bounded memory usage is enabled 045ec6a166c8 is described below commit 045ec6a166c8d2bdf73585fc4160c136e5f2888a Author: Anish Shrigondekar AuthorDate: Thu May 9 17:10:01 2024 +0900 [SPARK-48208][SS] Skip providing memory usage metrics from RocksDB if bounded memory usage is enabled ### What changes were proposed in this pull request? Skip providing memory usage metrics from RocksDB if bounded memory usage is enabled. ### Why are the changes needed? Without this, we report at the partition level a memory usage figure that is actually the maximum usage per node. For example, if we report this: ``` "allRemovalsTimeMs" : 93, "commitTimeMs" : 32240, "memoryUsedBytes" : 15956211724278, "numRowsDroppedByWatermark" : 0, "numShufflePartitions" : 200, "numStateStoreInstances" : 200, ``` We have 200 partitions in this case. So the memory usage per partition / state store would be ~78GB. However, this node has 256GB memory total and we have 2 such nodes. We have configured our cluster to use 30% of available memory on each node for RocksDB, which is ~77GB. So the memory being reported here is actually per node rather than per partition, which could be confusing for users. ### Does this PR introduce _any_ user-facing change? No - only a metrics reporting change ### How was this patch tested? Added unit tests ``` [info] Run completed in 10 seconds, 878 milliseconds. [info] Total number of tests run: 24 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 24, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes #46491 from anishshri-db/task/SPARK-48208. Authored-by: Anish Shrigondekar Signed-off-by: Jungtaek Lim --- .../apache/spark/sql/execution/streaming/state/RocksDB.scala | 11 ++- .../spark/sql/execution/streaming/state/RocksDBSuite.scala| 11 +++ 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala index caecf817c12f..151695192281 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala @@ -777,10 +777,19 @@ class RocksDB( .keys.filter(checkInternalColumnFamilies(_)).size val numExternalColFamilies = colFamilyNameToHandleMap.keys.size - numInternalColFamilies +// if bounded memory usage is enabled, we share the block cache across all state providers +// running on the same node and account the usage to this single cache. In this case, its not +// possible to provide partition level or query level memory usage. 
+val memoryUsage = if (conf.boundedMemoryUsage) { + 0L +} else { + readerMemUsage + memTableMemUsage + blockCacheUsage +} + RocksDBMetrics( numKeysOnLoadedVersion, numKeysOnWritingVersion, - readerMemUsage + memTableMemUsage + blockCacheUsage, + memoryUsage, pinnedBlocksMemUsage, totalSSTFilesBytes, nativeOpsLatencyMicros, diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala index ab2afa1b8a61..6086fd43846f 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala @@ -1699,6 +1699,11 @@ class RocksDBSuite extends AlsoTestWithChangelogCheckpointingEnabled with Shared db.load(0) db.put("a", "1") db.commit() + if (boundedMemoryUsage == "true") { +assert(db.metricsOpt.get.totalMemUsageBytes === 0) + } else { +assert(db.metricsOpt.get.totalMemUsageBytes > 0) + } db.getWriteBufferManagerAndCache() } @@ -1709,6 +1714,11 @@ class RocksDBSuite extends AlsoTestWithChangelogCheckpointingEnabled with Shared db.load(0) db.put("a", "1") db.commit() + if (boundedMemoryUsage == "true") { +assert(db.metricsOpt.get.totalMemUsageBytes === 0) + } else { +
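Condensed, the reporting rule the patch introduces: with bounded memory usage, the block cache and write buffer manager are shared node-wide, so a per-store number is not attributable, and the metric is zeroed rather than repeating the node-wide total in every store.

```scala
// Per-store memory metric under the two modes, mirroring the diff above.
def reportedMemoryUsage(
    boundedMemoryUsage: Boolean,
    readerMemUsage: Long,
    memTableMemUsage: Long,
    blockCacheUsage: Long): Long = {
  if (boundedMemoryUsage) 0L // shared node-wide cache: per-store usage unknowable
  else readerMemUsage + memTableMemUsage + blockCacheUsage
}
```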
(spark) branch master updated (a4ab82b8f340 -> 91da4ac25148)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from a4ab82b8f340 [SPARK-48186][SQL] Add support for AbstractMapType add 91da4ac25148 [SPARK-47354][SQL] Add collation support for variant expressions No new revisions were added by this update. Summary of changes: .../function_is_variant_null.explain | 2 +- .../explain-results/function_parse_json.explain| 2 +- .../function_schema_of_variant.explain | 2 +- .../function_schema_of_variant_agg.explain | 2 +- .../function_try_parse_json.explain| 2 +- .../function_try_variant_get.explain | 2 +- .../explain-results/function_variant_get.explain | 2 +- .../expressions/variant/variantExpressions.scala | 23 +- .../spark/sql/CollationSQLExpressionsSuite.scala | 293 - 9 files changed, 312 insertions(+), 18 deletions(-)
(spark) branch master updated (6cc3dc2ef4d2 -> a4ab82b8f340)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 6cc3dc2ef4d2 [SPARK-48169][SPARK-48143][SQL] Revert BadRecordException optimizations add a4ab82b8f340 [SPARK-48186][SQL] Add support for AbstractMapType No new revisions were added by this update. Summary of changes: ...stractArrayType.scala => AbstractMapType.scala} | 20 - .../spark/sql/catalyst/expressions/ExprUtils.scala | 9 +++--- .../sql/catalyst/expressions/jsonExpressions.scala | 5 ++-- .../spark/sql/CollationSQLExpressionsSuite.scala | 34 ++ 4 files changed, 55 insertions(+), 13 deletions(-) copy sql/api/src/main/scala/org/apache/spark/sql/internal/types/{AbstractArrayType.scala => AbstractMapType.scala} (58%)
(spark) branch master updated: [SPARK-48169][SPARK-48143][SQL] Revert BadRecordException optimizations
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 6cc3dc2ef4d2 [SPARK-48169][SPARK-48143][SQL] Revert BadRecordException optimizations 6cc3dc2ef4d2 is described below commit 6cc3dc2ef4d2ffbff7ffc400e723b97b462e1bab Author: Vladimir Golubev AuthorDate: Thu May 9 15:35:28 2024 +0800 [SPARK-48169][SPARK-48143][SQL] Revert BadRecordException optimizations ### What changes were proposed in this pull request? Revert BadRecordException optimizations for UnivocityParser, StaxXmlParser and JacksonParser. ### Why are the changes needed? To reduce the blast radius; this will be implemented differently. There were two PRs by me recently: - https://github.com/apache/spark/pull/46438 - https://github.com/apache/spark/pull/46400 which introduced optimizations to speed up control flow between UnivocityParser, StaxXmlParser and JacksonParser. However, these changes are quite unstable and may break any calling code that relies on the exception cause type, for example. Also, there may be some Spark plugins/extensions using that exception for user-facing errors. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? No Closes #46478 from vladimirg-db/vladimirg-db/revert-SPARK-48169-SPARK-48143. Authored-by: Vladimir Golubev Signed-off-by: Wenchen Fan --- .../spark/sql/catalyst/csv/UnivocityParser.scala | 8 .../spark/sql/catalyst/json/JacksonParser.scala| 13 ++-- .../sql/catalyst/util/BadRecordException.scala | 13 +++- .../sql/catalyst/util/FailureSafeParser.scala | 2 +- .../spark/sql/catalyst/xml/StaxXmlParser.scala | 23 +++--- 5 files changed, 26 insertions(+), 33 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala index 8d06789a7512..a5158d8a22c6 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala @@ -316,17 +316,17 @@ class UnivocityParser( throw BadRecordException( () => getCurrentInput, () => Array.empty, -() => QueryExecutionErrors.malformedCSVRecordError("")) +QueryExecutionErrors.malformedCSVRecordError("")) } val currentInput = getCurrentInput -var badRecordException: Option[() => Throwable] = if (tokens.length != parsedSchema.length) { +var badRecordException: Option[Throwable] = if (tokens.length != parsedSchema.length) { // If the number of tokens doesn't match the schema, we should treat it as a malformed record. // However, we still have chance to parse some of the tokens. It continues to parses the // tokens normally and sets null when `ArrayIndexOutOfBoundsException` occurs for missing // tokens. 
- Some(() => QueryExecutionErrors.malformedCSVRecordError(currentInput.toString)) + Some(QueryExecutionErrors.malformedCSVRecordError(currentInput.toString)) } else None // When the length of the returned tokens is identical to the length of the parsed schema, // we just need to: @@ -348,7 +348,7 @@ class UnivocityParser( } catch { case e: SparkUpgradeException => throw e case NonFatal(e) => - badRecordException = badRecordException.orElse(Some(() => e)) + badRecordException = badRecordException.orElse(Some(e)) // Use the corresponding DEFAULT value associated with the column, if any. row.update(i, ResolveDefaultColumns.existenceDefaultValues(requiredSchema)(i)) } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala index 848c20ee36be..5e75ff6f6e1a 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala @@ -613,7 +613,7 @@ class JacksonParser( // JSON parser currently doesn't support partial results for corrupted records. // For such records, all fields other than the field configured by // `columnNameOfCorruptRecord` are set to `null`. -throw BadRecordException(() => recordLiteral(record), cause = () => e) +throw BadRecordException(() => recordLiteral(record), () => Array.empty, e) case e: CharConversionException if options.encoding.isEmpty =>
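The restored shape of the exception, sketched with simplified types (the real class uses `UTF8String` and `InternalRow`): the record and partial results stay lazy because they can be expensive to materialize, while the cause is eager again so that callers can keep matching on its concrete type.

```scala
// Simplified sketch of the eager-cause form this revert restores.
case class BadRecordException(
    record: () => String,                 // lazy: building the raw record is costly
    partialResults: () => Array[AnyRef],  // lazy: partial rows built on demand
    cause: Throwable)                     // eager: callers match on its concrete type
  extends Exception(cause)

// A caller that relies on the concrete cause type keeps working:
def describe(e: BadRecordException): String = e.cause match {
  case _: NumberFormatException => "malformed number"
  case other => other.getClass.getSimpleName
}
```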