(spark) branch master updated: [SPARK-46812][CONNECT][PYTHON][FOLLOW-UP] Add pyspark.sql.connect.resource into PyPI packaging
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 8c2c0fb43a9d [SPARK-46812][CONNECT][PYTHON][FOLLOW-UP] Add pyspark.sql.connect.resource into PyPI packaging

8c2c0fb43a9d is described below

commit 8c2c0fb43a9d3c5bffcf33aeb3354c01fe6b26cd
Author: Hyukjin Kwon
AuthorDate: Wed Apr 17 15:27:23 2024 +0900

    [SPARK-46812][CONNECT][PYTHON][FOLLOW-UP] Add pyspark.sql.connect.resource into PyPI packaging

    ### What changes were proposed in this pull request?
    This PR proposes to add `pyspark.sql.connect.resource` to PyPI packaging.

    ### Why are the changes needed?
    So that PyPI end users can download PySpark and leverage this feature.

    ### Does this PR introduce _any_ user-facing change?
    No, the main change has not been released.

    ### How was this patch tested?
    Being tested at https://github.com/apache/spark/pull/46090

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #46094 from HyukjinKwon/SPARK-46812-followup.
Authored-by: Hyukjin Kwon
Signed-off-by: Hyukjin Kwon
---
 python/packaging/classic/setup.py | 1 +
 python/packaging/connect/setup.py | 1 +
 2 files changed, 2 insertions(+)

diff --git a/python/packaging/classic/setup.py b/python/packaging/classic/setup.py
index 8eefc17db700..f900fa6e15ee 100755
--- a/python/packaging/classic/setup.py
+++ b/python/packaging/classic/setup.py
@@ -276,6 +276,7 @@ try:
         "pyspark.sql.connect.functions",
         "pyspark.sql.connect.proto",
         "pyspark.sql.connect.protobuf",
+        "pyspark.sql.connect.resource",
         "pyspark.sql.connect.shell",
         "pyspark.sql.connect.streaming",
         "pyspark.sql.connect.streaming.worker",
diff --git a/python/packaging/connect/setup.py b/python/packaging/connect/setup.py
index fe1e7486faa9..19925962804b 100755
--- a/python/packaging/connect/setup.py
+++ b/python/packaging/connect/setup.py
@@ -145,6 +145,7 @@ try:
         "pyspark.sql.connect.functions",
         "pyspark.sql.connect.proto",
         "pyspark.sql.connect.protobuf",
+        "pyspark.sql.connect.resource",
         "pyspark.sql.connect.shell",
         "pyspark.sql.connect.streaming",
         "pyspark.sql.connect.streaming.worker",

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
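Both `setup.py` files enumerate PySpark's subpackages by hand, so a new module such as `pyspark.sql.connect.resource` is silently left out of the published wheel until someone adds it, which is exactly what this follow-up does. A minimal stand-alone sketch (not Spark's actual build script) of how such an omission can be detected by comparing a hand-maintained list against `setuptools.find_packages` discovery:

```python
# Sketch: why an explicitly enumerated package list can miss a new subpackage.
# Builds a throwaway package tree, then compares a hand-maintained list
# (mimicking python/packaging/*/setup.py) against setuptools discovery.
import os
import tempfile

from setuptools import find_packages

with tempfile.TemporaryDirectory() as root:
    # Lay out pyspark/sql/connect/resource as importable packages.
    os.makedirs(os.path.join(root, "pyspark", "sql", "connect", "resource"))
    for part in ("pyspark", "pyspark/sql", "pyspark/sql/connect",
                 "pyspark/sql/connect/resource"):
        open(os.path.join(root, part, "__init__.py"), "w").close()

    discovered = set(find_packages(where=root))
    # A stale, hand-maintained list like the one the commit patches:
    declared = {"pyspark", "pyspark.sql", "pyspark.sql.connect"}

    missing = discovered - declared
    print(sorted(missing))  # subpackages that would be absent from the wheel
```

A comparison like this in CI is one way to catch the class of packaging bug this follow-up fixes.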
(spark) branch master updated: [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f77495909b29 [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework

f77495909b29 is described below

commit f77495909b29fe4883afcfd8fec7be048fe494a3
Author: Gengliang Wang
AuthorDate: Tue Apr 16 22:32:34 2024 -0700

    [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework

    ### What changes were proposed in this pull request?
    Migrate logInfo calls with variables in the Hive module to the structured logging framework.

    ### Why are the changes needed?
    To enhance Apache Spark's logging system by implementing structured logging.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    GA tests

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #46086 from gengliangwang/hive_loginfo.
Authored-by: Gengliang Wang
Signed-off-by: Gengliang Wang
---
 .../scala/org/apache/spark/internal/LogKey.scala |  4 +++
 .../spark/sql/hive/HiveExternalCatalog.scala     | 30 --
 .../spark/sql/hive/HiveMetastoreCatalog.scala    |  9 ---
 .../org/apache/spark/sql/hive/HiveUtils.scala    | 27 +++
 .../spark/sql/hive/client/HiveClientImpl.scala   |  5 ++--
 .../sql/hive/client/IsolatedClientLoader.scala   |  4 +--
 .../spark/sql/hive/orc/OrcFileOperator.scala     |  4 +--
 7 files changed, 48 insertions(+), 35 deletions(-)

diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala
index bfeb733af30a..838ef0355e3a 100644
--- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala
+++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala
@@ -95,10 +95,13 @@ object LogKey extends Enumeration {
   val GROUP_ID = Value
   val HADOOP_VERSION = Value
   val HISTORY_DIR = Value
+  val HIVE_CLIENT_VERSION = Value
+  val HIVE_METASTORE_VERSION = Value
   val HIVE_OPERATION_STATE = Value
   val HIVE_OPERATION_TYPE = Value
   val HOST = Value
   val HOST_PORT = Value
+  val INCOMPATIBLE_TYPES = Value
   val INDEX = Value
   val INFERENCE_MODE = Value
   val INITIAL_CAPACITY = Value
@@ -152,6 +155,7 @@ object LogKey extends Enumeration {
   val POLICY = Value
   val PORT = Value
   val PRODUCER_ID = Value
+  val PROVIDER = Value
   val QUERY_CACHE_VALUE = Value
   val QUERY_HINT = Value
   val QUERY_ID = Value
diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
index 8c35e10b383f..60f2d2f3e5fe 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
@@ -34,7 +34,7 @@ import org.apache.thrift.TException

 import org.apache.spark.{SparkConf, SparkException}
 import org.apache.spark.internal.{Logging, MDC}
-import org.apache.spark.internal.LogKey.{DATABASE_NAME, SCHEMA, SCHEMA2, TABLE_NAME}
+import org.apache.spark.internal.LogKey.{DATABASE_NAME, INCOMPATIBLE_TYPES, PROVIDER, SCHEMA, SCHEMA2, TABLE_NAME}
 import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.catalyst.TableIdentifier
 import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException
@@ -338,35 +338,37 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     val (hiveCompatibleTable, logMessage) = maybeSerde match {
       case _ if options.skipHiveMetadata =>
         val message =
-          s"Persisting data source table $qualifiedTableName into Hive metastore in" +
-            "Spark SQL specific format, which is NOT compatible with Hive."
+          log"Persisting data source table ${MDC(TABLE_NAME, qualifiedTableName)} into Hive " +
+            log"metastore in Spark SQL specific format, which is NOT compatible with Hive."
         (None, message)

       case _ if incompatibleTypes.nonEmpty =>
+        val incompatibleTypesStr = incompatibleTypes.mkString(", ")
         val message =
-          s"Hive incompatible types found: ${incompatibleTypes.mkString(", ")}. " +
-            s"Persisting data source table $qualifiedTableName into Hive metastore in " +
-            "Spark SQL specific format, which is NOT compatible with Hive."
+          log"Hive incompatible types found: ${MDC(INCOMPATIBLE_TYPES, incompatibleTypesStr)}. " +
+            log"Persisting data source table ${MDC(TABLE_NAME, qualifiedTableName)} into Hive " +
+            log"metastore in Spark SQL specific format, which is NOT compatible with Hive."
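For readers unfamiliar with the framework: each `MDC(KEY, value)` interpolation attaches a machine-readable key (a `LogKey` such as `TABLE_NAME`) to the rendered message, so log aggregators can filter on fields instead of regex-matching free-form strings. A rough Python analogue of that idea (not Spark's API) using the stdlib `logging` module's `extra` mechanism and a custom formatter:

```python
# Rough analogue of Spark's MDC-based structured logging, stdlib only.
# The keys (TABLE_NAME, INCOMPATIBLE_TYPES) mirror Spark's LogKey values.
import io
import json
import logging


class StructuredFormatter(logging.Formatter):
    """Render the message plus any MDC-style fields as one JSON object."""

    def format(self, record):
        payload = {"msg": record.getMessage()}
        payload.update(getattr(record, "mdc", {}))
        return json.dumps(payload, sort_keys=True)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(StructuredFormatter())
logger = logging.getLogger("hive-demo")
logger.addHandler(handler)
logger.propagate = False
logger.setLevel(logging.INFO)

logger.info(
    "Persisting data source table %s into Hive metastore", "db.tbl",
    extra={"mdc": {"TABLE_NAME": "db.tbl", "INCOMPATIBLE_TYPES": "map<int,int>"}},
)
print(stream.getvalue().strip())
```

The JSON line carries both the human-readable message and the queryable keys, which is the same separation the `log"... ${MDC(...)}"` interpolator provides in Scala.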
(spark) branch master updated (268856da31c1 -> 2054ab0fb03f)
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

 from 268856da31c1 [SPARK-47879][SQL] Oracle: Use VARCHAR2 instead of VARCHAR for VarcharType mapping
  add 2054ab0fb03f [SPARK-47880][SQL][DOCS] Oracle: Document Mapping Spark SQL Data Types to Oracle

No new revisions were added by this update.

Summary of changes:
 docs/sql-data-sources-jdbc.md | 106 ++
 1 file changed, 106 insertions(+)
(spark) branch master updated (ab6338e09aa0 -> 268856da31c1)
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

 from ab6338e09aa0 [SPARK-47838][BUILD] Upgrade `rocksdbjni` to 8.11.4
  add 268856da31c1 [SPARK-47879][SQL] Oracle: Use VARCHAR2 instead of VARCHAR for VarcharType mapping

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala | 11 ++-
 .../main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala  |  1 +
 .../src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala  |  1 +
 3 files changed, 12 insertions(+), 1 deletion(-)
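The dialect change itself is a one-line Scala mapping, but the idea generalizes: a JDBC dialect translates Spark SQL types into vendor-specific DDL types, and SPARK-47879 switches `VarcharType` to Oracle's `VARCHAR2` rather than the deprecated `VARCHAR`. A hedged Python sketch of such a mapping (function name and structure are hypothetical; Spark's real `OracleDialect` is Scala and covers many more types):

```python
# Illustrative sketch of a JDBC-dialect type mapping in the spirit of
# SPARK-47879. The function name is hypothetical, not Spark's API.
def oracle_ddl_type(spark_type, length=None):
    """Map a Spark SQL type name to an Oracle DDL type string."""
    fixed = {
        "StringType": "VARCHAR2(255)",
        "IntegerType": "NUMBER(10)",
        "BooleanType": "NUMBER(1)",
    }
    if spark_type == "VarcharType":
        # Post-SPARK-47879 behavior: VARCHAR2(n) instead of VARCHAR(n).
        return "VARCHAR2({})".format(length)
    return fixed[spark_type]


print(oracle_ddl_type("VarcharType", 100))  # VARCHAR2(100)
```

Centralizing the mapping in one dialect function is what makes a change like VARCHAR → VARCHAR2 a one-line fix.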
(spark) branch master updated: [SPARK-47838][BUILD] Upgrade `rocksdbjni` to 8.11.4
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new ab6338e09aa0 [SPARK-47838][BUILD] Upgrade `rocksdbjni` to 8.11.4

ab6338e09aa0 is described below

commit ab6338e09aa0fe06aef1c753eaaf677f766e9490
Author: Neil Ramaswamy
AuthorDate: Tue Apr 16 20:11:16 2024 -0700

    [SPARK-47838][BUILD] Upgrade `rocksdbjni` to 8.11.4

    ### What changes were proposed in this pull request?
    Upgrades the `rocksdbjni` dependency to 8.11.4.

    ### Why are the changes needed?
    8.11.4 has Java-related RocksDB fixes: https://github.com/facebook/rocksdb/releases/tag/v8.11.4
    - Fixed the CMake Javadoc build
    - Fixed Java SstFileMetaData to prevent throwing java.lang.NoSuchMethodError

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    - All existing UTs should pass
    - [In progress] Performance benchmarks

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #46065 from neilramaswamy/spark-47838.
Authored-by: Neil Ramaswamy
Signed-off-by: Dongjoon Hyun
---
 dev/deps/spark-deps-hadoop-3-hive-2.3              |   2 +-
 pom.xml                                            |   2 +-
 ...StoreBasicOperationsBenchmark-jdk21-results.txt | 122 +++--
 .../StateStoreBasicOperationsBenchmark-results.txt | 122 +++--
 4 files changed, 126 insertions(+), 122 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 466e8d09d89e..54e54a108904 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -247,7 +247,7 @@ parquet-jackson/1.13.1//parquet-jackson-1.13.1.jar
 pickle/1.3//pickle-1.3.jar
 py4j/0.10.9.7//py4j-0.10.9.7.jar
 remotetea-oncrpc/1.1.2//remotetea-oncrpc-1.1.2.jar
-rocksdbjni/8.11.3//rocksdbjni-8.11.3.jar
+rocksdbjni/8.11.4//rocksdbjni-8.11.4.jar
 scala-collection-compat_2.13/2.7.0//scala-collection-compat_2.13-2.7.0.jar
 scala-compiler/2.13.13//scala-compiler-2.13.13.jar
 scala-library/2.13.13//scala-library-2.13.13.jar
diff --git a/pom.xml b/pom.xml
index bf8d4f1b417d..7ded74b9f9df 100644
--- a/pom.xml
+++ b/pom.xml
@@ -687,7 +687,7 @@
       <groupId>org.rocksdb</groupId>
       <artifactId>rocksdbjni</artifactId>
-      <version>8.11.3</version>
+      <version>8.11.4</version>
     </dependency>
     <dependency>
       <groupId>${leveldbjni.group}</groupId>
diff --git a/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk21-results.txt b/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk21-results.txt
index 0317e6116375..953031fc1daf 100644
--- a/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk21-results.txt
+++ b/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk21-results.txt
@@ -2,141 +2,143 @@ put rows
-OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1017-azure
 AMD EPYC 7763 64-Core Processor
 putting 1 rows (1 rows to overwrite - rate 100):  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
 ------------------------------------------------------------------------------------------------------------------------
-In-memory                                     9    10    1    1.1    936.2    1.0X
-RocksDB (trackTotalNumberOfRows: true)       41    42    1    0.2   4068.9    0.2X
-RocksDB (trackTotalNumberOfRows: false)      15    16    1    0.7   1500.4    0.6X
+In-memory                                     9    10    1    1.1    938.9    1.0X
+RocksDB (trackTotalNumberOfRows: true)       42    44    2    0.2   4215.2    0.2X
+RocksDB (trackTotalNumberOfRows: false)      15    16    1    0.7   1535.3    0.6X

-OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1017-azure
 AMD EPYC 7763 64-Core Processor
 putting 1 rows (5000 rows to overwrite - rate 50):  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
 --------------------------------------------------------------------------------------------------------------------------
-In-memory                                     9    11    1    1.1    929.8    1.0X
-RocksDB (trackTotalNumberOfRows: true)       40
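The benchmark columns are unit conversions of one another: `Per Row(ns)` and `Rate(M/s)` both follow from the row count and the best wall-clock time. A small sketch of that arithmetic (row count and timing below are illustrative, taken roughly from the "In-memory" line):

```python
# How the benchmark's derived columns relate: given a row count and the best
# wall-clock time, Rate(M/s) and Per Row(ns) are just unit conversions.
def derived_columns(rows, best_ms):
    per_row_ns = best_ms * 1_000_000 / rows        # ms -> ns, divided per row
    rate_m_per_s = rows / (best_ms / 1000) / 1e6   # rows per second, in millions
    return per_row_ns, rate_m_per_s


# Roughly the "In-memory put" line: assume 10_000 rows in ~9.39 ms.
per_row, rate = derived_columns(10_000, 9.39)
print(round(per_row, 1), round(rate, 1))
```

Since the `Relative` column is each row's rate divided by the baseline rate, the whole table can be regenerated from the raw timings alone.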
(spark) branch master updated (f9ebe1b3d24b -> 6c827c10dc15)
This is an automated email from the ASF dual-hosted git repository.

xinrong pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

 from f9ebe1b3d24b [SPARK-46375][DOCS] Add user guide for Python data source API
  add 6c827c10dc15 [SPARK-47876][PYTHON][DOCS] Improve docstring of mapInArrow

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/pandas/map_ops.py | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)
(spark) branch master updated: [SPARK-46375][DOCS] Add user guide for Python data source API
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f9ebe1b3d24b [SPARK-46375][DOCS] Add user guide for Python data source API

f9ebe1b3d24b is described below

commit f9ebe1b3d24b126784b3bb65d1eb710a74cf63de
Author: allisonwang-db
AuthorDate: Wed Apr 17 09:54:42 2024 +0900

    [SPARK-46375][DOCS] Add user guide for Python data source API

    ### What changes were proposed in this pull request?
    This PR adds a new user guide for the Python data source API with a simple example. More examples (including streaming) will be added in the future.

    ### Why are the changes needed?
    To improve the documentation.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    doctest

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #46089 from allisonwang-db/spark-46375-pyds-user-guide.

Authored-by: allisonwang-db
Signed-off-by: Hyukjin Kwon
---
 python/docs/source/user_guide/sql/index.rst        |   1 +
 .../source/user_guide/sql/python_data_source.rst   | 139 +
 2 files changed, 140 insertions(+)

diff --git a/python/docs/source/user_guide/sql/index.rst b/python/docs/source/user_guide/sql/index.rst
index 118cf139d9b3..d1b67f7eeb90 100644
--- a/python/docs/source/user_guide/sql/index.rst
+++ b/python/docs/source/user_guide/sql/index.rst
@@ -25,5 +25,6 @@ Spark SQL
     arrow_pandas
     python_udtf
+    python_data_source
     type_conversions
diff --git a/python/docs/source/user_guide/sql/python_data_source.rst b/python/docs/source/user_guide/sql/python_data_source.rst
new file mode 100644
index ..19ed016b82c2
--- /dev/null
+++ b/python/docs/source/user_guide/sql/python_data_source.rst
@@ -0,0 +1,139 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.
+    See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License. You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied. See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+======================
+Python Data Source API
+======================
+
+.. currentmodule:: pyspark.sql
+
+Overview
+--------
+The Python Data Source API is a new feature introduced in Spark 4.0, enabling developers to read from custom data sources and write to custom data sinks in Python.
+This guide provides a comprehensive overview of the API and instructions on how to create, use, and manage Python data sources.
+
+
+Creating a Python Data Source
+-----------------------------
+To create a custom Python data source, you'll need to subclass the :class:`DataSource` base classes and implement the necessary methods for reading and writing data.
+
+This example demonstrates creating a simple data source to generate synthetic data using the `faker` library. Ensure the `faker` library is installed and accessible in your Python environment.
+
+**Step 1: Define the Data Source**
+
+Start by creating a new subclass of :class:`DataSource`. Define the source name, schema, and reader logic as follows:
+
+.. code-block:: python
+
+    from pyspark.sql.datasource import DataSource, DataSourceReader
+    from pyspark.sql.types import StructType
+
+    class FakeDataSource(DataSource):
+        """
+        A fake data source for PySpark to generate synthetic data using the `faker` library.
+        Options:
+        - numRows: specify number of rows to generate. Default value is 3.
+        """
+
+        @classmethod
+        def name(cls):
+            return "fake"
+
+        def schema(self):
+            return "name string, date string, zipcode string, state string"
+
+        def reader(self, schema: StructType):
+            return FakeDataSourceReader(schema, self.options)
+
+
+**Step 2: Implement the Reader**
+
+Define the reader logic to generate synthetic data. Use the `faker` library to populate each field in the schema.
+
+.. code-block:: python
+
+    class FakeDataSourceReader(DataSourceReader):
+
+        def __init__(self, schema, options):
+            self.schema: StructType = schema
+            self.options = options
+
+        def read(self, partition):
+            from faker import Fa
(spark) branch master updated (5321353b24db -> 86837d3155b1)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

 from 5321353b24db [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer`
  add 86837d3155b1 [SPARK-47877][SS][CONNECT] Speed up test_parity_listener

No new revisions were added by this update.

Summary of changes:
 .../connect/streaming/test_parity_listener.py      | 119 +++--
 .../sql/tests/streaming/test_streaming_listener.py |  21 ++--
 2 files changed, 71 insertions(+), 69 deletions(-)
(spark) branch master updated: [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5321353b24db [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer`

5321353b24db is described below

commit 5321353b24db247087890c44de06b9ad4e136473
Author: Dongjoon Hyun
AuthorDate: Tue Apr 16 16:47:23 2024 -0700

    [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer`

    ### What changes were proposed in this pull request?
    This is a logical revert of SPARK-46205:
    - #44113
    - #44118

    ### Why are the changes needed?
    The initial implementation didn't handle the class initialization logic properly. Until we have a fix, I'd like to revert this from the `master` branch.

    ### Does this PR introduce _any_ user-facing change?
    No, this is not released yet.

    ### How was this patch tested?
    Pass the CIs.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #46087 from dongjoon-hyun/SPARK-47875.
Authored-by: Dongjoon Hyun
Signed-off-by: Dongjoon Hyun
---
 .../PersistenceEngineBenchmark-jdk21-results.txt   |  7 --
 .../PersistenceEngineBenchmark-results.txt         |  7 --
 .../org/apache/spark/deploy/master/Master.scala    |  7 ++
 .../org/apache/spark/internal/config/Deploy.scala  | 14 ----
 .../deploy/master/PersistenceEngineBenchmark.scala |  4 ++--
 .../deploy/master/PersistenceEngineSuite.scala     | 14 +---
 .../apache/spark/deploy/master/RecoverySuite.scala | 25 ++
 docs/spark-standalone.md                           | 12 ++-
 8 files changed, 9 insertions(+), 81 deletions(-)

diff --git a/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt b/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
index 2a6bd778fc8a..ae4e0071adb0 100644
--- a/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
+++ b/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
@@ -7,19 +7,12 @@ AMD EPYC 7763 64-Core Processor
 1000 Workers:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
 ZooKeeperPersistenceEngine with JavaSerializer            5036  5232  229  0.0  5035730.1      1.0X
-ZooKeeperPersistenceEngine with KryoSerializer            4038  4053   16  0.0  4038447.8      1.2X
 FileSystemPersistenceEngine with JavaSerializer           2902  2906    5  0.0  2902453.3      1.7X
 FileSystemPersistenceEngine with JavaSerializer (lz4)      816   829   19  0.0   816173.1      6.2X
 FileSystemPersistenceEngine with JavaSerializer (lzf)      755   780   33  0.0   755209.0      6.7X
 FileSystemPersistenceEngine with JavaSerializer (snappy)   814   832   16  0.0   813672.5      6.2X
 FileSystemPersistenceEngine with JavaSerializer (zstd)     987  1014   45  0.0   986834.7      5.1X
-FileSystemPersistenceEngine with KryoSerializer            687   698   14  0.0   687313.5      7.3X
-FileSystemPersistenceEngine with KryoSerializer (lz4)      590   599   15  0.0   589867.9      8.5X
-FileSystemPersistenceEngine with KryoSerializer (lzf)      915   922    9  0.0   915432.2      5.5X
-FileSystemPersistenceEngine with KryoSerializer (snappy)   768   795   37  0.0   768494.4      6.6X
-FileSystemPersistenceEngine with KryoSerializer (zstd)     898   950   45  0.0   898118.6      5.6X
 RocksDBPersistenceEngine with JavaSerializer               299   299    0  0.0   298800.0     16.9X
-RocksDBPersistenceEngine with KryoSerializer               112   113    1  0.0   111779.6     45.1X
 BlackHolePersistenceEngine                                   0     0    0  5.5      180.3  27924.2X

diff --git a/core/benchmarks/PersistenceEngineBenchmark-results.txt b/core/benchmarks/PersistenceEngineBenchmark-results.txt
index da1838608de1..ec9a6fc1c8cf 100644
--- a/core/benchmarks/PersistenceEngineBenchmark-results.txt
+++ b/core/benchmarks/PersistenceEngineBenchmark-results.txt
@@ -7,19 +7,12 @@ AMD EPYC 7763 64-Core Processor
 1000 Workers:  Best Time
(spark) branch master updated: [SPARK-47760][SPARK-47763][CONNECT][TESTS] Reenable Avro and Protobuf function doctests
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 57c7db2c4c1d [SPARK-47760][SPARK-47763][CONNECT][TESTS] Reenable Avro and Protobuf function doctests

57c7db2c4c1d is described below

commit 57c7db2c4c1dbeeba062fe28ab58245e0a3098eb
Author: Hyukjin Kwon
AuthorDate: Wed Apr 17 08:47:01 2024 +0900

    [SPARK-47760][SPARK-47763][CONNECT][TESTS] Reenable Avro and Protobuf function doctests

    ### What changes were proposed in this pull request?
    This PR proposes to reenable the Avro and Protobuf function doctests by providing the required jars to the Spark Connect server.

    ### Why are the changes needed?
    For test coverage of Avro and Protobuf functions.

    ### Does this PR introduce _any_ user-facing change?
    No, test-only.

    ### How was this patch tested?
    Tested in my fork: https://github.com/HyukjinKwon/spark/actions/runs/8704014674/job/23871383802

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #46055 from HyukjinKwon/SPARK-47763-SPARK-47760.
Authored-by: Hyukjin Kwon
Signed-off-by: Hyukjin Kwon
---
 .github/workflows/build_python_connect.yml       | 11 ++-
 python/pyspark/sql/connect/avro/functions.py     |  7 ---
 python/pyspark/sql/connect/protobuf/functions.py |  7 ---
 3 files changed, 6 insertions(+), 19 deletions(-)

diff --git a/.github/workflows/build_python_connect.yml b/.github/workflows/build_python_connect.yml
index 965e839b6b2b..863980b0c2e5 100644
--- a/.github/workflows/build_python_connect.yml
+++ b/.github/workflows/build_python_connect.yml
@@ -29,7 +29,6 @@ jobs:
     name: "Build modules: pyspark-connect"
     runs-on: ubuntu-latest
     timeout-minutes: 300
-    if: github.repository == 'apache/spark'
     steps:
       - name: Checkout Spark repository
         uses: actions/checkout@v4
@@ -63,7 +62,7 @@ jobs:
           architecture: x64
       - name: Build Spark
         run: |
-          ./build/sbt -Phive test:package
+          ./build/sbt -Phive Test/package
       - name: Install pure Python package (pyspark-connect)
         env:
           SPARK_TESTING: 1
@@ -82,7 +81,9 @@ jobs:
           cp conf/log4j2.properties.template conf/log4j2.properties
           sed -i 's/rootLogger.level = info/rootLogger.level = warn/g' conf/log4j2.properties
           # Start a Spark Connect server
-          PYTHONPATH="python/lib/pyspark.zip:python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH" ./sbin/start-connect-server.sh --driver-java-options "-Dlog4j.configurationFile=file:$GITHUB_WORKSPACE/conf/log4j2.properties" --jars `find connector/connect/server/target -name spark-connect*SNAPSHOT.jar`
+          PYTHONPATH="python/lib/pyspark.zip:python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH" ./sbin/start-connect-server.sh \
+            --driver-java-options "-Dlog4j.configurationFile=file:$GITHUB_WORKSPACE/conf/log4j2.properties" \
+            --jars "`find connector/connect/server/target -name spark-connect-*SNAPSHOT.jar`,`find connector/protobuf/target -name spark-protobuf-*SNAPSHOT.jar`,`find connector/avro/target -name spark-avro*SNAPSHOT.jar`"
           # Make sure running Python workers that contains pyspark.core once. They will be reused.
           python -c "from pyspark.sql import SparkSession; _ = SparkSession.builder.remote('sc://localhost').getOrCreate().range(100).repartition(100).mapInPandas(lambda x: x, 'id INT').collect()"
           # Remove Py4J and PySpark zipped library to make sure there is no JVM connection
@@ -98,9 +99,9 @@ jobs:
         with:
           name: test-results-spark-connect-python-only
           path: "**/target/test-reports/*.xml"
-      - name: Upload unit tests log files
+      - name: Upload Spark Connect server log file
         if: failure()
         uses: actions/upload-artifact@v4
         with:
           name: unit-tests-log-spark-connect-python-only
-          path: "**/target/unit-tests.log"
+          path: logs/*.out
diff --git a/python/pyspark/sql/connect/avro/functions.py b/python/pyspark/sql/connect/avro/functions.py
index 43088333b108..f153b17acf58 100644
--- a/python/pyspark/sql/connect/avro/functions.py
+++ b/python/pyspark/sql/connect/avro/functions.py
@@ -80,15 +80,8 @@ def _test() -> None:
     import doctest
     from pyspark.sql import SparkSession as PySparkSession
     import pyspark.sql.connect.avro.functions
-    from pyspark.util import is_remote_only

     globs = pyspark.sql.connect.avro.functions.__dict__.copy()
-
-    # TODO(SPARK-47760): Reeanble Avro function doctests
-    if is_remote_only():
-        del pyspark.sql.connect.avro.functions.from_avro
-        del pyspark.sql.connect.avro.functions.to_avro
-
     globs["spark"] = (
         PySparkSession.buil
(spark) branch master updated: [SPARK-47868][CONNECT] Fix recursion limit error in SparkConnectPlanner and SparkSession
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 119271d49356 [SPARK-47868][CONNECT] Fix recursion limit error in SparkConnectPlanner and SparkSession

119271d49356 is described below

commit 119271d4935605f15c358c52410dc20db40ace86
Author: Tom van Bussel
AuthorDate: Wed Apr 17 07:33:24 2024 +0800

    [SPARK-47868][CONNECT] Fix recursion limit error in SparkConnectPlanner and SparkSession

    ### What changes were proposed in this pull request?
    This PR adds a helper function to `ProtoUtils` that calls a proto parser on a byte array with an increased recursion limit. This helper function is used to enhance the `parseFrom` calls in `SparkSession` and `SparkConnectPlanner`.

    ### Why are the changes needed?
    Otherwise Spark Connect extensions will get the following exception:
    ```
    org.apache.spark.SparkException: grpc_shaded.com.google.protobuf.InvalidProtocolBufferException: Protocol message had too many levels of nesting. May be malicious. Use CodedInputStream.setRecursionLimit() to increase the depth limit.
    ```

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    Existing tests to make sure nothing breaks. Manually tested (using an extension that is currently in development) that this solves the exceptions.

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #46075 from tomvanbussel/SPARK-47868.
Authored-by: Tom van Bussel
Signed-off-by: Ruifeng Zheng
---
 .../scala/org/apache/spark/sql/SparkSession.scala  |  7 +-
 .../sql/connect/client/SparkConnectClient.scala    | 10 -
 .../spark/sql/connect/common/ProtoUtils.scala      | 25 --
 .../sql/connect/common/config/ConnectCommon.scala  |  3 ++-
 .../apache/spark/sql/connect/config/Connect.scala  |  2 +-
 .../sql/connect/planner/SparkConnectPlanner.scala  | 14
 6 files changed, 51 insertions(+), 10 deletions(-)

diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
index 5a2d9bc44c9f..54e29c80c728 100644
--- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
+++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
@@ -39,6 +39,7 @@ import org.apache.spark.sql.catalyst.encoders.AgnosticEncoders.{BoxedLongEncoder
 import org.apache.spark.sql.connect.client.{ClassFinder, SparkConnectClient, SparkResult}
 import org.apache.spark.sql.connect.client.SparkConnectClient.Configuration
 import org.apache.spark.sql.connect.client.arrow.ArrowSerializer
+import org.apache.spark.sql.connect.common.ProtoUtils
 import org.apache.spark.sql.functions.lit
 import org.apache.spark.sql.internal.{CatalogImpl, SqlApiConf}
 import org.apache.spark.sql.streaming.DataStreamReader
@@ -586,9 +587,13 @@ class SparkSession private[sql] (
   @DeveloperApi
   def execute(extension: Array[Byte]): Unit = {
+    val any = ProtoUtils.parseWithRecursionLimit(
+      extension,
+      com.google.protobuf.Any.parser(),
+      recursionLimit = client.configuration.grpcMaxRecursionLimit)
     val command = proto.Command
       .newBuilder()
-      .setExtension(com.google.protobuf.Any.parseFrom(extension))
+      .setExtension(any)
       .build()
     execute(command)
   }
diff --git a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClient.scala b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClient.scala
index d9d51c15a880..1e7b4e6574dd 100644
--- a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClient.scala
+++ b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClient.scala
@@ -566,6 +566,13 @@ object SparkConnectClient {
     def grpcMaxMessageSize: Int = _configuration.grpcMaxMessageSize

+    def grpcMaxRecursionLimit(recursionLimit: Int): Builder = {
+      _configuration = _configuration.copy(grpcMaxRecursionLimit = recursionLimit)
+      this
+    }
+
+    def grpcMaxRecursionLimit: Int = _configuration.grpcMaxRecursionLimit
+
     def option(key: String, value: String): Builder = {
       _configuration = _configuration.copy(metadata = _configuration.metadata + ((key, value)))
       this
@@ -703,7 +710,8 @@ object SparkConnectClient {
       useReattachableExecute: Boolean = true,
       interceptors: List[ClientInterceptor] = List.empty,
       sessionId: Option[String] = None,
-      grpcMaxMessageSize: Int = ConnectCommon.CONNECT_GRPC_MAX_MESSAGE_SIZE) {
+      grpcMaxMessag
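The fix raises the protobuf parser's recursion limit before decoding deeply nested plans, mirroring `CodedInputStream.setRecursionLimit()`. The same pattern in miniature, as a stdlib sketch (a toy nested-list parser, not protobuf): the parser refuses pathological nesting unless the caller explicitly raises the limit.

```python
# Toy illustration of a parser-side recursion limit: reject deep nesting by
# default, and let the caller raise the limit explicitly for trusted input,
# which is the shape of the ProtoUtils.parseWithRecursionLimit fix.
def parse_nested(text, pos=0, depth=0, recursion_limit=100):
    """Parse strings like '[[[]]]' into nested lists, bounding the depth."""
    if depth > recursion_limit:
        raise ValueError("too many levels of nesting; raise recursion_limit")
    assert text[pos] == "["
    node, pos = [], pos + 1
    while text[pos] != "]":
        child, pos = parse_nested(text, pos, depth + 1, recursion_limit)
        node.append(child)
    return node, pos + 1


deep = "[" * 150 + "]" * 150
try:
    parse_nested(deep)  # default limit of 100 is exceeded
except ValueError as e:
    print("default limit:", e)

tree, _ = parse_nested(deep, recursion_limit=1024)  # explicit higher limit
print("parsed with raised limit:", isinstance(tree, list))
```

A low default limit protects the server from maliciously nested payloads; raising it only where deeply nested plans are legitimate (Spark Connect extensions) keeps that protection for everything else.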
Re: [PR] Fix title for spark 3.5.1 PySpark doc [spark-website]
panbingkun commented on PR #514:
URL: https://github.com/apache/spark-website/pull/514#issuecomment-2060057109

   cc @HeartSaVioR @dongjoon-hyun @HyukjinKwon

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Fix title for spark 3.5.1 PySpark doc [spark-website]
panbingkun commented on PR #514: URL: https://github.com/apache/spark-website/pull/514#issuecomment-2060055831 Screenshot: https://github.com/apache/spark-website/assets/15246973/3f38fb82-a438-428f-8af6-e2851926c7b5
[PR] Fix title for spark 3.5.1 PySpark doc [spark-website]
panbingkun opened a new pull request, #514: URL: https://github.com/apache/spark-website/pull/514 This PR aims to fix the `title` for the Spark `3.5.1` PySpark doc.

(spark) branch master updated: [SPARK-47590][SQL] Hive-thriftserver: Migrate logWarn with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f7440f384191 [SPARK-47590][SQL] Hive-thriftserver: Migrate logWarn with variables to structured logging framework f7440f384191 is described below commit f7440f3841918f2cdb4a8e710cfe31d3fc85230c Author: Haejoon Lee AuthorDate: Tue Apr 16 13:56:03 2024 -0700 [SPARK-47590][SQL] Hive-thriftserver: Migrate logWarn with variables to structured logging framework ### What changes were proposed in this pull request? This PR proposes to migrate `logWarning` with variables of Hive-thriftserver module to structured logging framework. ### Why are the changes needed? To improve the existing logging system by migrating into structured logging. ### Does this PR introduce _any_ user-facing change? No API changes, but the SQL catalyst logs will contain MDC(Mapped Diagnostic Context) from now. ### How was this patch tested? Run Scala auto formatting and style check. Also the existing CI should pass. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45923 from itholic/hive-ts-logwarn. 
Lead-authored-by: Haejoon Lee Co-authored-by: Haejoon Lee <44108233+itho...@users.noreply.github.com> Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 1 + .../SparkExecuteStatementOperation.scala | 4 ++- .../sql/hive/thriftserver/SparkSQLCLIDriver.scala | 15 - .../ui/HiveThriftServer2Listener.scala | 36 -- 4 files changed, 38 insertions(+), 18 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index 41289c641424..bfeb733af30a 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -94,6 +94,7 @@ object LogKey extends Enumeration { val FUNCTION_PARAMETER = Value val GROUP_ID = Value val HADOOP_VERSION = Value + val HISTORY_DIR = Value val HIVE_OPERATION_STATE = Value val HIVE_OPERATION_TYPE = Value val HOST = Value diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala index 628925007f7e..f8f58cd422b6 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala @@ -256,7 +256,9 @@ private[hive] class SparkExecuteStatementOperation( val currentState = getStatus().getState() if (currentState.isTerminal) { // This may happen if the execution was cancelled, and then closed from another thread. 
- logWarning(s"Ignore exception in terminal state with $statementId: $e") + logWarning( +log"Ignore exception in terminal state with ${MDC(STATEMENT_ID, statementId)}", e + ) } else { logError(log"Error executing query with ${MDC(STATEMENT_ID, statementId)}, " + log"currentState ${MDC(HIVE_OPERATION_STATE, currentState)}, ", e) diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala index 03d8fd0c8ff2..888c086e9042 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala @@ -41,7 +41,7 @@ import sun.misc.{Signal, SignalHandler} import org.apache.spark.{ErrorMessageFormat, SparkConf, SparkThrowable, SparkThrowableHelper} import org.apache.spark.deploy.SparkHadoopUtil import org.apache.spark.internal.{Logging, MDC} -import org.apache.spark.internal.LogKey.ERROR +import org.apache.spark.internal.LogKey._ import org.apache.spark.sql.AnalysisException import org.apache.spark.sql.catalyst.analysis.FunctionRegistry import org.apache.spark.sql.catalyst.util.SQLKeywordUtils @@ -232,14 +232,14 @@ private[hive] object SparkSQLCLIDriver extends Logging { val historyFile = historyDirectory + File.separator + ".hivehistory" reader.setHistory(new FileHistory(new File(historyFile))) } else { -logWarning("WARNING: Directory for Hive history file: " + historyDirectory + - " does not exist. H
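The pattern being migrated to — a message template plus named MDC (Mapped Diagnostic Context) key/value pairs instead of string interpolation — can be sketched in a few lines of stdlib Python. The class and key names here are illustrative, not Spark's `Logging` API:

```python
import json


class StructuredLogger:
    """Render a human-readable message while keeping MDC fields machine-readable."""

    def __init__(self):
        self.lines = []  # captured here instead of a real logging backend

    def warning(self, template: str, **mdc) -> str:
        line = json.dumps({
            "level": "WARN",
            "msg": template.format(**mdc),  # rendered for humans
            "mdc": mdc,                     # structured, for log pipelines
        }, sort_keys=True)
        self.lines.append(line)
        return line


log = StructuredLogger()
log.warning("Ignore exception in terminal state with {STATEMENT_ID}",
            STATEMENT_ID="stmt-42")
```

Downstream tooling can then filter on `mdc.STATEMENT_ID` directly instead of regex-matching interpolated strings, which is the point of the migration.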
(spark) branch master updated (9a1fc112677f -> 6919febfcc87)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 9a1fc112677f [SPARK-47871][SQL] Oracle: Map TimestampType to TIMESTAMP WITH LOCAL TIME ZONE add 6919febfcc87 [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/internal/LogKey.scala | 36 +- .../org/apache/spark/sql/avro/AvroUtils.scala | 7 +++-- .../execution/ExecuteGrpcResponseSender.scala | 33 +++- .../execution/ExecuteResponseObserver.scala| 19 +++- .../sql/connect/planner/SparkConnectPlanner.scala | 7 +++-- .../planner/StreamingForeachBatchHelper.scala | 20 .../planner/StreamingQueryListenerHelper.scala | 7 +++-- .../sql/connect/service/LoggingInterceptor.scala | 9 -- .../spark/sql/connect/service/SessionHolder.scala | 15 ++--- .../service/SparkConnectExecutionManager.scala | 17 ++ .../sql/connect/service/SparkConnectServer.scala | 7 +++-- .../sql/connect/service/SparkConnectService.scala | 5 +-- .../service/SparkConnectSessionManager.scala | 11 +-- .../service/SparkConnectStreamingQueryCache.scala | 26 +++- .../spark/sql/connect/utils/ErrorUtils.scala | 4 +-- .../sql/kafka010/KafkaBatchPartitionReader.scala | 14 ++--- .../spark/sql/kafka010/KafkaContinuousStream.scala | 4 +-- .../spark/sql/kafka010/KafkaMicroBatchStream.scala | 4 +-- .../sql/kafka010/KafkaOffsetReaderAdmin.scala | 4 +-- .../sql/kafka010/KafkaOffsetReaderConsumer.scala | 4 +-- .../apache/spark/sql/kafka010/KafkaRelation.scala | 7 +++-- .../org/apache/spark/sql/kafka010/KafkaSink.scala | 5 +-- .../apache/spark/sql/kafka010/KafkaSource.scala| 11 --- .../apache/spark/sql/kafka010/KafkaSourceRDD.scala | 6 ++-- .../sql/kafka010/consumer/KafkaDataConsumer.scala | 13 +--- .../kafka010/producer/CachedKafkaProducer.scala| 5 +-- .../apache/spark/sql/kafka010/KafkaTestUtils.scala | 10 +++--- 
.../kafka010/DirectKafkaInputDStream.scala | 9 -- .../streaming/kafka010/KafkaDataConsumer.scala | 18 ++- .../apache/spark/streaming/kafka010/KafkaRDD.scala | 12 +--- .../spark/streaming/kinesis/KinesisReceiver.scala | 7 +++-- .../streaming/kinesis/KinesisRecordProcessor.scala | 12 +--- .../executor/profiler/ExecutorJVMProfiler.scala| 5 +-- .../executor/profiler/ExecutorProfilerPlugin.scala | 6 ++-- .../scala/org/apache/spark/deploy/Client.scala | 6 ++-- .../spark/deploy/yarn/ApplicationMaster.scala | 4 +-- .../scheduler/cluster/YarnSchedulerBackend.scala | 6 ++-- 37 files changed, 257 insertions(+), 138 deletions(-)
(spark) branch master updated: [SPARK-47871][SQL] Oracle: Map TimestampType to TIMESTAMP WITH LOCAL TIME ZONE
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 9a1fc112677f [SPARK-47871][SQL] Oracle: Map TimestampType to TIMESTAMP WITH LOCAL TIME ZONE 9a1fc112677f is described below commit 9a1fc112677f98089d946b3bf4f52b33ab0a5c23 Author: Kent Yao AuthorDate: Tue Apr 16 08:35:51 2024 -0700 [SPARK-47871][SQL] Oracle: Map TimestampType to TIMESTAMP WITH LOCAL TIME ZONE ### What changes were proposed in this pull request? This PR maps TimestampType to TIMESTAMP WITH LOCAL TIME ZONE. ### Why are the changes needed? We currently map both TimestampType and TimestampNTZType to Oracle's TIMESTAMP, which represents a timestamp without time zone. This is ambiguous. ### Does this PR introduce _any_ user-facing change? It does not affect Spark users in a TimestampType read-write-read roundtrip, but might affect other systems' reading. ### How was this patch tested? Existing test with new configuration ```java SPARK-42627: Support ORACLE TIMESTAMP WITH LOCAL TIME ZONE (9 seconds, 536 milliseconds) ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46080 from yaooqinn/SPARK-47871. 
Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../spark/sql/jdbc/OracleIntegrationSuite.scala| 39 -- docs/sql-migration-guide.md| 1 + .../org/apache/spark/sql/internal/SQLConf.scala| 12 +++ .../org/apache/spark/sql/jdbc/OracleDialect.scala | 5 ++- .../org/apache/spark/sql/jdbc/JDBCSuite.scala | 5 ++- 5 files changed, 43 insertions(+), 19 deletions(-) diff --git a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala index 418b86fb6b23..496498e5455b 100644 --- a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala +++ b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala @@ -547,23 +547,28 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with SharedSpark } test("SPARK-42627: Support ORACLE TIMESTAMP WITH LOCAL TIME ZONE") { -val reader = spark.read.format("jdbc") - .option("url", jdbcUrl) - .option("dbtable", "test_ltz") -val df = reader.load() -val row1 = df.collect().head.getTimestamp(0) -assert(df.count() === 1) -assert(row1 === Timestamp.valueOf("2018-11-17 13:33:33")) - -df.write.format("jdbc") - .option("url", jdbcUrl) - .option("dbtable", "test_ltz") - .mode("append") - .save() - -val df2 = reader.load() -assert(df.count() === 2) -assert(df2.collect().forall(_.getTimestamp(0) === row1)) +Seq("true", "false").foreach { flag => + withSQLConf((SQLConf.LEGACY_ORACLE_TIMESTAMP_MAPPING_ENABLED.key, flag)) { +val df = spark.read.format("jdbc") + .option("url", jdbcUrl) + .option("dbtable", "test_ltz") + .load() +val row1 = df.collect().head.getTimestamp(0) +assert(df.count() === 1) +assert(row1 === Timestamp.valueOf("2018-11-17 13:33:33")) + +df.write.format("jdbc") + .option("url", jdbcUrl) + .option("dbtable", "test_ltz" + flag) + .save() + +val df2 = spark.read.format("jdbc") + 
.option("url", jdbcUrl) + .option("dbtable", "test_ltz" + flag) + .load() +checkAnswer(df2, Row(row1)) + } +} } test("SPARK-47761: Reading ANSI INTERVAL Types") { diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md index c7bd0b55840c..3004008b8ec7 100644 --- a/docs/sql-migration-guide.md +++ b/docs/sql-migration-guide.md @@ -45,6 +45,7 @@ license: | - Since Spark 4.0, MySQL JDBC datasource will read FLOAT as FloatType, while in Spark 3.5 and previous, it was read as DoubleType. To restore the previous behavior, you can cast the column to the old type. - Since Spark 4.0, MySQL JDBC datasource will read BIT(n > 1) as BinaryType, while in Spark 3.5 and previous, read as LongType. To restore the previous behavior, set `spark.sql.legacy.mysql.bitArrayMapping.enabled` to `true`. - Since Spark 4.0, MySQL JDBC datasource will write ShortType as SMALLINT, while in Spark 3.5 and previous, write as INTEGER. To restore the previous behavior, you can replace the column with IntegerType whenever before writing. +- Since Spark 4.0, Oracle JDBC datasource will write TimestampType as TIMESTAMP WITH LOCAL TIME ZONE, while in Spark 3.5 and previous, write as TIMESTAMP. To restore the previous behavior,
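The dialect change is easiest to see as a small write-side type-mapping table gated by the legacy flag. This is a hedged Python model of the behavior described above, not the `OracleDialect` code itself; the flag parameter stands in for `SQLConf.LEGACY_ORACLE_TIMESTAMP_MAPPING_ENABLED`:

```python
def oracle_write_type(catalyst_type: str, legacy_timestamp_mapping: bool = False) -> str:
    # Models the Spark 4.0 behavior: timezone-aware TimestampType now maps
    # to TIMESTAMP WITH LOCAL TIME ZONE, while TimestampNTZType keeps the
    # plain TIMESTAMP, removing the ambiguity of mapping both to TIMESTAMP.
    if catalyst_type == "TimestampType":
        return "TIMESTAMP" if legacy_timestamp_mapping else "TIMESTAMP WITH LOCAL TIME ZONE"
    if catalyst_type == "TimestampNTZType":
        return "TIMESTAMP"
    raise KeyError(f"unmapped type: {catalyst_type}")
```

With the legacy flag on, both timestamp types collapse back to `TIMESTAMP`, which restores the pre-4.0 (ambiguous) mapping.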
(spark) branch master updated: [SPARK-47417][SQL] Collation support: Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, FormatNumber, Sentences
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ee2673f2e948 [SPARK-47417][SQL] Collation support: Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, FormatNumber, Sentences ee2673f2e948 is described below commit ee2673f2e94811022f6a3d9a03ad119f7a8e5d65 Author: Nikola Mandic AuthorDate: Tue Apr 16 23:09:23 2024 +0800 [SPARK-47417][SQL] Collation support: Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, FormatNumber, Sentences ### What changes were proposed in this pull request? `Chr` and `Base64` are skipped as they don't accept input string types and don't need to be updated. Other functions are updated to accept collated strings as inputs. ### Why are the changes needed? Add collations support in string functions. ### Does this PR introduce _any_ user-facing change? Yes, it changes behavior of string functions when string parameters have collation. ### How was this patch tested? Add checks to `CollationStringExpressionsSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45933 from nikolamand-db/SPARK-47417-47418-47420. 
Authored-by: Nikola Mandic Signed-off-by: Wenchen Fan --- .../catalyst/expressions/stringExpressions.scala | 40 +-- .../expressions/StringExpressionsSuite.scala | 3 +- .../sql/CollationStringExpressionsSuite.scala | 77 +- 3 files changed, 99 insertions(+), 21 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala index 4fe57b4f8f02..b3029302c03d 100755 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala @@ -2352,7 +2352,7 @@ case class Ascii(child: Expression) extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant { override def dataType: DataType = IntegerType - override def inputTypes: Seq[DataType] = Seq(StringType) + override def inputTypes: Seq[AbstractDataType] = Seq(StringTypeAnyCollation) protected override def nullSafeEval(string: Any): Any = { // only pick the first character to reduce the `toString` cost @@ -2398,7 +2398,7 @@ case class Ascii(child: Expression) case class Chr(child: Expression) extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant { - override def dataType: DataType = StringType + override def dataType: DataType = SQLConf.get.defaultStringType override def inputTypes: Seq[DataType] = Seq(LongType) protected override def nullSafeEval(lon: Any): Any = { @@ -2447,7 +2447,7 @@ case class Chr(child: Expression) case class Base64(child: Expression) extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant { - override def dataType: DataType = StringType + override def dataType: DataType = SQLConf.get.defaultStringType override def inputTypes: Seq[DataType] = Seq(BinaryType) protected override def nullSafeEval(bytes: Any): Any = { @@ -2480,7 +2480,7 @@ case class UnBase64(child: Expression, 
failOnError: Boolean = false) extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant { override def dataType: DataType = BinaryType - override def inputTypes: Seq[DataType] = Seq(StringType) + override def inputTypes: Seq[AbstractDataType] = Seq(StringTypeAnyCollation) def this(expr: Expression) = this(expr, false) @@ -2672,8 +2672,8 @@ case class StringDecode(bin: Expression, charset: Expression, legacyCharsets: Bo override def left: Expression = bin override def right: Expression = charset - override def dataType: DataType = StringType - override def inputTypes: Seq[DataType] = Seq(BinaryType, StringType) + override def dataType: DataType = SQLConf.get.defaultStringType + override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType, StringTypeAnyCollation) private val supportedCharsets = Set( "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16") @@ -2750,7 +2750,8 @@ case class Encode(str: Expression, charset: Expression, legacyCharsets: Boolean) override def left: Expression = str override def right: Expression = charset override def dataType: DataType = BinaryType - override def inputTypes: Seq[DataType] = Seq(StringType, StringType) + override def inputTypes: Seq[AbstractDataType] = +Seq(StringTypeAnyCollation, StringTypeAnyCollation) private val supportedCharsets = Set( "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-
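The recurring edit in this diff — `Seq(StringType)` becoming `Seq(StringTypeAnyCollation)` — widens each function's accepted input from the single default-collated string type to a string of any collation. A toy Python model of that type check (class and function names are hypothetical, not Catalyst's):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringType:
    collation: str = "UTF8_BINARY"


class StringTypeAnyCollation:
    """Abstract input type: matches a string of any collation."""

    @staticmethod
    def accepts(dt) -> bool:
        return isinstance(dt, StringType)


def type_check(arg_type, expected) -> bool:
    # Before the change: exact match against StringType() only, so a
    # UNICODE_CI-collated input failed. After: the abstract expectation
    # accepts any collation.
    if hasattr(expected, "accepts"):
        return expected.accepts(arg_type)
    return arg_type == expected
```

This is why functions like `Ascii` and `StringDecode` above now take collated inputs, while their output side switches from a hard-coded `StringType` to the session's `defaultStringType`.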
(spark-kubernetes-operator) branch main updated: [SPARK-47745] Add License to Spark Operator repository
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git The following commit(s) were added to refs/heads/main by this push: new 7a3a7e8 [SPARK-47745] Add License to Spark Operator repository 7a3a7e8 is described below commit 7a3a7e882af2c8e8d463ebed71329212133d229c Author: zhou-jiang AuthorDate: Tue Apr 16 08:08:26 2024 -0700 [SPARK-47745] Add License to Spark Operator repository ### What changes were proposed in this pull request? This PR aims to add ASF license file. ### Why are the changes needed? To receive a code contribution. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #3 from jiangzho/license. Authored-by: zhou-jiang Signed-off-by: Dongjoon Hyun --- LICENSE | 201 1 file changed, 201 insertions(+) diff --git a/LICENSE b/LICENSE new file mode 100644 index 000..261eeb9 --- /dev/null +++ b/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 +http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. 
For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. 
For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants t
(spark-kubernetes-operator) branch main updated: Update GITHUB_API_BASE
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git The following commit(s) were added to refs/heads/main by this push: new a8eb690 Update GITHUB_API_BASE a8eb690 is described below commit a8eb690a7a85fd2b580e3756fad8d2bcf306e12c Author: Dongjoon Hyun AuthorDate: Tue Apr 16 08:06:10 2024 -0700 Update GITHUB_API_BASE --- dev/merge_spark_pr.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dev/merge_spark_pr.py b/dev/merge_spark_pr.py index 4647383..24e956d 100755 --- a/dev/merge_spark_pr.py +++ b/dev/merge_spark_pr.py @@ -65,7 +65,7 @@ GITHUB_OAUTH_KEY = os.environ.get("GITHUB_OAUTH_KEY") GITHUB_BASE = "https://github.com/apache/spark-kubernetes-operator/pull" -GITHUB_API_BASE = "https://api.github.com/repos/spark-kubernetes-operator" +GITHUB_API_BASE = "https://api.github.com/repos/apache/spark-kubernetes-operator" JIRA_BASE = "https://issues.apache.org/jira/browse" JIRA_API_BASE = "https://issues.apache.org/jira" # Prefix added to temporary branches
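The bug fixed here is a missing `apache/` owner segment: GitHub's REST routes have the shape `/repos/{owner}/{repo}/...`, so the old base produced 404s for every API call the merge script made. A small sketch of the corrected URL building (the helper name is illustrative, not part of `merge_spark_pr.py`):

```python
GITHUB_API_BASE = "https://api.github.com/repos/apache/spark-kubernetes-operator"


def pull_request_url(pr_number: int) -> str:
    # GitHub REST v3: GET /repos/{owner}/{repo}/pulls/{pull_number}
    return f"{GITHUB_API_BASE}/pulls/{pr_number}"
```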
(spark) branch master updated (4dad2170b05c -> 9b1b2b30d591)
This is an automated email from the ASF dual-hosted git repository. hvanhovell pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 4dad2170b05c [SPARK-47356][SQL] Add support for ConcatWs & Elt (all collations) add 9b1b2b30d591 [SPARK-47081][CONNECT][FOLLOW] Unflake Progress Execution No new revisions were added by this update. Summary of changes: .../sql/connect/execution/ExecuteGrpcResponseSender.scala | 2 -- .../sql/connect/execution/ExecuteResponseObserver.scala | 10 ++ .../spark/sql/connect/execution/ExecuteThreadRunner.scala | 5 +++-- python/pyspark/sql/connect/client/core.py | 10 +++--- python/pyspark/sql/tests/connect/shell/test_progress.py | 13 + 5 files changed, 29 insertions(+), 11 deletions(-)
(spark) branch master updated: [SPARK-47356][SQL] Add support for ConcatWs & Elt (all collations)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4dad2170b05c [SPARK-47356][SQL] Add support for ConcatWs & Elt (all collations) 4dad2170b05c is described below commit 4dad2170b05c04faf1da550ab3fb8c52a61b8be7 Author: Mihailo Milosevic AuthorDate: Tue Apr 16 21:21:24 2024 +0800 [SPARK-47356][SQL] Add support for ConcatWs & Elt (all collations) ### What changes were proposed in this pull request? Addition of support for the ConcatWs and Elt expressions. ### Why are the changes needed? We need to enable these functions to support collations in order to cover all functions in scope. ### Does this PR introduce _any_ user-facing change? Yes, both expressions will no longer return an error when called with collated strings. ### How was this patch tested? Addition of tests to `CollationStringExpressionsSuite` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46061 from mihailom-db/SPARK-47356. 
Authored-by: Mihailo Milosevic Signed-off-by: Wenchen Fan --- .../sql/catalyst/analysis/CollationTypeCasts.scala | 5 ++- .../catalyst/expressions/stringExpressions.scala | 25 ++-- .../sql/CollationStringExpressionsSuite.scala | 46 -- 3 files changed, 51 insertions(+), 25 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala index 1a14b4227de8..795e8a696b01 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala @@ -22,7 +22,7 @@ import javax.annotation.Nullable import scala.annotation.tailrec import org.apache.spark.sql.catalyst.analysis.TypeCoercion.{hasStringType, haveSameType} -import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, CaseWhen, Cast, Coalesce, Collate, Concat, ConcatWs, CreateArray, Expression, Greatest, If, In, InSubquery, Least} +import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, CaseWhen, Cast, Coalesce, Collate, Concat, ConcatWs, CreateArray, Elt, Expression, Greatest, If, In, InSubquery, Least} import org.apache.spark.sql.errors.QueryCompilationErrors import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.{ArrayType, DataType, StringType} @@ -45,6 +45,9 @@ object CollationTypeCasts extends TypeCoercionRule { caseWhenExpr.elseValue.map(e => castStringType(e, outputStringType).getOrElse(e)) CaseWhen(newBranches, newElseValue) +case eltExpr: Elt => + eltExpr.withNewChildren(eltExpr.children.head +: collateToSingleType(eltExpr.children.tail)) + case otherExpr @ ( _: In | _: InSubquery | _: CreateArray | _: ArrayJoin | _: Concat | _: Greatest | _: Least | _: Coalesce | _: BinaryExpression | _: ConcatWs) => diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala index 34e8f3f40859..4fe57b4f8f02 100755 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala @@ -37,7 +37,7 @@ import org.apache.spark.sql.catalyst.trees.TreePattern.{TreePattern, UPPER_OR_LO import org.apache.spark.sql.catalyst.util.{ArrayData, CollationSupport, GenericArrayData, TypeUtils} import org.apache.spark.sql.errors.{QueryCompilationErrors, QueryExecutionErrors} import org.apache.spark.sql.internal.SQLConf -import org.apache.spark.sql.internal.types.StringTypeAnyCollation +import org.apache.spark.sql.internal.types.{AbstractArrayType, StringTypeAnyCollation} import org.apache.spark.sql.types._ import org.apache.spark.unsafe.UTF8StringBuilder import org.apache.spark.unsafe.array.ByteArrayMethods @@ -79,11 +79,12 @@ case class ConcatWs(children: Seq[Expression]) /** The 1st child (separator) is str, and rest are either str or array of str. */ override def inputTypes: Seq[AbstractDataType] = { -val arrayOrStr = TypeCollection(ArrayType(StringType), StringType) -StringType +: Seq.fill(children.size - 1)(arrayOrStr) +val arrayOrStr = + TypeCollection(AbstractArrayType(StringTypeAnyCollation), StringTypeAnyCollation) +StringTypeAnyCollation +: Seq.fill(children.size - 1)(arrayOrStr) } - override def dataType: DataType = StringType + override def dataType: DataType = children.head.dataType override def nullabl
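`collateToSingleType` — applied in this diff to every `Elt` child after the index argument — boils down to: all string inputs must agree on one collation, or the cast rule fails. A hedged stdlib sketch of that resolution step (the error name mirrors Spark's `COLLATION_MISMATCH`; everything else is illustrative, and the real rule additionally distinguishes explicit `COLLATE` clauses from implicit collations, which is omitted here):

```python
def collate_to_single_type(collations, default="UTF8_BINARY"):
    """Return the single collation shared by all string inputs, or raise."""
    distinct = set(collations)
    if len(distinct) > 1:
        # Two different collations with equal precedence cannot be
        # silently merged: comparisons would be ambiguous.
        raise ValueError(f"COLLATION_MISMATCH: {sorted(distinct)}")
    return distinct.pop() if distinct else default
```

This also explains why `ConcatWs`/`Elt` now report `children.head.dataType` as their output type: once inputs are collated to a single type, the result carries that collation through.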
(spark) branch master updated: [SPARK-47739][SQL] Register logical avro type
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new fa2e9c7275aa [SPARK-47739][SQL] Register logical avro type fa2e9c7275aa is described below commit fa2e9c7275aa1c09652d0df0992565c32974b2b9 Author: milastdbx AuthorDate: Tue Apr 16 03:38:19 2024 -0700 [SPARK-47739][SQL] Register logical avro type ### What changes were proposed in this pull request? In this pull request I propose that we register logical Avro types when we initialize `AvroUtils` and `AvroFileFormat`; otherwise the very first schema discovery after Spark starts may return a wrong result. Screenshot: https://github.com/apache/spark/assets/150366084/3eaba6e3-34ec-4ca9-ae89-d0259ce942ba Example: ```scala val new_schema = """ | { | "type": "record", | "name": "Entry", | "fields": [ | { | "name": "rate", | "type": [ | "null", | { | "type": "long", | "logicalType": "custom-decimal", | "precision": 38, | "scale": 9 | } | ], | "default": null | } | ] | }""".stripMargin spark.read.format("avro").option("avroSchema", new_schema).load().printSchema // maps to long - WRONG spark.read.format("avro").option("avroSchema", new_schema).load().printSchema // maps to Decimal - CORRECT ``` ### Why are the changes needed? To fix an issue with resolving Avro schemas upon Spark startup. ### Does this PR introduce _any_ user-facing change? No, it's a bugfix. ### How was this patch tested? Unit tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #45895 from milastdbx/dev/milast/fixAvroLogicalTypeRegistration. 
Lead-authored-by: milastdbx Co-authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/sql/avro/AvroFileFormat.scala | 21 -- .../spark/sql/avro/AvroLogicalTypeInitSuite.scala | 76 ++ 2 files changed, 91 insertions(+), 6 deletions(-) diff --git a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala index 2792edaea284..372f24b54f5c 100755 --- a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala +++ b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala @@ -43,6 +43,8 @@ import org.apache.spark.util.SerializableConfiguration private[sql] class AvroFileFormat extends FileFormat with DataSourceRegister with Logging with Serializable { + AvroFileFormat.registerCustomAvroTypes() + override def equals(other: Any): Boolean = other match { case _: AvroFileFormat => true case _ => false @@ -173,10 +175,17 @@ private[sql] class AvroFileFormat extends FileFormat private[avro] object AvroFileFormat { val IgnoreFilesWithoutExtensionProperty = "avro.mapred.ignore.inputs.without.extension" - // Register the customized decimal type backed by long. - LogicalTypes.register(CustomDecimal.TYPE_NAME, new LogicalTypes.LogicalTypeFactory { -override def fromSchema(schema: Schema): LogicalType = { - new CustomDecimal(schema) -} - }) + /** + * Register Spark defined custom Avro types. + */ + def registerCustomAvroTypes(): Unit = { +// Register the customized decimal type backed by long. 
+    LogicalTypes.register(CustomDecimal.TYPE_NAME, new LogicalTypes.LogicalTypeFactory {
+      override def fromSchema(schema: Schema): LogicalType = {
+        new CustomDecimal(schema)
+      }
+    })
+  }
+
+  registerCustomAvroTypes()
 }

diff --git a/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroLogicalTypeInitSuite.scala b/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroLogicalTypeInitSuite.scala
new file mode 100644
index ..126440ed69b8
--- /dev/null
+++ b/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroLogicalTypeInitSuite.scala
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed o
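The bug fixed above is a lazy-registration ordering problem: the first schema resolution runs before the custom logical type is registered, so Avro falls back to the underlying primitive. The sketch below models that ordering in plain Python — `TypeRegistry`, `read_schema`, and the `"custom-decimal"` factory are illustrative stand-ins, not Spark or Avro API:

```python
# Minimal model of the lazy- vs eager-registration problem fixed above.
# All names here are hypothetical; Spark's real registry is Avro's LogicalTypes.

class TypeRegistry:
    """Maps a logicalType name to a factory; unknown types fall back to the primitive."""
    def __init__(self):
        self._factories = {}

    def register(self, name, factory):
        self._factories[name] = factory

    def resolve(self, primitive, logical_name):
        factory = self._factories.get(logical_name)
        # Unregistered logical type: silently fall back to the raw primitive.
        return factory(primitive) if factory else primitive

def read_schema(registry):
    # Resolve {"type": "long", "logicalType": "custom-decimal"}.
    return registry.resolve("long", "custom-decimal")

# Lazy registration: the first read resolves BEFORE the type is registered.
lazy = TypeRegistry()
first = read_schema(lazy)                      # "long" - WRONG, like the first printSchema
lazy.register("custom-decimal", lambda p: "decimal(38,9)")
second = read_schema(lazy)                     # "decimal(38,9)" - CORRECT

# Eager registration at initialization (the fix): register before any read.
eager = TypeRegistry()
eager.register("custom-decimal", lambda p: "decimal(38,9)")
print(read_schema(eager))                      # decimal(38,9) even on the very first read
```

Registering in the `AvroFileFormat` class constructor (as the patch does) plays the role of the eager registry here: the registration is guaranteed to run before any schema is resolved.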
(spark) branch master updated: [SPARK-46574][BUILD] Upgrade maven plugin to latest version
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new b7a729bfd19c [SPARK-46574][BUILD] Upgrade maven plugin to latest version
b7a729bfd19c is described below

commit b7a729bfd19cfa7a06d208f3899d329e414d5598
Author: panbingkun
AuthorDate: Tue Apr 16 03:30:12 2024 -0700

    [SPARK-46574][BUILD] Upgrade maven plugin to latest version

    ### What changes were proposed in this pull request?

    ### Why are the changes needed?

    - `exec-maven-plugin` from `3.1.0` to `3.2.0`
      https://github.com/mojohaus/exec-maven-plugin/releases/tag/3.2.0
      https://github.com/mojohaus/exec-maven-plugin/releases/tag/3.1.1
      Bug Fixes:
      1. Fix https://github.com/mojohaus/exec-maven-plugin/issues/158 - Fix non-ASCII character handling (https://github.com/mojohaus/exec-maven-plugin/pull/372)
      2. [https://github.com/mojohaus/exec-maven-plugin/issues/323] exec arguments missing (https://github.com/mojohaus/exec-maven-plugin/pull/324)

    - `build-helper-maven-plugin` from `3.4.0` to `3.5.0`
      https://github.com/mojohaus/build-helper-maven-plugin/releases/tag/3.5.0

    - `maven-compiler-plugin` from `3.12.1` to `3.13.0`
      https://github.com/apache/maven-compiler-plugin/releases/tag/maven-compiler-plugin-3.13.0

    - `maven-jar-plugin` from `3.3.0` to `3.4.0`
      https://github.com/apache/maven-jar-plugin/releases/tag/maven-jar-plugin-3.4.0
      [[MJAR-62]](https://issues.apache.org/jira/browse/MJAR-62) - Set Build-Jdk according to used toolchain (https://github.com/apache/maven-jar-plugin/pull/73)

    - `maven-source-plugin` from `3.3.0` to `3.3.1`
      https://github.com/apache/maven-source-plugin/releases/tag/maven-source-plugin-3.3.1

    - `maven-assembly-plugin` from `3.6.0` to `3.7.1`
      https://github.com/apache/maven-assembly-plugin/releases/tag/maven-assembly-plugin-3.7.1
      https://github.com/apache/maven-assembly-plugin/releases/tag/maven-assembly-plugin-3.7.0
      Bug Fixes:
      1. [[MASSEMBLY-967](https://issues.apache.org/jira/browse/MASSEMBLY-967)] - maven-assembly-plugin doesn't add target/class artifacts in generated jarfat but META-INF/MANIFEST.MF seems to be correct
      2. [[MASSEMBLY-994](https://issues.apache.org/jira/browse/MASSEMBLY-994)] - Items from unpacked dependency are not refreshed
      3. [[MASSEMBLY-998](https://issues.apache.org/jira/browse/MASSEMBLY-998)] - Transitive dependencies are not properly excluded as of 3.1.1
      4. [[MASSEMBLY-1008](https://issues.apache.org/jira/browse/MASSEMBLY-1008)] - Assembly plugin handles scopes wrongly
      5. [[MASSEMBLY-1020](https://issues.apache.org/jira/browse/MASSEMBLY-1020)] - Cannot invoke "java.io.File.isFile()" because "this.inputFile" is null
      6. [[MASSEMBLY-1021](https://issues.apache.org/jira/browse/MASSEMBLY-1021)] - Nullpointer in assembly:single when upgrading to 3.7.0
      7. [[MASSEMBLY-1022](https://issues.apache.org/jira/browse/MASSEMBLY-1022)] - Unresolved artifacts should be not processed

    - `cyclonedx-maven-plugin` from `2.7.9` to `2.8.0`
      https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.8.0
      https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.11
      https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.10
      Bug Fixes:
      1. check if configured schemaVersion is supported (https://github.com/CycloneDX/cyclonedx-maven-plugin/pull/479)
      2. ignore bomGenerator.generate() call (https://github.com/CycloneDX/cyclonedx-maven-plugin/pull/376)

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Pass GA.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #46043 from panbingkun/update_maven_plugins.
Authored-by: panbingkun Signed-off-by: Dongjoon Hyun --- pom.xml | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/pom.xml b/pom.xml index 99b238aac1dc..bf8d4f1b417d 100644 --- a/pom.xml +++ b/pom.xml @@ -116,7 +116,7 @@ 17 ${java.version} 3.9.6 -3.1.0 +3.2.0 spark 9.6 2.0.13 @@ -2994,7 +2994,7 @@ org.codehaus.mojo build-helper-maven-plugin - 3.4.0 + 3.5.0 module-timestamp-property @@ -3108,7 +3108,7 @@ org.apache.maven.plugins maven-compiler-plugin - 3.12.1 + 3.13.0 ${java.version} true @@ -3234,7 +3234,7 @@ org.apache.maven.plugins maven-jar-plugin - 3.3.0 + 3.4.0 org.apache.maven.plugins @@ -3244,7 +324
(spark) branch master updated: [SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new a1fc6d57b27d [SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests
a1fc6d57b27d is described below

commit a1fc6d57b27d24b832b2f2580e6acd64c4488c62
Author: Xi Lyu
AuthorDate: Tue Apr 16 16:27:32 2024 +0900

    [SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests

    ### What changes were proposed in this pull request?

    While building a DataFrame step by step, each step generates a new DataFrame
    whose schema is empty and lazily computed on access. If a user's code
    frequently accesses the schema of these new DataFrames through methods such
    as `df.columns`, this results in a large number of Analyze requests to the
    server. Each request reanalyzes the entire plan, leading to poor performance,
    especially when constructing highly complex plans. By introducing a plan
    cache in SparkConnectPlanner, we aim to reduce the overhead of this repeated
    analysis: significant computation is saved whenever the resolved logical plan
    of a subtree can be cached.

    A minimal example of the problem:

    ```
    import pyspark.sql.functions as F

    df = spark.range(10)
    for i in range(200):
        if str(i) not in df.columns:  # <-- The df.columns call causes a new Analyze request in every iteration
            df = df.withColumn(str(i), F.col("id") + i)
    df.show()
    ```

    With this patch, the performance of the above code improved from ~110s to ~5s.

    ### Why are the changes needed?

    The performance improvement is huge in the above cases.

    ### Does this PR introduce _any_ user-facing change?

    Yes, a static conf `spark.connect.session.planCache.maxSize` and a dynamic conf `spark.connect.session.planCache.enabled` are added.
    * `spark.connect.session.planCache.maxSize`: Sets the maximum number of cached resolved logical plans in a Spark Connect session. If set to a value less than or equal to zero, the plan cache is disabled.
    * `spark.connect.session.planCache.enabled`: When true, the cache of resolved logical plans is enabled if `spark.connect.session.planCache.maxSize` is greater than zero. When false, the cache is disabled even if `spark.connect.session.planCache.maxSize` is greater than zero. The caching is best-effort and not guaranteed.

    ### How was this patch tested?

    Some new tests are added in SparkConnectSessionHolderSuite.scala.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #46012 from xi-db/SPARK-47818-plan-cache.

Lead-authored-by: Xi Lyu
Co-authored-by: Xi Lyu <159039256+xi...@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon
---
 .../apache/spark/sql/connect/config/Connect.scala  |  18 ++
 .../sql/connect/planner/SparkConnectPlanner.scala  | 201 -
 .../spark/sql/connect/service/SessionHolder.scala  |  79 +++-
 .../service/SparkConnectAnalyzeHandler.scala       |  26 +--
 .../service/SparkConnectSessionHolderSuite.scala   | 125 -
 5 files changed, 345 insertions(+), 104 deletions(-)

diff --git a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/config/Connect.scala b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/config/Connect.scala
index 6ba100af1bb9..e94e86587393 100644
--- a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/config/Connect.scala
+++ b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/config/Connect.scala
@@ -273,4 +273,22 @@ object Connect {
       .version("4.0.0")
       .timeConf(TimeUnit.MILLISECONDS)
       .createWithDefaultString("2s")
+
+  val CONNECT_SESSION_PLAN_CACHE_SIZE =
+    buildStaticConf("spark.connect.session.planCache.maxSize")
+      .doc("Sets the maximum number of cached resolved logical plans in Spark Connect Session."
+        +" If set to a value less or equal than zero will disable the plan cache.")
+      .version("4.0.0")
+      .intConf
+      .createWithDefault(5)
+
+  val CONNECT_SESSION_PLAN_CACHE_ENABLED =
+    buildConf("spark.connect.session.planCache.enabled")
+      .doc("When true, the cache of resolved logical plans is enabled if" +
+        s" '${CONNECT_SESSION_PLAN_CACHE_SIZE.key}' is greater than zero." +
+        s" When false, the cache is disabled even if '${CONNECT_SESSION_PLAN_CACHE_SIZE.key}' is" +
+        " greater than zero. The caching is best-effort and not guaranteed.")
+      .version("4.0.0")
+      .booleanConf
+      .createWithDefault(true)
 }

diff --git a/connector/connect/server/src/main/scala/o
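The cache described above behaves like a small bounded LRU map from a plan to its resolved form, consulted before running full analysis. The following Python sketch models that behavior under stated assumptions — `PlanCache`, `analyze`, and the call counter are hypothetical stand-ins for illustration, not the actual SessionHolder code:

```python
from collections import OrderedDict

class PlanCache:
    """Best-effort LRU cache of resolved plans. A max_size <= 0 disables caching,
    mirroring the semantics of spark.connect.session.planCache.maxSize above."""

    def __init__(self, max_size=5, enabled=True):
        self.max_size = max_size
        self.enabled = enabled and max_size > 0
        self._cache = OrderedDict()

    def get_or_analyze(self, plan, analyze_fn):
        if not self.enabled:
            return analyze_fn(plan)
        if plan in self._cache:
            self._cache.move_to_end(plan)      # refresh LRU order on a hit
            return self._cache[plan]
        resolved = analyze_fn(plan)
        self._cache[plan] = resolved
        if len(self._cache) > self.max_size:
            self._cache.popitem(last=False)    # evict the least recently used entry
        return resolved

analyze_calls = 0

def analyze(plan):
    # Stand-in for full logical-plan analysis; counts how often it actually runs.
    global analyze_calls
    analyze_calls += 1
    return f"resolved({plan})"

cache = PlanCache(max_size=2)
for _ in range(200):                           # repeated df.columns-style schema lookups
    cache.get_or_analyze("Range(10)", analyze)
print(analyze_calls)                           # 1 — only the first access pays for analysis
```

This mirrors why the repro in the commit message drops from ~110s to ~5s: every loop iteration issues the same schema lookup, and with the cache only the first one triggers a full analysis.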