(spark) branch master updated: [SPARK-46812][CONNECT][PYTHON][FOLLOW-UP] Add pyspark.sql.connect.resource into PyPI packaging
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 8c2c0fb43a9d [SPARK-46812][CONNECT][PYTHON][FOLLOW-UP] Add pyspark.sql.connect.resource into PyPI packaging

8c2c0fb43a9d is described below

commit 8c2c0fb43a9d3c5bffcf33aeb3354c01fe6b26cd
Author: Hyukjin Kwon
AuthorDate: Wed Apr 17 15:27:23 2024 +0900

    [SPARK-46812][CONNECT][PYTHON][FOLLOW-UP] Add pyspark.sql.connect.resource into PyPI packaging

    ### What changes were proposed in this pull request?
    This PR proposes to add `pyspark.sql.connect.resource` to PyPI packaging.

    ### Why are the changes needed?
    So that PyPI end users can download PySpark and leverage this feature.

    ### Does this PR introduce _any_ user-facing change?
    No, the main change has not been released.

    ### How was this patch tested?
    Being tested at https://github.com/apache/spark/pull/46090

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #46094 from HyukjinKwon/SPARK-46812-followup.
Authored-by: Hyukjin Kwon
Signed-off-by: Hyukjin Kwon
---
 python/packaging/classic/setup.py | 1 +
 python/packaging/connect/setup.py | 1 +
 2 files changed, 2 insertions(+)

diff --git a/python/packaging/classic/setup.py b/python/packaging/classic/setup.py
index 8eefc17db700..f900fa6e15ee 100755
--- a/python/packaging/classic/setup.py
+++ b/python/packaging/classic/setup.py
@@ -276,6 +276,7 @@ try:
         "pyspark.sql.connect.functions",
         "pyspark.sql.connect.proto",
         "pyspark.sql.connect.protobuf",
+        "pyspark.sql.connect.resource",
         "pyspark.sql.connect.shell",
         "pyspark.sql.connect.streaming",
         "pyspark.sql.connect.streaming.worker",
diff --git a/python/packaging/connect/setup.py b/python/packaging/connect/setup.py
index fe1e7486faa9..19925962804b 100755
--- a/python/packaging/connect/setup.py
+++ b/python/packaging/connect/setup.py
@@ -145,6 +145,7 @@ try:
         "pyspark.sql.connect.functions",
         "pyspark.sql.connect.proto",
         "pyspark.sql.connect.protobuf",
+        "pyspark.sql.connect.resource",
         "pyspark.sql.connect.shell",
         "pyspark.sql.connect.streaming",
         "pyspark.sql.connect.streaming.worker",

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
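Both `setup.py` files enumerate PySpark's subpackages by hand, so a new module such as `pyspark.sql.connect.resource` is silently left out of the published wheel until someone adds it, which is exactly what this follow-up does. A minimal stand-alone sketch (not Spark's actual build script) of how such an omission can be detected by comparing a hand-maintained list against `setuptools.find_packages` discovery:

```python
# Sketch: why an explicitly enumerated package list can miss a new subpackage.
# Builds a throwaway package tree, then compares a hand-maintained list
# (mimicking python/packaging/*/setup.py) against setuptools discovery.
import os
import tempfile

from setuptools import find_packages

with tempfile.TemporaryDirectory() as root:
    # Lay out pyspark/sql/connect/resource as importable packages.
    os.makedirs(os.path.join(root, "pyspark", "sql", "connect", "resource"))
    for part in ("pyspark", "pyspark/sql", "pyspark/sql/connect",
                 "pyspark/sql/connect/resource"):
        open(os.path.join(root, part, "__init__.py"), "w").close()

    discovered = set(find_packages(where=root))
    # A stale, hand-maintained list like the one the commit patches:
    declared = {"pyspark", "pyspark.sql", "pyspark.sql.connect"}

    missing = discovered - declared
    print(sorted(missing))  # subpackages that would be absent from the wheel
```

A comparison like this in CI is one way to catch the class of packaging bug this follow-up fixes.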
(spark) branch master updated: [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f77495909b29 [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework

f77495909b29 is described below

commit f77495909b29fe4883afcfd8fec7be048fe494a3
Author: Gengliang Wang
AuthorDate: Tue Apr 16 22:32:34 2024 -0700

    [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework

    ### What changes were proposed in this pull request?
    Migrate logInfo calls with variables in the Hive module to the structured logging framework.

    ### Why are the changes needed?
    To enhance Apache Spark's logging system by implementing structured logging.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    GA tests

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #46086 from gengliangwang/hive_loginfo.
Authored-by: Gengliang Wang
Signed-off-by: Gengliang Wang
---
 .../scala/org/apache/spark/internal/LogKey.scala |  4 +++
 .../spark/sql/hive/HiveExternalCatalog.scala     | 30 --
 .../spark/sql/hive/HiveMetastoreCatalog.scala    |  9 ---
 .../org/apache/spark/sql/hive/HiveUtils.scala    | 27 +++
 .../spark/sql/hive/client/HiveClientImpl.scala   |  5 ++--
 .../sql/hive/client/IsolatedClientLoader.scala   |  4 +--
 .../spark/sql/hive/orc/OrcFileOperator.scala     |  4 +--
 7 files changed, 48 insertions(+), 35 deletions(-)

diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala
index bfeb733af30a..838ef0355e3a 100644
--- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala
+++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala
@@ -95,10 +95,13 @@ object LogKey extends Enumeration {
   val GROUP_ID = Value
   val HADOOP_VERSION = Value
   val HISTORY_DIR = Value
+  val HIVE_CLIENT_VERSION = Value
+  val HIVE_METASTORE_VERSION = Value
   val HIVE_OPERATION_STATE = Value
   val HIVE_OPERATION_TYPE = Value
   val HOST = Value
   val HOST_PORT = Value
+  val INCOMPATIBLE_TYPES = Value
   val INDEX = Value
   val INFERENCE_MODE = Value
   val INITIAL_CAPACITY = Value
@@ -152,6 +155,7 @@ object LogKey extends Enumeration {
   val POLICY = Value
   val PORT = Value
   val PRODUCER_ID = Value
+  val PROVIDER = Value
   val QUERY_CACHE_VALUE = Value
   val QUERY_HINT = Value
   val QUERY_ID = Value
diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
index 8c35e10b383f..60f2d2f3e5fe 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
@@ -34,7 +34,7 @@ import org.apache.thrift.TException

 import org.apache.spark.{SparkConf, SparkException}
 import org.apache.spark.internal.{Logging, MDC}
-import org.apache.spark.internal.LogKey.{DATABASE_NAME, SCHEMA, SCHEMA2, TABLE_NAME}
+import org.apache.spark.internal.LogKey.{DATABASE_NAME, INCOMPATIBLE_TYPES, PROVIDER, SCHEMA, SCHEMA2, TABLE_NAME}
 import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.catalyst.TableIdentifier
 import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException
@@ -338,35 +338,37 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     val (hiveCompatibleTable, logMessage) = maybeSerde match {
       case _ if options.skipHiveMetadata =>
         val message =
-          s"Persisting data source table $qualifiedTableName into Hive metastore in" +
-            "Spark SQL specific format, which is NOT compatible with Hive."
+          log"Persisting data source table ${MDC(TABLE_NAME, qualifiedTableName)} into Hive " +
+            log"metastore in Spark SQL specific format, which is NOT compatible with Hive."
         (None, message)

       case _ if incompatibleTypes.nonEmpty =>
+        val incompatibleTypesStr = incompatibleTypes.mkString(", ")
         val message =
-          s"Hive incompatible types found: ${incompatibleTypes.mkString(", ")}. " +
-            s"Persisting data source table $qualifiedTableName into Hive metastore in " +
-            "Spark SQL specific format, which is NOT compatible with Hive."
+          log"Hive incompatible types found: ${MDC(INCOMPATIBLE_TYPES, incompatibleTypesStr)}. " +
+            log"Persisting data source table ${MDC(TABLE_NAME, qualifiedTableName)} into Hive " +
+            log"metastore in Spark SQL specific format, which is NOT compatible with Hive."
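For readers unfamiliar with the framework: each `MDC(KEY, value)` interpolation attaches a machine-readable key (a `LogKey` such as `TABLE_NAME`) to the rendered message, so log aggregators can filter on fields instead of regex-matching free-form strings. A rough Python analogue of that idea (not Spark's API) using the stdlib `logging` module's `extra` mechanism and a custom formatter:

```python
# Rough analogue of Spark's MDC-based structured logging, stdlib only.
# The keys (TABLE_NAME, INCOMPATIBLE_TYPES) mirror Spark's LogKey values.
import io
import json
import logging


class StructuredFormatter(logging.Formatter):
    """Render the message plus any MDC-style fields as one JSON object."""

    def format(self, record):
        payload = {"msg": record.getMessage()}
        payload.update(getattr(record, "mdc", {}))
        return json.dumps(payload, sort_keys=True)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(StructuredFormatter())
logger = logging.getLogger("hive-demo")
logger.addHandler(handler)
logger.propagate = False
logger.setLevel(logging.INFO)

logger.info(
    "Persisting data source table %s into Hive metastore", "db.tbl",
    extra={"mdc": {"TABLE_NAME": "db.tbl", "INCOMPATIBLE_TYPES": "map<int,int>"}},
)
print(stream.getvalue().strip())
```

The JSON line carries both the human-readable message and the queryable keys, which is the same separation the `log"... ${MDC(...)}"` interpolator provides in Scala.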
(spark) branch master updated (268856da31c1 -> 2054ab0fb03f)
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

 from 268856da31c1 [SPARK-47879][SQL] Oracle: Use VARCHAR2 instead of VARCHAR for VarcharType mapping
  add 2054ab0fb03f [SPARK-47880][SQL][DOCS] Oracle: Document Mapping Spark SQL Data Types to Oracle

No new revisions were added by this update.

Summary of changes:
 docs/sql-data-sources-jdbc.md | 106 ++
 1 file changed, 106 insertions(+)
(spark) branch master updated (ab6338e09aa0 -> 268856da31c1)
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

 from ab6338e09aa0 [SPARK-47838][BUILD] Upgrade `rocksdbjni` to 8.11.4
  add 268856da31c1 [SPARK-47879][SQL] Oracle: Use VARCHAR2 instead of VARCHAR for VarcharType mapping

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala | 11 ++-
 .../main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala  |  1 +
 .../src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala  |  1 +
 3 files changed, 12 insertions(+), 1 deletion(-)
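The dialect change itself is a one-line Scala mapping, but the idea generalizes: a JDBC dialect translates Spark SQL types into vendor-specific DDL types, and SPARK-47879 switches `VarcharType` to Oracle's `VARCHAR2` rather than the deprecated `VARCHAR`. A hedged Python sketch of such a mapping (function name and structure are hypothetical; Spark's real `OracleDialect` is Scala and covers many more types):

```python
# Illustrative sketch of a JDBC-dialect type mapping in the spirit of
# SPARK-47879. The function name is hypothetical, not Spark's API.
def oracle_ddl_type(spark_type, length=None):
    """Map a Spark SQL type name to an Oracle DDL type string."""
    fixed = {
        "StringType": "VARCHAR2(255)",
        "IntegerType": "NUMBER(10)",
        "BooleanType": "NUMBER(1)",
    }
    if spark_type == "VarcharType":
        # Post-SPARK-47879 behavior: VARCHAR2(n) instead of VARCHAR(n).
        return "VARCHAR2({})".format(length)
    return fixed[spark_type]


print(oracle_ddl_type("VarcharType", 100))  # VARCHAR2(100)
```

Centralizing the mapping in one dialect function is what makes a change like VARCHAR → VARCHAR2 a one-line fix.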
(spark) branch master updated: [SPARK-47838][BUILD] Upgrade `rocksdbjni` to 8.11.4
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new ab6338e09aa0 [SPARK-47838][BUILD] Upgrade `rocksdbjni` to 8.11.4

ab6338e09aa0 is described below

commit ab6338e09aa0fe06aef1c753eaaf677f766e9490
Author: Neil Ramaswamy
AuthorDate: Tue Apr 16 20:11:16 2024 -0700

    [SPARK-47838][BUILD] Upgrade `rocksdbjni` to 8.11.4

    ### What changes were proposed in this pull request?
    Upgrades the `rocksdbjni` dependency to 8.11.4.

    ### Why are the changes needed?
    8.11.4 has Java-related RocksDB fixes: https://github.com/facebook/rocksdb/releases/tag/v8.11.4
    - Fixed the CMake Javadoc build
    - Fixed Java SstFileMetaData to prevent throwing java.lang.NoSuchMethodError

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    - All existing UTs should pass
    - [In progress] Performance benchmarks

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #46065 from neilramaswamy/spark-47838.
Authored-by: Neil Ramaswamy
Signed-off-by: Dongjoon Hyun
---
 dev/deps/spark-deps-hadoop-3-hive-2.3              |   2 +-
 pom.xml                                            |   2 +-
 ...StoreBasicOperationsBenchmark-jdk21-results.txt | 122 +++--
 .../StateStoreBasicOperationsBenchmark-results.txt | 122 +++--
 4 files changed, 126 insertions(+), 122 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 466e8d09d89e..54e54a108904 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -247,7 +247,7 @@ parquet-jackson/1.13.1//parquet-jackson-1.13.1.jar
 pickle/1.3//pickle-1.3.jar
 py4j/0.10.9.7//py4j-0.10.9.7.jar
 remotetea-oncrpc/1.1.2//remotetea-oncrpc-1.1.2.jar
-rocksdbjni/8.11.3//rocksdbjni-8.11.3.jar
+rocksdbjni/8.11.4//rocksdbjni-8.11.4.jar
 scala-collection-compat_2.13/2.7.0//scala-collection-compat_2.13-2.7.0.jar
 scala-compiler/2.13.13//scala-compiler-2.13.13.jar
 scala-library/2.13.13//scala-library-2.13.13.jar
diff --git a/pom.xml b/pom.xml
index bf8d4f1b417d..7ded74b9f9df 100644
--- a/pom.xml
+++ b/pom.xml
@@ -687,7 +687,7 @@
       <groupId>org.rocksdb</groupId>
       <artifactId>rocksdbjni</artifactId>
-      <version>8.11.3</version>
+      <version>8.11.4</version>
     </dependency>
     <dependency>
       <groupId>${leveldbjni.group}</groupId>
diff --git a/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk21-results.txt b/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk21-results.txt
index 0317e6116375..953031fc1daf 100644
--- a/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk21-results.txt
+++ b/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk21-results.txt
@@ -2,141 +2,143 @@ put rows
-OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1017-azure
 AMD EPYC 7763 64-Core Processor
 putting 1 rows (1 rows to overwrite - rate 100):  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
 ------------------------------------------------------------------------------------------------------------------------
-In-memory                                     9    10    1    1.1    936.2    1.0X
-RocksDB (trackTotalNumberOfRows: true)       41    42    1    0.2   4068.9    0.2X
-RocksDB (trackTotalNumberOfRows: false)      15    16    1    0.7   1500.4    0.6X
+In-memory                                     9    10    1    1.1    938.9    1.0X
+RocksDB (trackTotalNumberOfRows: true)       42    44    2    0.2   4215.2    0.2X
+RocksDB (trackTotalNumberOfRows: false)      15    16    1    0.7   1535.3    0.6X

-OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
+OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1017-azure
 AMD EPYC 7763 64-Core Processor
 putting 1 rows (5000 rows to overwrite - rate 50):  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
 --------------------------------------------------------------------------------------------------------------------------
-In-memory                                     9    11    1    1.1    929.8    1.0X
-RocksDB (trackTotalNumberOfRows: true)       40
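The benchmark columns are unit conversions of one another: `Per Row(ns)` and `Rate(M/s)` both follow from the row count and the best wall-clock time. A small sketch of that arithmetic (row count and timing below are illustrative, taken roughly from the "In-memory" line):

```python
# How the benchmark's derived columns relate: given a row count and the best
# wall-clock time, Rate(M/s) and Per Row(ns) are just unit conversions.
def derived_columns(rows, best_ms):
    per_row_ns = best_ms * 1_000_000 / rows        # ms -> ns, divided per row
    rate_m_per_s = rows / (best_ms / 1000) / 1e6   # rows per second, in millions
    return per_row_ns, rate_m_per_s


# Roughly the "In-memory put" line: assume 10_000 rows in ~9.39 ms.
per_row, rate = derived_columns(10_000, 9.39)
print(round(per_row, 1), round(rate, 1))
```

Since the `Relative` column is each row's rate divided by the baseline rate, the whole table can be regenerated from the raw timings alone.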
(spark) branch master updated (f9ebe1b3d24b -> 6c827c10dc15)
This is an automated email from the ASF dual-hosted git repository.

xinrong pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

 from f9ebe1b3d24b [SPARK-46375][DOCS] Add user guide for Python data source API
  add 6c827c10dc15 [SPARK-47876][PYTHON][DOCS] Improve docstring of mapInArrow

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/pandas/map_ops.py | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)
(spark) branch master updated: [SPARK-46375][DOCS] Add user guide for Python data source API
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f9ebe1b3d24b [SPARK-46375][DOCS] Add user guide for Python data source API

f9ebe1b3d24b is described below

commit f9ebe1b3d24b126784b3bb65d1eb710a74cf63de
Author: allisonwang-db
AuthorDate: Wed Apr 17 09:54:42 2024 +0900

    [SPARK-46375][DOCS] Add user guide for Python data source API

    ### What changes were proposed in this pull request?
    This PR adds a new user guide for the Python data source API with a simple example. More examples (including streaming) will be added in the future.

    ### Why are the changes needed?
    To improve the documentation.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    doctest

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #46089 from allisonwang-db/spark-46375-pyds-user-guide.

Authored-by: allisonwang-db
Signed-off-by: Hyukjin Kwon
---
 python/docs/source/user_guide/sql/index.rst        |   1 +
 .../source/user_guide/sql/python_data_source.rst   | 139 +
 2 files changed, 140 insertions(+)

diff --git a/python/docs/source/user_guide/sql/index.rst b/python/docs/source/user_guide/sql/index.rst
index 118cf139d9b3..d1b67f7eeb90 100644
--- a/python/docs/source/user_guide/sql/index.rst
+++ b/python/docs/source/user_guide/sql/index.rst
@@ -25,5 +25,6 @@ Spark SQL
     arrow_pandas
     python_udtf
+    python_data_source
     type_conversions
diff --git a/python/docs/source/user_guide/sql/python_data_source.rst b/python/docs/source/user_guide/sql/python_data_source.rst
new file mode 100644
index ..19ed016b82c2
--- /dev/null
+++ b/python/docs/source/user_guide/sql/python_data_source.rst
@@ -0,0 +1,139 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.
+    See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License. You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied. See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+======================
+Python Data Source API
+======================
+
+.. currentmodule:: pyspark.sql
+
+Overview
+--------
+The Python Data Source API is a new feature introduced in Spark 4.0, enabling developers to read from custom data sources and write to custom data sinks in Python.
+This guide provides a comprehensive overview of the API and instructions on how to create, use, and manage Python data sources.
+
+
+Creating a Python Data Source
+-----------------------------
+To create a custom Python data source, you'll need to subclass the :class:`DataSource` base classes and implement the necessary methods for reading and writing data.
+
+This example demonstrates creating a simple data source to generate synthetic data using the `faker` library. Ensure the `faker` library is installed and accessible in your Python environment.
+
+**Step 1: Define the Data Source**
+
+Start by creating a new subclass of :class:`DataSource`. Define the source name, schema, and reader logic as follows:
+
+.. code-block:: python
+
+    from pyspark.sql.datasource import DataSource, DataSourceReader
+    from pyspark.sql.types import StructType
+
+    class FakeDataSource(DataSource):
+        """
+        A fake data source for PySpark to generate synthetic data using the `faker` library.
+        Options:
+        - numRows: specify number of rows to generate. Default value is 3.
+        """
+
+        @classmethod
+        def name(cls):
+            return "fake"
+
+        def schema(self):
+            return "name string, date string, zipcode string, state string"
+
+        def reader(self, schema: StructType):
+            return FakeDataSourceReader(schema, self.options)
+
+
+**Step 2: Implement the Reader**
+
+Define the reader logic to generate synthetic data. Use the `faker` library to populate each field in the schema.
+
+.. code-block:: python
+
+    class FakeDataSourceReader(DataSourceReader):
+
+        def __init__(self, schema, options):
+            self.schema: StructType = schema
+            self.options = options
+
+        def read(self, partition):
+            from faker import Fa
(spark) branch master updated (5321353b24db -> 86837d3155b1)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

 from 5321353b24db [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer`
  add 86837d3155b1 [SPARK-47877][SS][CONNECT] Speed up test_parity_listener

No new revisions were added by this update.

Summary of changes:
 .../connect/streaming/test_parity_listener.py      | 119 +++--
 .../sql/tests/streaming/test_streaming_listener.py |  21 ++--
 2 files changed, 71 insertions(+), 69 deletions(-)
(spark) branch master updated: [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 5321353b24db [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer`

5321353b24db is described below

commit 5321353b24db247087890c44de06b9ad4e136473
Author: Dongjoon Hyun
AuthorDate: Tue Apr 16 16:47:23 2024 -0700

    [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer`

    ### What changes were proposed in this pull request?
    This is a logical revert of SPARK-46205:
    - #44113
    - #44118

    ### Why are the changes needed?
    The initial implementation didn't handle the class initialization logic properly. Until we have a fix, I'd like to revert this from the `master` branch.

    ### Does this PR introduce _any_ user-facing change?
    No, this is not released yet.

    ### How was this patch tested?
    Pass the CIs.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #46087 from dongjoon-hyun/SPARK-47875.
Authored-by: Dongjoon Hyun
Signed-off-by: Dongjoon Hyun
---
 .../PersistenceEngineBenchmark-jdk21-results.txt   |  7 --
 .../PersistenceEngineBenchmark-results.txt         |  7 --
 .../org/apache/spark/deploy/master/Master.scala    |  7 ++
 .../org/apache/spark/internal/config/Deploy.scala  | 14 ----
 .../deploy/master/PersistenceEngineBenchmark.scala |  4 ++--
 .../deploy/master/PersistenceEngineSuite.scala     | 14 +---
 .../apache/spark/deploy/master/RecoverySuite.scala | 25 ++
 docs/spark-standalone.md                           | 12 ++-
 8 files changed, 9 insertions(+), 81 deletions(-)

diff --git a/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt b/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
index 2a6bd778fc8a..ae4e0071adb0 100644
--- a/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
+++ b/core/benchmarks/PersistenceEngineBenchmark-jdk21-results.txt
@@ -7,19 +7,12 @@ AMD EPYC 7763 64-Core Processor
 1000 Workers:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
 ZooKeeperPersistenceEngine with JavaSerializer            5036  5232  229  0.0  5035730.1      1.0X
-ZooKeeperPersistenceEngine with KryoSerializer            4038  4053   16  0.0  4038447.8      1.2X
 FileSystemPersistenceEngine with JavaSerializer           2902  2906    5  0.0  2902453.3      1.7X
 FileSystemPersistenceEngine with JavaSerializer (lz4)      816   829   19  0.0   816173.1      6.2X
 FileSystemPersistenceEngine with JavaSerializer (lzf)      755   780   33  0.0   755209.0      6.7X
 FileSystemPersistenceEngine with JavaSerializer (snappy)   814   832   16  0.0   813672.5      6.2X
 FileSystemPersistenceEngine with JavaSerializer (zstd)     987  1014   45  0.0   986834.7      5.1X
-FileSystemPersistenceEngine with KryoSerializer            687   698   14  0.0   687313.5      7.3X
-FileSystemPersistenceEngine with KryoSerializer (lz4)      590   599   15  0.0   589867.9      8.5X
-FileSystemPersistenceEngine with KryoSerializer (lzf)      915   922    9  0.0   915432.2      5.5X
-FileSystemPersistenceEngine with KryoSerializer (snappy)   768   795   37  0.0   768494.4      6.6X
-FileSystemPersistenceEngine with KryoSerializer (zstd)     898   950   45  0.0   898118.6      5.6X
 RocksDBPersistenceEngine with JavaSerializer               299   299    0  0.0   298800.0     16.9X
-RocksDBPersistenceEngine with KryoSerializer               112   113    1  0.0   111779.6     45.1X
 BlackHolePersistenceEngine                                   0     0    0  5.5      180.3  27924.2X

diff --git a/core/benchmarks/PersistenceEngineBenchmark-results.txt b/core/benchmarks/PersistenceEngineBenchmark-results.txt
index da1838608de1..ec9a6fc1c8cf 100644
--- a/core/benchmarks/PersistenceEngineBenchmark-results.txt
+++ b/core/benchmarks/PersistenceEngineBenchmark-results.txt
@@ -7,19 +7,12 @@ AMD EPYC 7763 64-Core Processor
 1000 Workers:  Best Time
(spark) branch master updated: [SPARK-47760][SPARK-47763][CONNECT][TESTS] Reenable Avro and Protobuf function doctests
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 57c7db2c4c1d [SPARK-47760][SPARK-47763][CONNECT][TESTS] Reenable Avro and Protobuf function doctests

57c7db2c4c1d is described below

commit 57c7db2c4c1dbeeba062fe28ab58245e0a3098eb
Author: Hyukjin Kwon
AuthorDate: Wed Apr 17 08:47:01 2024 +0900

    [SPARK-47760][SPARK-47763][CONNECT][TESTS] Reenable Avro and Protobuf function doctests

    ### What changes were proposed in this pull request?
    This PR proposes to reenable the Avro and Protobuf function doctests by providing the required jars to the Spark Connect server.

    ### Why are the changes needed?
    For test coverage of Avro and Protobuf functions.

    ### Does this PR introduce _any_ user-facing change?
    No, test-only.

    ### How was this patch tested?
    Tested in my fork: https://github.com/HyukjinKwon/spark/actions/runs/8704014674/job/23871383802

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #46055 from HyukjinKwon/SPARK-47763-SPARK-47760.
Authored-by: Hyukjin Kwon
Signed-off-by: Hyukjin Kwon
---
 .github/workflows/build_python_connect.yml       | 11 ++-
 python/pyspark/sql/connect/avro/functions.py     |  7 ---
 python/pyspark/sql/connect/protobuf/functions.py |  7 ---
 3 files changed, 6 insertions(+), 19 deletions(-)

diff --git a/.github/workflows/build_python_connect.yml b/.github/workflows/build_python_connect.yml
index 965e839b6b2b..863980b0c2e5 100644
--- a/.github/workflows/build_python_connect.yml
+++ b/.github/workflows/build_python_connect.yml
@@ -29,7 +29,6 @@ jobs:
     name: "Build modules: pyspark-connect"
     runs-on: ubuntu-latest
     timeout-minutes: 300
-    if: github.repository == 'apache/spark'
     steps:
       - name: Checkout Spark repository
         uses: actions/checkout@v4
@@ -63,7 +62,7 @@ jobs:
           architecture: x64
       - name: Build Spark
         run: |
-          ./build/sbt -Phive test:package
+          ./build/sbt -Phive Test/package
       - name: Install pure Python package (pyspark-connect)
         env:
           SPARK_TESTING: 1
@@ -82,7 +81,9 @@ jobs:
           cp conf/log4j2.properties.template conf/log4j2.properties
           sed -i 's/rootLogger.level = info/rootLogger.level = warn/g' conf/log4j2.properties
           # Start a Spark Connect server
-          PYTHONPATH="python/lib/pyspark.zip:python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH" ./sbin/start-connect-server.sh --driver-java-options "-Dlog4j.configurationFile=file:$GITHUB_WORKSPACE/conf/log4j2.properties" --jars `find connector/connect/server/target -name spark-connect*SNAPSHOT.jar`
+          PYTHONPATH="python/lib/pyspark.zip:python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH" ./sbin/start-connect-server.sh \
+            --driver-java-options "-Dlog4j.configurationFile=file:$GITHUB_WORKSPACE/conf/log4j2.properties" \
+            --jars "`find connector/connect/server/target -name spark-connect-*SNAPSHOT.jar`,`find connector/protobuf/target -name spark-protobuf-*SNAPSHOT.jar`,`find connector/avro/target -name spark-avro*SNAPSHOT.jar`"
           # Make sure running Python workers that contains pyspark.core once. They will be reused.
           python -c "from pyspark.sql import SparkSession; _ = SparkSession.builder.remote('sc://localhost').getOrCreate().range(100).repartition(100).mapInPandas(lambda x: x, 'id INT').collect()"
           # Remove Py4J and PySpark zipped library to make sure there is no JVM connection
@@ -98,9 +99,9 @@ jobs:
         with:
           name: test-results-spark-connect-python-only
           path: "**/target/test-reports/*.xml"
-      - name: Upload unit tests log files
+      - name: Upload Spark Connect server log file
         if: failure()
         uses: actions/upload-artifact@v4
         with:
           name: unit-tests-log-spark-connect-python-only
-          path: "**/target/unit-tests.log"
+          path: logs/*.out
diff --git a/python/pyspark/sql/connect/avro/functions.py b/python/pyspark/sql/connect/avro/functions.py
index 43088333b108..f153b17acf58 100644
--- a/python/pyspark/sql/connect/avro/functions.py
+++ b/python/pyspark/sql/connect/avro/functions.py
@@ -80,15 +80,8 @@ def _test() -> None:
     import doctest
     from pyspark.sql import SparkSession as PySparkSession
     import pyspark.sql.connect.avro.functions
-    from pyspark.util import is_remote_only

     globs = pyspark.sql.connect.avro.functions.__dict__.copy()
-
-    # TODO(SPARK-47760): Reeanble Avro function doctests
-    if is_remote_only():
-        del pyspark.sql.connect.avro.functions.from_avro
-        del pyspark.sql.connect.avro.functions.to_avro
-
     globs["spark"] = (
         PySparkSession.buil
(spark) branch master updated: [SPARK-47868][CONNECT] Fix recursion limit error in SparkConnectPlanner and SparkSession
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 119271d49356 [SPARK-47868][CONNECT] Fix recursion limit error in SparkConnectPlanner and SparkSession

119271d49356 is described below

commit 119271d4935605f15c358c52410dc20db40ace86
Author: Tom van Bussel
AuthorDate: Wed Apr 17 07:33:24 2024 +0800

    [SPARK-47868][CONNECT] Fix recursion limit error in SparkConnectPlanner and SparkSession

    ### What changes were proposed in this pull request?
    This PR adds a helper function to `ProtoUtils` that calls a proto parser on a byte array with an increased recursion limit. This helper function is used to enhance the `parseFrom` calls in `SparkSession` and `SparkConnectPlanner`.

    ### Why are the changes needed?
    Otherwise Spark Connect extensions will get the following exception:
    ```
    org.apache.spark.SparkException: grpc_shaded.com.google.protobuf.InvalidProtocolBufferException: Protocol message had too many levels of nesting. May be malicious. Use CodedInputStream.setRecursionLimit() to increase the depth limit.
    ```

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    Existing tests to make sure nothing breaks. Manually tested (using an extension that is currently in development) that this solves the exceptions.

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #46075 from tomvanbussel/SPARK-47868.
Authored-by: Tom van Bussel
Signed-off-by: Ruifeng Zheng
---
 .../scala/org/apache/spark/sql/SparkSession.scala  |  7 +-
 .../sql/connect/client/SparkConnectClient.scala    | 10 -
 .../spark/sql/connect/common/ProtoUtils.scala      | 25 --
 .../sql/connect/common/config/ConnectCommon.scala  |  3 ++-
 .../apache/spark/sql/connect/config/Connect.scala  |  2 +-
 .../sql/connect/planner/SparkConnectPlanner.scala  | 14
 6 files changed, 51 insertions(+), 10 deletions(-)

diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
index 5a2d9bc44c9f..54e29c80c728 100644
--- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
+++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
@@ -39,6 +39,7 @@ import org.apache.spark.sql.catalyst.encoders.AgnosticEncoders.{BoxedLongEncoder
 import org.apache.spark.sql.connect.client.{ClassFinder, SparkConnectClient, SparkResult}
 import org.apache.spark.sql.connect.client.SparkConnectClient.Configuration
 import org.apache.spark.sql.connect.client.arrow.ArrowSerializer
+import org.apache.spark.sql.connect.common.ProtoUtils
 import org.apache.spark.sql.functions.lit
 import org.apache.spark.sql.internal.{CatalogImpl, SqlApiConf}
 import org.apache.spark.sql.streaming.DataStreamReader
@@ -586,9 +587,13 @@ class SparkSession private[sql] (
   @DeveloperApi
   def execute(extension: Array[Byte]): Unit = {
+    val any = ProtoUtils.parseWithRecursionLimit(
+      extension,
+      com.google.protobuf.Any.parser(),
+      recursionLimit = client.configuration.grpcMaxRecursionLimit)
     val command = proto.Command
       .newBuilder()
-      .setExtension(com.google.protobuf.Any.parseFrom(extension))
+      .setExtension(any)
       .build()
     execute(command)
   }
diff --git a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClient.scala b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClient.scala
index d9d51c15a880..1e7b4e6574dd 100644
--- a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClient.scala
+++ b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClient.scala
@@ -566,6 +566,13 @@ object SparkConnectClient {
     def grpcMaxMessageSize: Int = _configuration.grpcMaxMessageSize

+    def grpcMaxRecursionLimit(recursionLimit: Int): Builder = {
+      _configuration = _configuration.copy(grpcMaxRecursionLimit = recursionLimit)
+      this
+    }
+
+    def grpcMaxRecursionLimit: Int = _configuration.grpcMaxRecursionLimit
+
     def option(key: String, value: String): Builder = {
       _configuration = _configuration.copy(metadata = _configuration.metadata + ((key, value)))
       this
@@ -703,7 +710,8 @@ object SparkConnectClient {
       useReattachableExecute: Boolean = true,
       interceptors: List[ClientInterceptor] = List.empty,
       sessionId: Option[String] = None,
-      grpcMaxMessageSize: Int = ConnectCommon.CONNECT_GRPC_MAX_MESSAGE_SIZE) {
+      grpcMaxMessag
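The fix raises the protobuf parser's recursion limit before decoding deeply nested plans, mirroring `CodedInputStream.setRecursionLimit()`. The same pattern in miniature, as a stdlib sketch (a toy nested-list parser, not protobuf): the parser refuses pathological nesting unless the caller explicitly raises the limit.

```python
# Toy illustration of a parser-side recursion limit: reject deep nesting by
# default, and let the caller raise the limit explicitly for trusted input,
# which is the shape of the ProtoUtils.parseWithRecursionLimit fix.
def parse_nested(text, pos=0, depth=0, recursion_limit=100):
    """Parse strings like '[[[]]]' into nested lists, bounding the depth."""
    if depth > recursion_limit:
        raise ValueError("too many levels of nesting; raise recursion_limit")
    assert text[pos] == "["
    node, pos = [], pos + 1
    while text[pos] != "]":
        child, pos = parse_nested(text, pos, depth + 1, recursion_limit)
        node.append(child)
    return node, pos + 1


deep = "[" * 150 + "]" * 150
try:
    parse_nested(deep)  # default limit of 100 is exceeded
except ValueError as e:
    print("default limit:", e)

tree, _ = parse_nested(deep, recursion_limit=1024)  # explicit higher limit
print("parsed with raised limit:", isinstance(tree, list))
```

A low default limit protects the server from maliciously nested payloads; raising it only where deeply nested plans are legitimate (Spark Connect extensions) keeps that protection for everything else.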
Re: [PR] Fix title for spark 3.5.1 PySpark doc [spark-website]
panbingkun commented on PR #514:
URL: https://github.com/apache/spark-website/pull/514#issuecomment-2060057109

   cc @HeartSaVioR @dongjoon-hyun @HyukjinKwon

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Fix title for spark 3.5.1 PySpark doc [spark-website]
panbingkun commented on PR #514: URL: https://github.com/apache/spark-website/pull/514#issuecomment-2060055831 Screenshot: https://github.com/apache/spark-website/assets/15246973/3f38fb82-a438-428f-8af6-e2851926c7b5
[PR] Fix title for spark 3.5.1 PySpark doc [spark-website]
panbingkun opened a new pull request, #514: URL: https://github.com/apache/spark-website/pull/514 This PR aims to fix the `title` for the Spark `3.5.1` PySpark doc.

(spark) branch master updated: [SPARK-47590][SQL] Hive-thriftserver: Migrate logWarn with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f7440f384191 [SPARK-47590][SQL] Hive-thriftserver: Migrate logWarn with variables to structured logging framework f7440f384191 is described below commit f7440f3841918f2cdb4a8e710cfe31d3fc85230c Author: Haejoon Lee AuthorDate: Tue Apr 16 13:56:03 2024 -0700 [SPARK-47590][SQL] Hive-thriftserver: Migrate logWarn with variables to structured logging framework ### What changes were proposed in this pull request? This PR proposes to migrate `logWarning` with variables of Hive-thriftserver module to structured logging framework. ### Why are the changes needed? To improve the existing logging system by migrating into structured logging. ### Does this PR introduce _any_ user-facing change? No API changes, but the SQL catalyst logs will contain MDC(Mapped Diagnostic Context) from now. ### How was this patch tested? Run Scala auto formatting and style check. Also the existing CI should pass. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45923 from itholic/hive-ts-logwarn. 
Lead-authored-by: Haejoon Lee Co-authored-by: Haejoon Lee <44108233+itho...@users.noreply.github.com> Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 1 + .../SparkExecuteStatementOperation.scala | 4 ++- .../sql/hive/thriftserver/SparkSQLCLIDriver.scala | 15 - .../ui/HiveThriftServer2Listener.scala | 36 -- 4 files changed, 38 insertions(+), 18 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index 41289c641424..bfeb733af30a 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -94,6 +94,7 @@ object LogKey extends Enumeration { val FUNCTION_PARAMETER = Value val GROUP_ID = Value val HADOOP_VERSION = Value + val HISTORY_DIR = Value val HIVE_OPERATION_STATE = Value val HIVE_OPERATION_TYPE = Value val HOST = Value diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala index 628925007f7e..f8f58cd422b6 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala @@ -256,7 +256,9 @@ private[hive] class SparkExecuteStatementOperation( val currentState = getStatus().getState() if (currentState.isTerminal) { // This may happen if the execution was cancelled, and then closed from another thread. 
- logWarning(s"Ignore exception in terminal state with $statementId: $e") + logWarning( +log"Ignore exception in terminal state with ${MDC(STATEMENT_ID, statementId)}", e + ) } else { logError(log"Error executing query with ${MDC(STATEMENT_ID, statementId)}, " + log"currentState ${MDC(HIVE_OPERATION_STATE, currentState)}, ", e) diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala index 03d8fd0c8ff2..888c086e9042 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala @@ -41,7 +41,7 @@ import sun.misc.{Signal, SignalHandler} import org.apache.spark.{ErrorMessageFormat, SparkConf, SparkThrowable, SparkThrowableHelper} import org.apache.spark.deploy.SparkHadoopUtil import org.apache.spark.internal.{Logging, MDC} -import org.apache.spark.internal.LogKey.ERROR +import org.apache.spark.internal.LogKey._ import org.apache.spark.sql.AnalysisException import org.apache.spark.sql.catalyst.analysis.FunctionRegistry import org.apache.spark.sql.catalyst.util.SQLKeywordUtils @@ -232,14 +232,14 @@ private[hive] object SparkSQLCLIDriver extends Logging { val historyFile = historyDirectory + File.separator + ".hivehistory" reader.setHistory(new FileHistory(new File(historyFile))) } else { -logWarning("WARNING: Directory for Hive history file: " + historyDirectory + - " does not exist. H
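The pattern being migrated to — a message template plus named MDC (Mapped Diagnostic Context) key/value pairs instead of string interpolation — can be sketched in a few lines of stdlib Python. The class and key names here are illustrative, not Spark's `Logging` API:

```python
import json


class StructuredLogger:
    """Render a human-readable message while keeping MDC fields machine-readable."""

    def __init__(self):
        self.lines = []  # captured here instead of a real logging backend

    def warning(self, template: str, **mdc) -> str:
        line = json.dumps({
            "level": "WARN",
            "msg": template.format(**mdc),  # rendered for humans
            "mdc": mdc,                     # structured, for log pipelines
        }, sort_keys=True)
        self.lines.append(line)
        return line


log = StructuredLogger()
log.warning("Ignore exception in terminal state with {STATEMENT_ID}",
            STATEMENT_ID="stmt-42")
```

Downstream tooling can then filter on `mdc.STATEMENT_ID` directly instead of regex-matching interpolated strings, which is the point of the migration.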
(spark) branch master updated (9a1fc112677f -> 6919febfcc87)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 9a1fc112677f [SPARK-47871][SQL] Oracle: Map TimestampType to TIMESTAMP WITH LOCAL TIME ZONE add 6919febfcc87 [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/internal/LogKey.scala | 36 +- .../org/apache/spark/sql/avro/AvroUtils.scala | 7 +++-- .../execution/ExecuteGrpcResponseSender.scala | 33 +++- .../execution/ExecuteResponseObserver.scala| 19 +++- .../sql/connect/planner/SparkConnectPlanner.scala | 7 +++-- .../planner/StreamingForeachBatchHelper.scala | 20 .../planner/StreamingQueryListenerHelper.scala | 7 +++-- .../sql/connect/service/LoggingInterceptor.scala | 9 -- .../spark/sql/connect/service/SessionHolder.scala | 15 ++--- .../service/SparkConnectExecutionManager.scala | 17 ++ .../sql/connect/service/SparkConnectServer.scala | 7 +++-- .../sql/connect/service/SparkConnectService.scala | 5 +-- .../service/SparkConnectSessionManager.scala | 11 +-- .../service/SparkConnectStreamingQueryCache.scala | 26 +++- .../spark/sql/connect/utils/ErrorUtils.scala | 4 +-- .../sql/kafka010/KafkaBatchPartitionReader.scala | 14 ++--- .../spark/sql/kafka010/KafkaContinuousStream.scala | 4 +-- .../spark/sql/kafka010/KafkaMicroBatchStream.scala | 4 +-- .../sql/kafka010/KafkaOffsetReaderAdmin.scala | 4 +-- .../sql/kafka010/KafkaOffsetReaderConsumer.scala | 4 +-- .../apache/spark/sql/kafka010/KafkaRelation.scala | 7 +++-- .../org/apache/spark/sql/kafka010/KafkaSink.scala | 5 +-- .../apache/spark/sql/kafka010/KafkaSource.scala| 11 --- .../apache/spark/sql/kafka010/KafkaSourceRDD.scala | 6 ++-- .../sql/kafka010/consumer/KafkaDataConsumer.scala | 13 +--- .../kafka010/producer/CachedKafkaProducer.scala| 5 +-- .../apache/spark/sql/kafka010/KafkaTestUtils.scala | 10 +++--- 
.../kafka010/DirectKafkaInputDStream.scala | 9 -- .../streaming/kafka010/KafkaDataConsumer.scala | 18 ++- .../apache/spark/streaming/kafka010/KafkaRDD.scala | 12 +--- .../spark/streaming/kinesis/KinesisReceiver.scala | 7 +++-- .../streaming/kinesis/KinesisRecordProcessor.scala | 12 +--- .../executor/profiler/ExecutorJVMProfiler.scala| 5 +-- .../executor/profiler/ExecutorProfilerPlugin.scala | 6 ++-- .../scala/org/apache/spark/deploy/Client.scala | 6 ++-- .../spark/deploy/yarn/ApplicationMaster.scala | 4 +-- .../scheduler/cluster/YarnSchedulerBackend.scala | 6 ++-- 37 files changed, 257 insertions(+), 138 deletions(-)
(spark) branch master updated: [SPARK-47871][SQL] Oracle: Map TimestampType to TIMESTAMP WITH LOCAL TIME ZONE
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 9a1fc112677f [SPARK-47871][SQL] Oracle: Map TimestampType to TIMESTAMP WITH LOCAL TIME ZONE 9a1fc112677f is described below commit 9a1fc112677f98089d946b3bf4f52b33ab0a5c23 Author: Kent Yao AuthorDate: Tue Apr 16 08:35:51 2024 -0700 [SPARK-47871][SQL] Oracle: Map TimestampType to TIMESTAMP WITH LOCAL TIME ZONE ### What changes were proposed in this pull request? This PR maps TimestampType to TIMESTAMP WITH LOCAL TIME ZONE. ### Why are the changes needed? We currently map both TimestampType and TimestampNTZType to Oracle's TIMESTAMP, which represents a timestamp without time zone. This is ambiguous. ### Does this PR introduce _any_ user-facing change? It does not affect Spark users in a TimestampType read-write-read roundtrip, but might affect other systems' reading. ### How was this patch tested? Existing test with new configuration ```java SPARK-42627: Support ORACLE TIMESTAMP WITH LOCAL TIME ZONE (9 seconds, 536 milliseconds) ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46080 from yaooqinn/SPARK-47871. 
Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .../spark/sql/jdbc/OracleIntegrationSuite.scala| 39 -- docs/sql-migration-guide.md| 1 + .../org/apache/spark/sql/internal/SQLConf.scala| 12 +++ .../org/apache/spark/sql/jdbc/OracleDialect.scala | 5 ++- .../org/apache/spark/sql/jdbc/JDBCSuite.scala | 5 ++- 5 files changed, 43 insertions(+), 19 deletions(-) diff --git a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala index 418b86fb6b23..496498e5455b 100644 --- a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala +++ b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala @@ -547,23 +547,28 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with SharedSpark } test("SPARK-42627: Support ORACLE TIMESTAMP WITH LOCAL TIME ZONE") { -val reader = spark.read.format("jdbc") - .option("url", jdbcUrl) - .option("dbtable", "test_ltz") -val df = reader.load() -val row1 = df.collect().head.getTimestamp(0) -assert(df.count() === 1) -assert(row1 === Timestamp.valueOf("2018-11-17 13:33:33")) - -df.write.format("jdbc") - .option("url", jdbcUrl) - .option("dbtable", "test_ltz") - .mode("append") - .save() - -val df2 = reader.load() -assert(df.count() === 2) -assert(df2.collect().forall(_.getTimestamp(0) === row1)) +Seq("true", "false").foreach { flag => + withSQLConf((SQLConf.LEGACY_ORACLE_TIMESTAMP_MAPPING_ENABLED.key, flag)) { +val df = spark.read.format("jdbc") + .option("url", jdbcUrl) + .option("dbtable", "test_ltz") + .load() +val row1 = df.collect().head.getTimestamp(0) +assert(df.count() === 1) +assert(row1 === Timestamp.valueOf("2018-11-17 13:33:33")) + +df.write.format("jdbc") + .option("url", jdbcUrl) + .option("dbtable", "test_ltz" + flag) + .save() + +val df2 = spark.read.format("jdbc") + 
.option("url", jdbcUrl) + .option("dbtable", "test_ltz" + flag) + .load() +checkAnswer(df2, Row(row1)) + } +} } test("SPARK-47761: Reading ANSI INTERVAL Types") { diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md index c7bd0b55840c..3004008b8ec7 100644 --- a/docs/sql-migration-guide.md +++ b/docs/sql-migration-guide.md @@ -45,6 +45,7 @@ license: | - Since Spark 4.0, MySQL JDBC datasource will read FLOAT as FloatType, while in Spark 3.5 and previous, it was read as DoubleType. To restore the previous behavior, you can cast the column to the old type. - Since Spark 4.0, MySQL JDBC datasource will read BIT(n > 1) as BinaryType, while in Spark 3.5 and previous, read as LongType. To restore the previous behavior, set `spark.sql.legacy.mysql.bitArrayMapping.enabled` to `true`. - Since Spark 4.0, MySQL JDBC datasource will write ShortType as SMALLINT, while in Spark 3.5 and previous, write as INTEGER. To restore the previous behavior, you can replace the column with IntegerType whenever before writing. +- Since Spark 4.0, Oracle JDBC datasource will write TimestampType as TIMESTAMP WITH LOCAL TIME ZONE, while in Spark 3.5 and previous, write as TIMESTAMP. To restore the previous behavior,
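The dialect change is easiest to see as a small write-side type-mapping table gated by the legacy flag. This is a hedged Python model of the behavior described above, not the `OracleDialect` code itself; the flag parameter stands in for `SQLConf.LEGACY_ORACLE_TIMESTAMP_MAPPING_ENABLED`:

```python
def oracle_write_type(catalyst_type: str, legacy_timestamp_mapping: bool = False) -> str:
    # Models the Spark 4.0 behavior: timezone-aware TimestampType now maps
    # to TIMESTAMP WITH LOCAL TIME ZONE, while TimestampNTZType keeps the
    # plain TIMESTAMP, removing the ambiguity of mapping both to TIMESTAMP.
    if catalyst_type == "TimestampType":
        return "TIMESTAMP" if legacy_timestamp_mapping else "TIMESTAMP WITH LOCAL TIME ZONE"
    if catalyst_type == "TimestampNTZType":
        return "TIMESTAMP"
    raise KeyError(f"unmapped type: {catalyst_type}")
```

With the legacy flag on, both timestamp types collapse back to `TIMESTAMP`, which restores the pre-4.0 (ambiguous) mapping.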
(spark) branch master updated: [SPARK-47417][SQL] Collation support: Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, FormatNumber, Sentences
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ee2673f2e948 [SPARK-47417][SQL] Collation support: Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, FormatNumber, Sentences ee2673f2e948 is described below commit ee2673f2e94811022f6a3d9a03ad119f7a8e5d65 Author: Nikola Mandic AuthorDate: Tue Apr 16 23:09:23 2024 +0800 [SPARK-47417][SQL] Collation support: Ascii, Chr, Base64, UnBase64, Decode, StringDecode, Encode, ToBinary, FormatNumber, Sentences ### What changes were proposed in this pull request? `Chr` and `Base64` are skipped as they don't accept input string types and don't need to be updated. Other functions are updated to accept collated strings as inputs. ### Why are the changes needed? Add collations support in string functions. ### Does this PR introduce _any_ user-facing change? Yes, it changes behavior of string functions when string parameters have collation. ### How was this patch tested? Add checks to `CollationStringExpressionsSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45933 from nikolamand-db/SPARK-47417-47418-47420. 
Authored-by: Nikola Mandic Signed-off-by: Wenchen Fan --- .../catalyst/expressions/stringExpressions.scala | 40 +-- .../expressions/StringExpressionsSuite.scala | 3 +- .../sql/CollationStringExpressionsSuite.scala | 77 +- 3 files changed, 99 insertions(+), 21 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala index 4fe57b4f8f02..b3029302c03d 100755 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala @@ -2352,7 +2352,7 @@ case class Ascii(child: Expression) extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant { override def dataType: DataType = IntegerType - override def inputTypes: Seq[DataType] = Seq(StringType) + override def inputTypes: Seq[AbstractDataType] = Seq(StringTypeAnyCollation) protected override def nullSafeEval(string: Any): Any = { // only pick the first character to reduce the `toString` cost @@ -2398,7 +2398,7 @@ case class Ascii(child: Expression) case class Chr(child: Expression) extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant { - override def dataType: DataType = StringType + override def dataType: DataType = SQLConf.get.defaultStringType override def inputTypes: Seq[DataType] = Seq(LongType) protected override def nullSafeEval(lon: Any): Any = { @@ -2447,7 +2447,7 @@ case class Chr(child: Expression) case class Base64(child: Expression) extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant { - override def dataType: DataType = StringType + override def dataType: DataType = SQLConf.get.defaultStringType override def inputTypes: Seq[DataType] = Seq(BinaryType) protected override def nullSafeEval(bytes: Any): Any = { @@ -2480,7 +2480,7 @@ case class UnBase64(child: Expression, 
failOnError: Boolean = false) extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant { override def dataType: DataType = BinaryType - override def inputTypes: Seq[DataType] = Seq(StringType) + override def inputTypes: Seq[AbstractDataType] = Seq(StringTypeAnyCollation) def this(expr: Expression) = this(expr, false) @@ -2672,8 +2672,8 @@ case class StringDecode(bin: Expression, charset: Expression, legacyCharsets: Bo override def left: Expression = bin override def right: Expression = charset - override def dataType: DataType = StringType - override def inputTypes: Seq[DataType] = Seq(BinaryType, StringType) + override def dataType: DataType = SQLConf.get.defaultStringType + override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType, StringTypeAnyCollation) private val supportedCharsets = Set( "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16") @@ -2750,7 +2750,8 @@ case class Encode(str: Expression, charset: Expression, legacyCharsets: Boolean) override def left: Expression = str override def right: Expression = charset override def dataType: DataType = BinaryType - override def inputTypes: Seq[DataType] = Seq(StringType, StringType) + override def inputTypes: Seq[AbstractDataType] = +Seq(StringTypeAnyCollation, StringTypeAnyCollation) private val supportedCharsets = Set( "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-
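The recurring edit in this diff — `Seq(StringType)` becoming `Seq(StringTypeAnyCollation)` — widens each function's accepted input from the single default-collated string type to a string of any collation. A toy Python model of that type check (class and function names are hypothetical, not Catalyst's):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringType:
    collation: str = "UTF8_BINARY"


class StringTypeAnyCollation:
    """Abstract input type: matches a string of any collation."""

    @staticmethod
    def accepts(dt) -> bool:
        return isinstance(dt, StringType)


def type_check(arg_type, expected) -> bool:
    # Before the change: exact match against StringType() only, so a
    # UNICODE_CI-collated input failed. After: the abstract expectation
    # accepts any collation.
    if hasattr(expected, "accepts"):
        return expected.accepts(arg_type)
    return arg_type == expected
```

This is why functions like `Ascii` and `StringDecode` above now take collated inputs, while their output side switches from a hard-coded `StringType` to the session's `defaultStringType`.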
(spark-kubernetes-operator) branch main updated: [SPARK-47745] Add License to Spark Operator repository
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git The following commit(s) were added to refs/heads/main by this push: new 7a3a7e8 [SPARK-47745] Add License to Spark Operator repository 7a3a7e8 is described below commit 7a3a7e882af2c8e8d463ebed71329212133d229c Author: zhou-jiang AuthorDate: Tue Apr 16 08:08:26 2024 -0700 [SPARK-47745] Add License to Spark Operator repository ### What changes were proposed in this pull request? This PR aims to add ASF license file. ### Why are the changes needed? To receive a code contribution. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #3 from jiangzho/license. Authored-by: zhou-jiang Signed-off-by: Dongjoon Hyun --- LICENSE | 201 1 file changed, 201 insertions(+) diff --git a/LICENSE b/LICENSE new file mode 100644 index 000..261eeb9 --- /dev/null +++ b/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 +http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. 
For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. 
For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants t
(spark-kubernetes-operator) branch main updated: Update GITHUB_API_BASE
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git The following commit(s) were added to refs/heads/main by this push: new a8eb690 Update GITHUB_API_BASE a8eb690 is described below commit a8eb690a7a85fd2b580e3756fad8d2bcf306e12c Author: Dongjoon Hyun AuthorDate: Tue Apr 16 08:06:10 2024 -0700 Update GITHUB_API_BASE --- dev/merge_spark_pr.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dev/merge_spark_pr.py b/dev/merge_spark_pr.py index 4647383..24e956d 100755 --- a/dev/merge_spark_pr.py +++ b/dev/merge_spark_pr.py @@ -65,7 +65,7 @@ GITHUB_OAUTH_KEY = os.environ.get("GITHUB_OAUTH_KEY") GITHUB_BASE = "https://github.com/apache/spark-kubernetes-operator/pull" -GITHUB_API_BASE = "https://api.github.com/repos/spark-kubernetes-operator" +GITHUB_API_BASE = "https://api.github.com/repos/apache/spark-kubernetes-operator" JIRA_BASE = "https://issues.apache.org/jira/browse" JIRA_API_BASE = "https://issues.apache.org/jira" # Prefix added to temporary branches
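The bug fixed here is a missing `apache/` owner segment: GitHub's REST routes have the shape `/repos/{owner}/{repo}/...`, so the old base produced 404s for every API call the merge script made. A small sketch of the corrected URL building (the helper name is illustrative, not part of `merge_spark_pr.py`):

```python
GITHUB_API_BASE = "https://api.github.com/repos/apache/spark-kubernetes-operator"


def pull_request_url(pr_number: int) -> str:
    # GitHub REST v3: GET /repos/{owner}/{repo}/pulls/{pull_number}
    return f"{GITHUB_API_BASE}/pulls/{pr_number}"
```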
(spark) branch master updated (4dad2170b05c -> 9b1b2b30d591)
This is an automated email from the ASF dual-hosted git repository. hvanhovell pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 4dad2170b05c [SPARK-47356][SQL] Add support for ConcatWs & Elt (all collations) add 9b1b2b30d591 [SPARK-47081][CONNECT][FOLLOW] Unflake Progress Execution No new revisions were added by this update. Summary of changes: .../sql/connect/execution/ExecuteGrpcResponseSender.scala | 2 -- .../sql/connect/execution/ExecuteResponseObserver.scala | 10 ++ .../spark/sql/connect/execution/ExecuteThreadRunner.scala | 5 +++-- python/pyspark/sql/connect/client/core.py | 10 +++--- python/pyspark/sql/tests/connect/shell/test_progress.py | 13 + 5 files changed, 29 insertions(+), 11 deletions(-)
(spark) branch master updated: [SPARK-47356][SQL] Add support for ConcatWs & Elt (all collations)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4dad2170b05c [SPARK-47356][SQL] Add support for ConcatWs & Elt (all collations) 4dad2170b05c is described below commit 4dad2170b05c04faf1da550ab3fb8c52a61b8be7 Author: Mihailo Milosevic AuthorDate: Tue Apr 16 21:21:24 2024 +0800 [SPARK-47356][SQL] Add support for ConcatWs & Elt (all collations) ### What changes were proposed in this pull request? Addition of support for the ConcatWs and Elt expressions. ### Why are the changes needed? We need to enable these functions to support collations in order to cover all functions in scope. ### Does this PR introduce _any_ user-facing change? Yes, both expressions will no longer return an error when called with collated strings. ### How was this patch tested? Addition of tests to `CollationStringExpressionsSuite` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46061 from mihailom-db/SPARK-47356. 
Authored-by: Mihailo Milosevic Signed-off-by: Wenchen Fan --- .../sql/catalyst/analysis/CollationTypeCasts.scala | 5 ++- .../catalyst/expressions/stringExpressions.scala | 25 ++-- .../sql/CollationStringExpressionsSuite.scala | 46 -- 3 files changed, 51 insertions(+), 25 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala index 1a14b4227de8..795e8a696b01 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala @@ -22,7 +22,7 @@ import javax.annotation.Nullable import scala.annotation.tailrec import org.apache.spark.sql.catalyst.analysis.TypeCoercion.{hasStringType, haveSameType} -import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, CaseWhen, Cast, Coalesce, Collate, Concat, ConcatWs, CreateArray, Expression, Greatest, If, In, InSubquery, Least} +import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, CaseWhen, Cast, Coalesce, Collate, Concat, ConcatWs, CreateArray, Elt, Expression, Greatest, If, In, InSubquery, Least} import org.apache.spark.sql.errors.QueryCompilationErrors import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.{ArrayType, DataType, StringType} @@ -45,6 +45,9 @@ object CollationTypeCasts extends TypeCoercionRule { caseWhenExpr.elseValue.map(e => castStringType(e, outputStringType).getOrElse(e)) CaseWhen(newBranches, newElseValue) +case eltExpr: Elt => + eltExpr.withNewChildren(eltExpr.children.head +: collateToSingleType(eltExpr.children.tail)) + case otherExpr @ ( _: In | _: InSubquery | _: CreateArray | _: ArrayJoin | _: Concat | _: Greatest | _: Least | _: Coalesce | _: BinaryExpression | _: ConcatWs) => diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala index 34e8f3f40859..4fe57b4f8f02 100755 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala @@ -37,7 +37,7 @@ import org.apache.spark.sql.catalyst.trees.TreePattern.{TreePattern, UPPER_OR_LO import org.apache.spark.sql.catalyst.util.{ArrayData, CollationSupport, GenericArrayData, TypeUtils} import org.apache.spark.sql.errors.{QueryCompilationErrors, QueryExecutionErrors} import org.apache.spark.sql.internal.SQLConf -import org.apache.spark.sql.internal.types.StringTypeAnyCollation +import org.apache.spark.sql.internal.types.{AbstractArrayType, StringTypeAnyCollation} import org.apache.spark.sql.types._ import org.apache.spark.unsafe.UTF8StringBuilder import org.apache.spark.unsafe.array.ByteArrayMethods @@ -79,11 +79,12 @@ case class ConcatWs(children: Seq[Expression]) /** The 1st child (separator) is str, and rest are either str or array of str. */ override def inputTypes: Seq[AbstractDataType] = { -val arrayOrStr = TypeCollection(ArrayType(StringType), StringType) -StringType +: Seq.fill(children.size - 1)(arrayOrStr) +val arrayOrStr = + TypeCollection(AbstractArrayType(StringTypeAnyCollation), StringTypeAnyCollation) +StringTypeAnyCollation +: Seq.fill(children.size - 1)(arrayOrStr) } - override def dataType: DataType = StringType + override def dataType: DataType = children.head.dataType override def nullabl
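`collateToSingleType` — applied in this diff to every `Elt` child after the index argument — boils down to: all string inputs must agree on one collation, or the cast rule fails. A hedged stdlib sketch of that resolution step (the error name mirrors Spark's `COLLATION_MISMATCH`; everything else is illustrative, and the real rule additionally distinguishes explicit `COLLATE` clauses from implicit collations, which is omitted here):

```python
def collate_to_single_type(collations, default="UTF8_BINARY"):
    """Return the single collation shared by all string inputs, or raise."""
    distinct = set(collations)
    if len(distinct) > 1:
        # Two different collations with equal precedence cannot be
        # silently merged: comparisons would be ambiguous.
        raise ValueError(f"COLLATION_MISMATCH: {sorted(distinct)}")
    return distinct.pop() if distinct else default
```

This also explains why `ConcatWs`/`Elt` now report `children.head.dataType` as their output type: once inputs are collated to a single type, the result carries that collation through.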
(spark) branch master updated: [SPARK-47739][SQL] Register logical avro type
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new fa2e9c7275aa [SPARK-47739][SQL] Register logical avro type fa2e9c7275aa is described below commit fa2e9c7275aa1c09652d0df0992565c32974b2b9 Author: milastdbx AuthorDate: Tue Apr 16 03:38:19 2024 -0700 [SPARK-47739][SQL] Register logical avro type ### What changes were proposed in this pull request? In this pull request I propose that we register logical Avro types when we initialize `AvroUtils` and `AvroFileFormat`; otherwise the very first schema discovery after Spark starts may return a wrong result. Screenshot: https://github.com/apache/spark/assets/150366084/3eaba6e3-34ec-4ca9-ae89-d0259ce942ba Example: ```scala val new_schema = """ | { | "type": "record", | "name": "Entry", | "fields": [ | { | "name": "rate", | "type": [ | "null", | { | "type": "long", | "logicalType": "custom-decimal", | "precision": 38, | "scale": 9 | } | ], | "default": null | } | ] | }""".stripMargin spark.read.format("avro").option("avroSchema", new_schema).load().printSchema // maps to long - WRONG spark.read.format("avro").option("avroSchema", new_schema).load().printSchema // maps to Decimal - CORRECT ``` ### Why are the changes needed? To fix an issue with resolving Avro schemas upon Spark startup. ### Does this PR introduce _any_ user-facing change? No, it's a bugfix. ### How was this patch tested? Unit tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #45895 from milastdbx/dev/milast/fixAvroLogicalTypeRegistration. 
Lead-authored-by: milastdbx Co-authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/sql/avro/AvroFileFormat.scala | 21 -- .../spark/sql/avro/AvroLogicalTypeInitSuite.scala | 76 ++ 2 files changed, 91 insertions(+), 6 deletions(-) diff --git a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala index 2792edaea284..372f24b54f5c 100755 --- a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala +++ b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala @@ -43,6 +43,8 @@ import org.apache.spark.util.SerializableConfiguration private[sql] class AvroFileFormat extends FileFormat with DataSourceRegister with Logging with Serializable { + AvroFileFormat.registerCustomAvroTypes() + override def equals(other: Any): Boolean = other match { case _: AvroFileFormat => true case _ => false @@ -173,10 +175,17 @@ private[sql] class AvroFileFormat extends FileFormat private[avro] object AvroFileFormat { val IgnoreFilesWithoutExtensionProperty = "avro.mapred.ignore.inputs.without.extension" - // Register the customized decimal type backed by long. - LogicalTypes.register(CustomDecimal.TYPE_NAME, new LogicalTypes.LogicalTypeFactory { -override def fromSchema(schema: Schema): LogicalType = { - new CustomDecimal(schema) -} - }) + /** + * Register Spark defined custom Avro types. + */ + def registerCustomAvroTypes(): Unit = { +// Register the customized decimal type backed by long. 
+    LogicalTypes.register(CustomDecimal.TYPE_NAME, new LogicalTypes.LogicalTypeFactory {
+      override def fromSchema(schema: Schema): LogicalType = {
+        new CustomDecimal(schema)
+      }
+    })
+  }
+
+  registerCustomAvroTypes()
 }

diff --git a/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroLogicalTypeInitSuite.scala b/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroLogicalTypeInitSuite.scala
new file mode 100644
index ..126440ed69b8
--- /dev/null
+++ b/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroLogicalTypeInitSuite.scala
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed o
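The bug fixed above is a lazy-registration ordering problem: the first schema resolution runs before the custom logical type is registered, so Avro falls back to the underlying primitive. The sketch below models that ordering in plain Python — `TypeRegistry`, `read_schema`, and the `"custom-decimal"` factory are illustrative stand-ins, not Spark or Avro API:

```python
# Minimal model of the lazy- vs eager-registration problem fixed above.
# All names here are hypothetical; Spark's real registry is Avro's LogicalTypes.

class TypeRegistry:
    """Maps a logicalType name to a factory; unknown types fall back to the primitive."""
    def __init__(self):
        self._factories = {}

    def register(self, name, factory):
        self._factories[name] = factory

    def resolve(self, primitive, logical_name):
        factory = self._factories.get(logical_name)
        # Unregistered logical type: silently fall back to the raw primitive.
        return factory(primitive) if factory else primitive

def read_schema(registry):
    # Resolve {"type": "long", "logicalType": "custom-decimal"}.
    return registry.resolve("long", "custom-decimal")

# Lazy registration: the first read resolves BEFORE the type is registered.
lazy = TypeRegistry()
first = read_schema(lazy)                      # "long" - WRONG, like the first printSchema
lazy.register("custom-decimal", lambda p: "decimal(38,9)")
second = read_schema(lazy)                     # "decimal(38,9)" - CORRECT

# Eager registration at initialization (the fix): register before any read.
eager = TypeRegistry()
eager.register("custom-decimal", lambda p: "decimal(38,9)")
print(read_schema(eager))                      # decimal(38,9) even on the very first read
```

Registering in the `AvroFileFormat` class constructor (as the patch does) plays the role of the eager registry here: the registration is guaranteed to run before any schema is resolved.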
(spark) branch master updated: [SPARK-46574][BUILD] Upgrade maven plugin to latest version
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new b7a729bfd19c [SPARK-46574][BUILD] Upgrade maven plugin to latest version
b7a729bfd19c is described below

commit b7a729bfd19cfa7a06d208f3899d329e414d5598
Author: panbingkun
AuthorDate: Tue Apr 16 03:30:12 2024 -0700

    [SPARK-46574][BUILD] Upgrade maven plugin to latest version

    ### What changes were proposed in this pull request?

    ### Why are the changes needed?

    - `exec-maven-plugin` from `3.1.0` to `3.2.0`
      https://github.com/mojohaus/exec-maven-plugin/releases/tag/3.2.0
      https://github.com/mojohaus/exec-maven-plugin/releases/tag/3.1.1
      Bug Fixes:
      1. Fix https://github.com/mojohaus/exec-maven-plugin/issues/158 - Fix non-ASCII character handling (https://github.com/mojohaus/exec-maven-plugin/pull/372)
      2. [https://github.com/mojohaus/exec-maven-plugin/issues/323] exec arguments missing (https://github.com/mojohaus/exec-maven-plugin/pull/324)

    - `build-helper-maven-plugin` from `3.4.0` to `3.5.0`
      https://github.com/mojohaus/build-helper-maven-plugin/releases/tag/3.5.0

    - `maven-compiler-plugin` from `3.12.1` to `3.13.0`
      https://github.com/apache/maven-compiler-plugin/releases/tag/maven-compiler-plugin-3.13.0

    - `maven-jar-plugin` from `3.3.0` to `3.4.0`
      https://github.com/apache/maven-jar-plugin/releases/tag/maven-jar-plugin-3.4.0
      [[MJAR-62]](https://issues.apache.org/jira/browse/MJAR-62) - Set Build-Jdk according to used toolchain (https://github.com/apache/maven-jar-plugin/pull/73)

    - `maven-source-plugin` from `3.3.0` to `3.3.1`
      https://github.com/apache/maven-source-plugin/releases/tag/maven-source-plugin-3.3.1

    - `maven-assembly-plugin` from `3.6.0` to `3.7.1`
      https://github.com/apache/maven-assembly-plugin/releases/tag/maven-assembly-plugin-3.7.1
      https://github.com/apache/maven-assembly-plugin/releases/tag/maven-assembly-plugin-3.7.0
      Bug Fixes:
      1. [[MASSEMBLY-967](https://issues.apache.org/jira/browse/MASSEMBLY-967)] - maven-assembly-plugin doesn't add target/class artifacts in generated jarfat but META-INF/MANIFEST.MF seems to be correct
      2. [[MASSEMBLY-994](https://issues.apache.org/jira/browse/MASSEMBLY-994)] - Items from unpacked dependency are not refreshed
      3. [[MASSEMBLY-998](https://issues.apache.org/jira/browse/MASSEMBLY-998)] - Transitive dependencies are not properly excluded as of 3.1.1
      4. [[MASSEMBLY-1008](https://issues.apache.org/jira/browse/MASSEMBLY-1008)] - Assembly plugin handles scopes wrongly
      5. [[MASSEMBLY-1020](https://issues.apache.org/jira/browse/MASSEMBLY-1020)] - Cannot invoke "java.io.File.isFile()" because "this.inputFile" is null
      6. [[MASSEMBLY-1021](https://issues.apache.org/jira/browse/MASSEMBLY-1021)] - Nullpointer in assembly:single when upgrading to 3.7.0
      7. [[MASSEMBLY-1022](https://issues.apache.org/jira/browse/MASSEMBLY-1022)] - Unresolved artifacts should be not processed

    - `cyclonedx-maven-plugin` from `2.7.9` to `2.8.0`
      https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.8.0
      https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.11
      https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.10
      Bug Fixes:
      1. check if configured schemaVersion is supported (https://github.com/CycloneDX/cyclonedx-maven-plugin/pull/479)
      2. ignore bomGenerator.generate() call (https://github.com/CycloneDX/cyclonedx-maven-plugin/pull/376)

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Pass GA.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #46043 from panbingkun/update_maven_plugins.
Authored-by: panbingkun Signed-off-by: Dongjoon Hyun --- pom.xml | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/pom.xml b/pom.xml index 99b238aac1dc..bf8d4f1b417d 100644 --- a/pom.xml +++ b/pom.xml @@ -116,7 +116,7 @@ 17 ${java.version} 3.9.6 -3.1.0 +3.2.0 spark 9.6 2.0.13 @@ -2994,7 +2994,7 @@ org.codehaus.mojo build-helper-maven-plugin - 3.4.0 + 3.5.0 module-timestamp-property @@ -3108,7 +3108,7 @@ org.apache.maven.plugins maven-compiler-plugin - 3.12.1 + 3.13.0 ${java.version} true @@ -3234,7 +3234,7 @@ org.apache.maven.plugins maven-jar-plugin - 3.3.0 + 3.4.0 org.apache.maven.plugins @@ -3244,7 +324
(spark) branch master updated: [SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new a1fc6d57b27d [SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests
a1fc6d57b27d is described below

commit a1fc6d57b27d24b832b2f2580e6acd64c4488c62
Author: Xi Lyu
AuthorDate: Tue Apr 16 16:27:32 2024 +0900

    [SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests

    ### What changes were proposed in this pull request?

    While building a DataFrame step by step, each step generates a new DataFrame
    whose schema is empty and lazily computed on access. If a user's code
    frequently accesses the schema of these new DataFrames through methods such
    as `df.columns`, this results in a large number of Analyze requests to the
    server. Each request reanalyzes the entire plan, leading to poor performance,
    especially when constructing highly complex plans. By introducing a plan
    cache in SparkConnectPlanner, we aim to reduce the overhead of this repeated
    analysis: significant computation is saved whenever the resolved logical plan
    of a subtree can be cached.

    A minimal example of the problem:

    ```
    import pyspark.sql.functions as F

    df = spark.range(10)
    for i in range(200):
        if str(i) not in df.columns:  # <-- The df.columns call causes a new Analyze request in every iteration
            df = df.withColumn(str(i), F.col("id") + i)
    df.show()
    ```

    With this patch, the performance of the above code improved from ~110s to ~5s.

    ### Why are the changes needed?

    The performance improvement is huge in the above cases.

    ### Does this PR introduce _any_ user-facing change?

    Yes, a static conf `spark.connect.session.planCache.maxSize` and a dynamic conf `spark.connect.session.planCache.enabled` are added.
    * `spark.connect.session.planCache.maxSize`: Sets the maximum number of cached resolved logical plans in a Spark Connect session. If set to a value less than or equal to zero, the plan cache is disabled.
    * `spark.connect.session.planCache.enabled`: When true, the cache of resolved logical plans is enabled if `spark.connect.session.planCache.maxSize` is greater than zero. When false, the cache is disabled even if `spark.connect.session.planCache.maxSize` is greater than zero. The caching is best-effort and not guaranteed.

    ### How was this patch tested?

    Some new tests are added in SparkConnectSessionHolderSuite.scala.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #46012 from xi-db/SPARK-47818-plan-cache.

Lead-authored-by: Xi Lyu
Co-authored-by: Xi Lyu <159039256+xi...@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon
---
 .../apache/spark/sql/connect/config/Connect.scala  |  18 ++
 .../sql/connect/planner/SparkConnectPlanner.scala  | 201 -
 .../spark/sql/connect/service/SessionHolder.scala  |  79 +++-
 .../service/SparkConnectAnalyzeHandler.scala       |  26 +--
 .../service/SparkConnectSessionHolderSuite.scala   | 125 -
 5 files changed, 345 insertions(+), 104 deletions(-)

diff --git a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/config/Connect.scala b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/config/Connect.scala
index 6ba100af1bb9..e94e86587393 100644
--- a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/config/Connect.scala
+++ b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/config/Connect.scala
@@ -273,4 +273,22 @@ object Connect {
       .version("4.0.0")
       .timeConf(TimeUnit.MILLISECONDS)
       .createWithDefaultString("2s")
+
+  val CONNECT_SESSION_PLAN_CACHE_SIZE =
+    buildStaticConf("spark.connect.session.planCache.maxSize")
+      .doc("Sets the maximum number of cached resolved logical plans in Spark Connect Session."
+        +" If set to a value less or equal than zero will disable the plan cache.")
+      .version("4.0.0")
+      .intConf
+      .createWithDefault(5)
+
+  val CONNECT_SESSION_PLAN_CACHE_ENABLED =
+    buildConf("spark.connect.session.planCache.enabled")
+      .doc("When true, the cache of resolved logical plans is enabled if" +
+        s" '${CONNECT_SESSION_PLAN_CACHE_SIZE.key}' is greater than zero." +
+        s" When false, the cache is disabled even if '${CONNECT_SESSION_PLAN_CACHE_SIZE.key}' is" +
+        " greater than zero. The caching is best-effort and not guaranteed.")
+      .version("4.0.0")
+      .booleanConf
+      .createWithDefault(true)
 }

diff --git a/connector/connect/server/src/main/scala/o
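The cache described above behaves like a small bounded LRU map from a plan to its resolved form, consulted before running full analysis. The following Python sketch models that behavior under stated assumptions — `PlanCache`, `analyze`, and the call counter are hypothetical stand-ins for illustration, not the actual SessionHolder code:

```python
from collections import OrderedDict

class PlanCache:
    """Best-effort LRU cache of resolved plans. A max_size <= 0 disables caching,
    mirroring the semantics of spark.connect.session.planCache.maxSize above."""

    def __init__(self, max_size=5, enabled=True):
        self.max_size = max_size
        self.enabled = enabled and max_size > 0
        self._cache = OrderedDict()

    def get_or_analyze(self, plan, analyze_fn):
        if not self.enabled:
            return analyze_fn(plan)
        if plan in self._cache:
            self._cache.move_to_end(plan)      # refresh LRU order on a hit
            return self._cache[plan]
        resolved = analyze_fn(plan)
        self._cache[plan] = resolved
        if len(self._cache) > self.max_size:
            self._cache.popitem(last=False)    # evict the least recently used entry
        return resolved

analyze_calls = 0

def analyze(plan):
    # Stand-in for full logical-plan analysis; counts how often it actually runs.
    global analyze_calls
    analyze_calls += 1
    return f"resolved({plan})"

cache = PlanCache(max_size=2)
for _ in range(200):                           # repeated df.columns-style schema lookups
    cache.get_or_analyze("Range(10)", analyze)
print(analyze_calls)                           # 1 — only the first access pays for analysis
```

This mirrors why the repro in the commit message drops from ~110s to ~5s: every loop iteration issues the same schema lookup, and with the cache only the first one triggers a full analysis.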