(spark) branch branch-3.5 updated: [SPARK-47847][CORE] Deprecate `spark.network.remoteReadNioBufferConversion`

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new c048653435f9 [SPARK-47847][CORE] Deprecate 
`spark.network.remoteReadNioBufferConversion`
c048653435f9 is described below

commit c048653435f9b7c832f79d38a504a145a17654c0
Author: Cheng Pan 
AuthorDate: Thu May 9 22:55:07 2024 -0700

[SPARK-47847][CORE] Deprecate `spark.network.remoteReadNioBufferConversion`

### What changes were proposed in this pull request?

`spark.network.remoteReadNioBufferConversion` was introduced in 
https://github.com/apache/spark/commit/2c82745686f4456c4d5c84040a431dcb5b6cb60b
to allow disabling 
[SPARK-24307](https://issues.apache.org/jira/browse/SPARK-24307) for safety. 
During the whole Spark 3 period there have been no negative reports, which shows 
that [SPARK-24307](https://issues.apache.org/jira/browse/SPARK-24307) is solid 
enough, so I propose to mark it deprecated in 3.5.2 and remove it in 4.1.0 or later.
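
For illustration only (not part of this patch), a minimal PySpark sketch of how 
the deprecation surfaces to users; everything except the config key is assumed:

```python
from pyspark import SparkConf, SparkContext

# Setting the now-deprecated key still works; the JVM-side SparkConf is expected
# to log a deprecation warning once the key is listed in DeprecatedConfig (see diff).
conf = (
    SparkConf()
    .setMaster("local[1]")
    .setAppName("deprecation-demo")
    .set("spark.network.remoteReadNioBufferConversion", "true")
)
sc = SparkContext(conf=conf)  # the warning should appear in the driver log
sc.stop()
```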

### Why are the changes needed?

Code clean up

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46047 from pan3793/SPARK-47847.

Authored-by: Cheng Pan 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 33cac4436e593c9c501c5ff0eedf923d3a21899c)
Signed-off-by: Dongjoon Hyun 
---
 core/src/main/scala/org/apache/spark/SparkConf.scala | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/core/src/main/scala/org/apache/spark/SparkConf.scala 
b/core/src/main/scala/org/apache/spark/SparkConf.scala
index 813a14acd19e..f49e9e357c84 100644
--- a/core/src/main/scala/org/apache/spark/SparkConf.scala
+++ b/core/src/main/scala/org/apache/spark/SparkConf.scala
@@ -638,7 +638,9 @@ private[spark] object SparkConf extends Logging {
   DeprecatedConfig("spark.blacklist.killBlacklistedExecutors", "3.1.0",
 "Please use spark.excludeOnFailure.killExcludedExecutors"),
   
DeprecatedConfig("spark.yarn.blacklist.executor.launch.blacklisting.enabled", 
"3.1.0",
-"Please use spark.yarn.executor.launch.excludeOnFailure.enabled")
+"Please use spark.yarn.executor.launch.excludeOnFailure.enabled"),
+  DeprecatedConfig("spark.network.remoteReadNioBufferConversion", "3.5.2",
+"Please open a JIRA ticket to report it if you need to use this 
configuration.")
 )
 
 Map(configs.map { cfg => (cfg.key -> cfg) } : _*)





(spark) branch master updated (8ccc8b92be50 -> 33cac4436e59)

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 8ccc8b92be50 [SPARK-48201][DOCS][PYTHON] Make some corrections in the 
docstring of pyspark DataStreamReader methods
 add 33cac4436e59 [SPARK-47847][CORE] Deprecate 
`spark.network.remoteReadNioBufferConversion`

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/SparkConf.scala | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)





(spark) branch master updated: [SPARK-48201][DOCS][PYTHON] Make some corrections in the docstring of pyspark DataStreamReader methods

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8ccc8b92be50 [SPARK-48201][DOCS][PYTHON] Make some corrections in the 
docstring of pyspark DataStreamReader methods
8ccc8b92be50 is described below

commit 8ccc8b92be50b1d5ef932873403e62e28c478781
Author: Chloe He 
AuthorDate: Thu May 9 22:07:04 2024 -0700

[SPARK-48201][DOCS][PYTHON] Make some corrections in the docstring of 
pyspark DataStreamReader methods

### What changes were proposed in this pull request?

The docstrings of the pyspark `DataStreamReader` methods `csv()` and 
`text()` say that the `path` parameter can be a list, but in fact passing a list 
raises an error.
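
A small sketch of the corrected contract (illustrative only; the directory path 
below is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DataStreamReader.text() and csv() accept a single string path.
stream_df = spark.readStream.text("/tmp/streaming-input")  # placeholder path

# Passing a list, as the old docstring suggested, raises an error instead:
# spark.readStream.text(["/tmp/a", "/tmp/b"])  # not supported
```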

### Why are the changes needed?

Documentation is wrong.

### Does this PR introduce _any_ user-facing change?

Yes. Fixes documentation.

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46416 from chloeh13q/fix/streamread-docstring.

Authored-by: Chloe He 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/sql/streaming/readwriter.py | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/python/pyspark/sql/streaming/readwriter.py 
b/python/pyspark/sql/streaming/readwriter.py
index c2b75dd8f167..b202a499e8b0 100644
--- a/python/pyspark/sql/streaming/readwriter.py
+++ b/python/pyspark/sql/streaming/readwriter.py
@@ -553,8 +553,8 @@ class DataStreamReader(OptionUtils):
 
 Parameters
 --
-path : str or list
-string, or list of strings, for input path(s).
+path : str
+string for input path.
 
 Other Parameters
 
@@ -641,8 +641,8 @@ class DataStreamReader(OptionUtils):
 
 Parameters
 --
-path : str or list
-string, or list of strings, for input path(s).
+path : str
+string for input path.
 schema : :class:`pyspark.sql.types.StructType` or str, optional
 an optional :class:`pyspark.sql.types.StructType` for the input 
schema
 or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``).





(spark) branch master updated: [SPARK-48228][PYTHON][CONNECT] Implement the missing function validation in ApplyInXXX

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9bb15db85e53 [SPARK-48228][PYTHON][CONNECT] Implement the missing 
function validation in ApplyInXXX
9bb15db85e53 is described below

commit 9bb15db85e53b69b9c0ba112cd1dd93d8213eea4
Author: Ruifeng Zheng 
AuthorDate: Thu May 9 22:01:13 2024 -0700

[SPARK-48228][PYTHON][CONNECT] Implement the missing function validation in 
ApplyInXXX

### What changes were proposed in this pull request?
Implement the missing function validation in ApplyInXXX

https://github.com/apache/spark/pull/46397 fixed this issue for 
`Cogrouped.ApplyInPandas`; this PR fixes the remaining methods.

### Why are the changes needed?
For a better error message:

```
In [12]: df1 = spark.range(11)

In [13]: df2 = df1.groupby("id").applyInPandas(lambda: 1, 
StructType([StructField("d", DoubleType())]))

In [14]: df2.show()
```

before this PR, an invalid function causes weird execution errors:
```
24/05/10 11:37:36 ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 
36)
org.apache.spark.api.python.PythonException: Traceback (most recent call 
last):
  File 
"/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1834, in main
process()
  File 
"/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1826, in process
serializer.dump_stream(out_iter, outfile)
  File 
"/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
 line 531, in dump_stream
return ArrowStreamSerializer.dump_stream(self, 
init_stream_yield_batches(), stream)
   

  File 
"/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
 line 104, in dump_stream
for batch in iterator:
  File 
"/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
 line 524, in init_stream_yield_batches
for series in iterator:
  File 
"/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
1610, in mapper
return f(keys, vals)
   ^
  File 
"/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
488, in 
return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
  ^
  File 
"/Users/ruifeng.zheng/Dev/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
483, in wrapped
result, return_type, _assign_cols_by_name, truncate_return_schema=False
^^
UnboundLocalError: cannot access local variable 'result' where it is not 
associated with a value

at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:523)
at 
org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:117)
at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:479)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:896)

...
```

After this PR, the error happens before execution, which is consistent with 
Spark Classic, and is much clearer:
```
PySparkValueError: [INVALID_PANDAS_UDF] Invalid function: pandas_udf with 
function type GROUPED_MAP or the function in groupby.applyInPandas must take 
either one argument (data) or two arguments (key, data).

```
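
For comparison, a hypothetical function whose signature passes the new 
validation (grouped-map functions must take either `(data)` or `(key, data)`), 
unlike the zero-argument lambda above:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType

spark = SparkSession.builder.getOrCreate()

# Takes a single pandas DataFrame argument, so the validation accepts it.
def as_double(pdf: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({"d": pdf["id"].astype("float64")})

df1 = spark.range(11)
df2 = df1.groupby("id").applyInPandas(
    as_double, StructType([StructField("d", DoubleType())])
)
df2.show()
```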

### Does this PR introduce _any_ user-facing change?
yes, error message changes

### How was this patch tested?
added tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46519 from zhengruifeng/missing_check_in_group.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 

(spark) branch master updated: [SPARK-48224][SQL] Disallow map keys from being of variant type

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b371e7dd8800 [SPARK-48224][SQL] Disallow map keys from being of 
variant type
b371e7dd8800 is described below

commit b371e7dd88009195740f8f5b591447441ea43d0b
Author: Harsh Motwani 
AuthorDate: Thu May 9 21:47:05 2024 -0700

[SPARK-48224][SQL] Disallow map keys from being of variant type

### What changes were proposed in this pull request?

This PR disallows map keys from being of variant type. Therefore, SQL 
statements like `select map(parse_json('{"a": 1}'), 1)`, which worked 
previously, will now throw an exception.
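
As a rough illustration (not taken from the patch, and assuming a build with 
VARIANT support), variant remains usable as a map value while variant keys are 
now rejected:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Variant as a map *value* is still expected to be accepted.
spark.sql("SELECT map(1, parse_json('{\"a\": 1}'))").show(truncate=False)

# Variant as a map *key* should now fail analysis with INVALID_MAP_KEY_TYPE.
try:
    spark.sql("SELECT map(parse_json('{\"a\": 1}'), 1)").show()
except Exception as e:
    print(e)
```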

### Why are the changes needed?

Allowing variant to be the key type of a map can result in undefined 
behavior as this has not been tested.

### Does this PR introduce _any_ user-facing change?

Yes, users could use variants as keys in maps earlier. However, this PR 
disallows this possibility.

### How was this patch tested?

Unit tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46516 from harshmotw-db/map_variant_key.

Authored-by: Harsh Motwani 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/sql/catalyst/util/TypeUtils.scala |  2 +-
 .../catalyst/expressions/ComplexTypeSuite.scala| 34 +-
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala
index d2c708b380cf..a0d578c66e73 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala
@@ -58,7 +58,7 @@ object TypeUtils extends QueryErrorsBase {
   }
 
   def checkForMapKeyType(keyType: DataType): TypeCheckResult = {
-if (keyType.existsRecursively(_.isInstanceOf[MapType])) {
+if (keyType.existsRecursively(dt => dt.isInstanceOf[MapType] || 
dt.isInstanceOf[VariantType])) {
   DataTypeMismatch(
 errorSubClass = "INVALID_MAP_KEY_TYPE",
 messageParameters = Map(
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala
index 5f135e46a377..497b335289b1 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala
@@ -28,7 +28,7 @@ import org.apache.spark.sql.catalyst.util._
 import org.apache.spark.sql.catalyst.util.TypeUtils.ordinalNumber
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
-import org.apache.spark.unsafe.types.UTF8String
+import org.apache.spark.unsafe.types.{UTF8String, VariantVal}
 
 class ComplexTypeSuite extends SparkFunSuite with ExpressionEvalHelper {
 
@@ -359,6 +359,38 @@ class ComplexTypeSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 )
   }
 
+  // map key can't be variant
+  val map6 = CreateMap(Seq(
+Literal.create(new VariantVal(Array[Byte](), Array[Byte]())),
+Literal.create(1)
+  ))
+  map6.checkInputDataTypes() match {
+case TypeCheckResult.TypeCheckSuccess => fail("should not allow variant as 
a part of map key")
+case TypeCheckResult.DataTypeMismatch(errorSubClass, messageParameters) =>
+  assert(errorSubClass == "INVALID_MAP_KEY_TYPE")
+  assert(messageParameters === Map("keyType" -> "\"VARIANT\""))
+  }
+
+  // map key can't contain variant
+  val map7 = CreateMap(
+Seq(
+  CreateStruct(
+Seq(Literal.create(1), Literal.create(new VariantVal(Array[Byte](), 
Array[Byte](
+  ),
+  Literal.create(1)
+)
+  )
+  map7.checkInputDataTypes() match {
+case TypeCheckResult.TypeCheckSuccess => fail("should not allow variant as 
a part of map key")
+case TypeCheckResult.DataTypeMismatch(errorSubClass, messageParameters) =>
+  assert(errorSubClass == "INVALID_MAP_KEY_TYPE")
+  assert(
+messageParameters === Map(
+  "keyType" -> "\"STRUCT\""
+)
+  )
+  }
+
   test("MapFromArrays") {
 val intSeq = Seq(5, 10, 15, 20, 25)
 val longSeq = intSeq.map(_.toLong)





(spark) branch master updated: [SPARK-47018][BUILD][SQL] Bump built-in Hive to 2.3.10

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2d609bfd37ae [SPARK-47018][BUILD][SQL] Bump built-in Hive to 2.3.10
2d609bfd37ae is described below

commit 2d609bfd37ae9a0877fb72d1ba0479bb04a2dad6
Author: Cheng Pan 
AuthorDate: Thu May 9 21:31:50 2024 -0700

[SPARK-47018][BUILD][SQL] Bump built-in Hive to 2.3.10

### What changes were proposed in this pull request?

This PR aims to bump Spark's built-in Hive from 2.3.9 to Hive 2.3.10, with 
two additional changes:

- due to breaking API changes in Thrift, `libthrift` is upgraded from 
`0.12` to `0.16`.
- remove version management of `commons-lang:2.6`; it came in through Hive's 
transitive dependencies, and Hive 2.3.10 drops it in 
https://github.com/apache/hive/pull/4892

This is the first part of https://github.com/apache/spark/pull/45372

### Why are the changes needed?

Bump Hive to the latest 2.3.x version, preparing for upgrading Guava and for 
dropping vulnerable dependencies like Jackson 1.x / Jodd.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA. (wait for sunchao to complete the 2.3.10 release to make jars 
visible on Maven Central)

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45372

Closes #46468 from pan3793/SPARK-47018.

Lead-authored-by: Cheng Pan 
Co-authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 connector/kafka-0-10-assembly/pom.xml  |  5 
 connector/kinesis-asl-assembly/pom.xml |  5 
 dev/deps/spark-deps-hadoop-3-hive-2.3  | 27 +--
 docs/building-spark.md |  4 +--
 docs/sql-data-sources-hive-tables.md   |  8 +++---
 docs/sql-migration-guide.md|  2 +-
 pom.xml| 31 +-
 .../hive/service/auth/KerberosSaslHelper.java  |  5 ++--
 .../apache/hive/service/auth/PlainSaslHelper.java  |  3 ++-
 .../hive/service/auth/TSetIpAddressProcessor.java  |  5 ++--
 .../service/cli/thrift/ThriftBinaryCLIService.java |  6 -
 .../hive/service/cli/thrift/ThriftCLIService.java  | 10 +++
 .../org/apache/spark/sql/hive/HiveUtils.scala  |  2 +-
 .../org/apache/spark/sql/hive/client/package.scala |  5 ++--
 .../hive/HiveExternalCatalogVersionsSuite.scala|  1 -
 .../spark/sql/hive/HiveSparkSubmitSuite.scala  | 10 +++
 .../spark/sql/hive/execution/HiveQuerySuite.scala  |  6 ++---
 17 files changed, 61 insertions(+), 74 deletions(-)

diff --git a/connector/kafka-0-10-assembly/pom.xml 
b/connector/kafka-0-10-assembly/pom.xml
index b2fcbdf8eca7..bd311b3a9804 100644
--- a/connector/kafka-0-10-assembly/pom.xml
+++ b/connector/kafka-0-10-assembly/pom.xml
@@ -54,11 +54,6 @@
       <artifactId>commons-codec</artifactId>
       <scope>provided</scope>
     </dependency>
-    <dependency>
-      <groupId>commons-lang</groupId>
-      <artifactId>commons-lang</artifactId>
-      <scope>provided</scope>
-    </dependency>
     <dependency>
       <groupId>com.google.protobuf</groupId>
       <artifactId>protobuf-java</artifactId>
diff --git a/connector/kinesis-asl-assembly/pom.xml 
b/connector/kinesis-asl-assembly/pom.xml
index 577ec2153083..0e93526fce72 100644
--- a/connector/kinesis-asl-assembly/pom.xml
+++ b/connector/kinesis-asl-assembly/pom.xml
@@ -54,11 +54,6 @@
       <artifactId>jackson-databind</artifactId>
       <scope>provided</scope>
     </dependency>
-    <dependency>
-      <groupId>commons-lang</groupId>
-      <artifactId>commons-lang</artifactId>
-      <scope>provided</scope>
-    </dependency>
     <dependency>
       <groupId>org.glassfish.jersey.core</groupId>
       <artifactId>jersey-client</artifactId>
diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 73d41e9eeb33..392bacd73277 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -46,7 +46,6 @@ commons-compress/1.26.1//commons-compress-1.26.1.jar
 commons-crypto/1.1.0//commons-crypto-1.1.0.jar
 commons-dbcp/1.4//commons-dbcp-1.4.jar
 commons-io/2.16.1//commons-io-2.16.1.jar
-commons-lang/2.6//commons-lang-2.6.jar
 commons-lang3/3.14.0//commons-lang3-3.14.0.jar
 commons-math3/3.6.1//commons-math3-3.6.1.jar
 commons-pool/1.5.4//commons-pool-1.5.4.jar
@@ -81,19 +80,19 @@ hadoop-cloud-storage/3.4.0//hadoop-cloud-storage-3.4.0.jar
 hadoop-huaweicloud/3.4.0//hadoop-huaweicloud-3.4.0.jar
 hadoop-shaded-guava/1.2.0//hadoop-shaded-guava-1.2.0.jar
 hadoop-yarn-server-web-proxy/3.4.0//hadoop-yarn-server-web-proxy-3.4.0.jar
-hive-beeline/2.3.9//hive-beeline-2.3.9.jar
-hive-cli/2.3.9//hive-cli-2.3.9.jar
-hive-common/2.3.9//hive-common-2.3.9.jar
-hive-exec/2.3.9/core/hive-exec-2.3.9-core.jar
-hive-jdbc/2.3.9//hive-jdbc-2.3.9.jar
-hive-llap-common/2.3.9//hive-llap-common-2.3.9.jar
-hive-metastore/2.3.9//hive-metastore-2.3.9.jar
-hive-serde/2.3.9//hive-serde-2.3.9.jar
+hive-beeline/2.3.10//hive-beeline-2.3.10.jar
+hive-cli/2.3.10//hive-cli-2.3.10.jar

(spark) branch master updated: [MINOR][BUILD] Remove duplicate configuration of maven-compiler-plugin

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1138b2a68b54 [MINOR][BUILD] Remove duplicate configuration of 
maven-compiler-plugin
1138b2a68b54 is described below

commit 1138b2a68b5408e6d079bdbce8026323694628e5
Author: zml1206 
AuthorDate: Thu May 9 20:51:32 2024 -0700

[MINOR][BUILD] Remove duplicate configuration of maven-compiler-plugin

### What changes were proposed in this pull request?
The two maven-compiler-plugin compiler-level settings, both pointing at 
`${java.version}` 
(https://github.com/apache/spark/pull/46024/files#diff-9c5fb3d1b7e3b0f54bc5c4182965c4fe1f9023d449017cece3005d3f90e8e4d8R117),
are equivalent duplicate configuration, so the redundant `${java.version}` 
entry is removed.

https://maven.apache.org/plugins/maven-compiler-plugin/examples/set-compiler-release.html

### Why are the changes needed?
Simplify the code and facilitates subsequent configuration iterations.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46024 from zml1206/remove_duplicate_configuration.

Authored-by: zml1206 
Signed-off-by: Dongjoon Hyun 
---
 pom.xml | 1 -
 1 file changed, 1 deletion(-)

diff --git a/pom.xml b/pom.xml
index c3ff5d101c22..678455e6e248 100644
--- a/pom.xml
+++ b/pom.xml
@@ -3127,7 +3127,6 @@
   maven-compiler-plugin
   3.13.0
   
-${java.version}
 true 
 true 
   





(spark) branch master updated: [SPARK-47834][SQL][CONNECT] Mark deprecated functions with `@deprecated` in `SQLImplicits`

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 32b2827b964b [SPARK-47834][SQL][CONNECT] Mark deprecated functions 
with `@deprecated` in `SQLImplicits`
32b2827b964b is described below

commit 32b2827b964bd4a4accb60b47ddd6929f41d4a89
Author: YangJie 
AuthorDate: Thu May 9 20:47:34 2024 -0700

[SPARK-47834][SQL][CONNECT] Mark deprecated functions with `@deprecated` in 
`SQLImplicits`

### What changes were proposed in this pull request?
In the `sql` module, some functions in `SQLImplicits` have already been 
marked as `deprecated` in the function comments after SPARK-19089.

This PR adds the `@deprecated` annotation to them. Since SPARK-19089 happened 
in Spark 2.2.0, the `since` field of `@deprecated` is filled in as `2.2.0`.

At the same time, these `@deprecated` marks have also been synchronized to 
the corresponding functions in `SQLImplicits` in the `connect` module.

### Why are the changes needed?
Mark deprecated functions with `deprecated` in `SQLImplicits`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass Github Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46029 from LuciferYang/deprecated-SQLImplicits.

Lead-authored-by: YangJie 
Co-authored-by: yangjie01 
Signed-off-by: Dongjoon Hyun 
---
 .../jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala   | 9 +
 sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala  | 9 +
 2 files changed, 18 insertions(+)

diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
index 6c626fd716d5..7799d395d5c6 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
@@ -149,6 +149,7 @@ abstract class SQLImplicits private[sql] (session: 
SparkSession) extends LowPrio
* @deprecated
*   use [[newSequenceEncoder]]
*/
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newIntSeqEncoder: Encoder[Seq[Int]] = newSeqEncoder(PrimitiveIntEncoder)
 
   /**
@@ -156,6 +157,7 @@ abstract class SQLImplicits private[sql] (session: 
SparkSession) extends LowPrio
* @deprecated
*   use [[newSequenceEncoder]]
*/
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newLongSeqEncoder: Encoder[Seq[Long]] = 
newSeqEncoder(PrimitiveLongEncoder)
 
   /**
@@ -163,6 +165,7 @@ abstract class SQLImplicits private[sql] (session: 
SparkSession) extends LowPrio
* @deprecated
*   use [[newSequenceEncoder]]
*/
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newDoubleSeqEncoder: Encoder[Seq[Double]] = 
newSeqEncoder(PrimitiveDoubleEncoder)
 
   /**
@@ -170,6 +173,7 @@ abstract class SQLImplicits private[sql] (session: 
SparkSession) extends LowPrio
* @deprecated
*   use [[newSequenceEncoder]]
*/
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newFloatSeqEncoder: Encoder[Seq[Float]] = 
newSeqEncoder(PrimitiveFloatEncoder)
 
   /**
@@ -177,6 +181,7 @@ abstract class SQLImplicits private[sql] (session: 
SparkSession) extends LowPrio
* @deprecated
*   use [[newSequenceEncoder]]
*/
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newByteSeqEncoder: Encoder[Seq[Byte]] = 
newSeqEncoder(PrimitiveByteEncoder)
 
   /**
@@ -184,6 +189,7 @@ abstract class SQLImplicits private[sql] (session: 
SparkSession) extends LowPrio
* @deprecated
*   use [[newSequenceEncoder]]
*/
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newShortSeqEncoder: Encoder[Seq[Short]] = 
newSeqEncoder(PrimitiveShortEncoder)
 
   /**
@@ -191,6 +197,7 @@ abstract class SQLImplicits private[sql] (session: 
SparkSession) extends LowPrio
* @deprecated
*   use [[newSequenceEncoder]]
*/
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newBooleanSeqEncoder: Encoder[Seq[Boolean]] = 
newSeqEncoder(PrimitiveBooleanEncoder)
 
   /**
@@ -198,6 +205,7 @@ abstract class SQLImplicits private[sql] (session: 
SparkSession) extends LowPrio
* @deprecated
*   use [[newSequenceEncoder]]
*/
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   val newStringSeqEncoder: Encoder[Seq[String]] = newSeqEncoder(StringEncoder)
 
   /**
@@ -205,6 +213,7 @@ abstract class SQLImplicits private[sql] (session: 
SparkSession) extends LowPrio
* @deprecated
*   use [[newSequenceEncoder]]
*/
+  @deprecated("Use newSequenceEncoder instead", "2.2.0")
   def newProductSeqEncoder[A <: 

(spark) branch master updated: [SPARK-48176][SQL] Adjust name of FIELD_ALREADY_EXISTS error condition

2024-05-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a41d0ae79b43 [SPARK-48176][SQL] Adjust name of FIELD_ALREADY_EXISTS 
error condition
a41d0ae79b43 is described below

commit a41d0ae79b432e2757379fc56a0ad2755f02e871
Author: Nicholas Chammas 
AuthorDate: Fri May 10 12:23:34 2024 +0900

[SPARK-48176][SQL] Adjust name of FIELD_ALREADY_EXISTS error condition

### What changes were proposed in this pull request?

Rename `FIELDS_ALREADY_EXISTS` to `FIELD_ALREADY_EXISTS`.

### Why are the changes needed?

Though it's not meant to be a proper English sentence, 
`FIELDS_ALREADY_EXISTS` is grammatically incorrect. It should be either "fields 
already exist" or "field already exists". I opted for the latter.

### Does this PR introduce _any_ user-facing change?

Yes, it changes the name of an error condition.

### How was this patch tested?

CI only.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46510 from nchammas/SPARK-48176-field-exists-error.

Authored-by: Nicholas Chammas 
Signed-off-by: Hyukjin Kwon 
---
 common/utils/src/main/resources/error/error-conditions.json   | 2 +-
 .../src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala  | 4 ++--
 .../scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala  | 2 +-
 .../test/scala/org/apache/spark/sql/connector/AlterTableTests.scala   | 4 ++--
 .../apache/spark/sql/connector/V2CommandsCaseSensitivitySuite.scala   | 4 ++--
 .../sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala | 4 ++--
 6 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-conditions.json 
b/common/utils/src/main/resources/error/error-conditions.json
index 8a64c4c590e8..7c9886c749b9 100644
--- a/common/utils/src/main/resources/error/error-conditions.json
+++ b/common/utils/src/main/resources/error/error-conditions.json
@@ -1339,7 +1339,7 @@
 ],
 "sqlState" : "54001"
   },
-  "FIELDS_ALREADY_EXISTS" : {
+  "FIELD_ALREADY_EXISTS" : {
 "message" : [
   "Cannot  column, because  already exists in ."
 ],
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala
index c80fbfc748dd..b60107f90283 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala
@@ -107,7 +107,7 @@ private[v2] trait V2JDBCTest extends SharedSparkSession 
with DockerIntegrationFu
 exception = intercept[AnalysisException] {
   sql(s"ALTER TABLE $catalogName.alt_table ADD COLUMNS (C3 DOUBLE)")
 },
-errorClass = "FIELDS_ALREADY_EXISTS",
+errorClass = "FIELD_ALREADY_EXISTS",
 parameters = Map(
   "op" -> "add",
   "fieldNames" -> "`C3`",
@@ -179,7 +179,7 @@ private[v2] trait V2JDBCTest extends SharedSparkSession 
with DockerIntegrationFu
 exception = intercept[AnalysisException] {
   sql(s"ALTER TABLE $catalogName.alt_table RENAME COLUMN ID1 TO ID2")
 },
-errorClass = "FIELDS_ALREADY_EXISTS",
+errorClass = "FIELD_ALREADY_EXISTS",
 parameters = Map(
   "op" -> "rename",
   "fieldNames" -> "`ID2`",
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index e55f23b6aa86..e18f4d1b36e1 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -1403,7 +1403,7 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
   if (struct.findNestedField(
   fieldNames, includeCollections = true, 
alter.conf.resolver).isDefined) {
 alter.failAnalysis(
-  errorClass = "FIELDS_ALREADY_EXISTS",
+  errorClass = "FIELD_ALREADY_EXISTS",
   messageParameters = Map(
 "op" -> op,
 "fieldNames" -> toSQLId(fieldNames),
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala
index 996d7acb1148..28605958c71d 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala
@@ -466,7 +466,7 @@ 

(spark) branch master updated: [SPARK-48222][INFRA][DOCS] Sync Ruby Bundler to 2.4.22 and refresh Gem lock file

2024-05-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9a2818820f11 [SPARK-48222][INFRA][DOCS] Sync Ruby Bundler to 2.4.22 
and refresh Gem lock file
9a2818820f11 is described below

commit 9a2818820f11f9bdcc042f4ab80850918911c68c
Author: Nicholas Chammas 
AuthorDate: Fri May 10 09:58:16 2024 +0800

[SPARK-48222][INFRA][DOCS] Sync Ruby Bundler to 2.4.22 and refresh Gem lock 
file

### What changes were proposed in this pull request?

Sync the version of Bundler that we are using across various scripts and 
documentation. Also refresh the Gem lock file.

### Why are the changes needed?

We are seeing inconsistent build behavior, likely due to the inconsistent 
Bundler versions.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI + the preview release process.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46512 from nchammas/bundler-sync.

Authored-by: Nicholas Chammas 
Signed-off-by: Wenchen Fan 
---
 .github/workflows/build_and_test.yml   |  3 +++
 dev/create-release/spark-rm/Dockerfile |  2 +-
 docs/Gemfile.lock  | 16 
 docs/README.md |  2 +-
 4 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 4a11823aee60..881fb8cb0674 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -872,6 +872,9 @@ jobs:
 python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421
 - name: Install dependencies for documentation generation
   run: |
+# Keep the version of Bundler here in sync with the following 
locations:
+#   - dev/create-release/spark-rm/Dockerfile
+#   - docs/README.md
 gem install bundler -v 2.4.22
 cd docs
 bundle install
diff --git a/dev/create-release/spark-rm/Dockerfile 
b/dev/create-release/spark-rm/Dockerfile
index 8d5ca38ba88e..13f4112ca03d 100644
--- a/dev/create-release/spark-rm/Dockerfile
+++ b/dev/create-release/spark-rm/Dockerfile
@@ -38,7 +38,7 @@ ENV DEBCONF_NONINTERACTIVE_SEEN true
 ARG APT_INSTALL="apt-get install --no-install-recommends -y"
 
 ARG PIP_PKGS="sphinx==4.5.0 mkdocs==1.1.2 numpy==1.20.3 
pydata_sphinx_theme==0.13.3 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 
jinja2==3.1.2 twine==3.4.1 sphinx-plotly-directive==0.1.3 
sphinx-copybutton==0.5.2 pandas==2.0.3 pyarrow==10.0.1 plotly==5.4.0 
markupsafe==2.0.1 docutils<0.17 grpcio==1.62.0 protobuf==4.21.6 
grpcio-status==1.62.0 googleapis-common-protos==1.56.4"
-ARG GEM_PKGS="bundler:2.3.8"
+ARG GEM_PKGS="bundler:2.4.22"
 
 # Install extra needed repos and refresh.
 # - CRAN repo
diff --git a/docs/Gemfile.lock b/docs/Gemfile.lock
index 4e38f18703f3..e137f0f039b9 100644
--- a/docs/Gemfile.lock
+++ b/docs/Gemfile.lock
@@ -4,16 +4,16 @@ GEM
 addressable (2.8.6)
   public_suffix (>= 2.0.2, < 6.0)
 colorator (1.1.0)
-concurrent-ruby (1.2.2)
+concurrent-ruby (1.2.3)
 em-websocket (0.5.3)
   eventmachine (>= 0.12.9)
   http_parser.rb (~> 0)
 eventmachine (1.2.7)
 ffi (1.16.3)
 forwardable-extended (2.6.0)
-google-protobuf (3.25.2)
+google-protobuf (3.25.3)
 http_parser.rb (0.8.0)
-i18n (1.14.1)
+i18n (1.14.5)
   concurrent-ruby (~> 1.0)
 jekyll (4.3.3)
   addressable (~> 2.4)
@@ -42,22 +42,22 @@ GEM
 kramdown-parser-gfm (1.1.0)
   kramdown (~> 2.0)
 liquid (4.0.4)
-listen (3.8.0)
+listen (3.9.0)
   rb-fsevent (~> 0.10, >= 0.10.3)
   rb-inotify (~> 0.9, >= 0.9.10)
 mercenary (0.4.0)
 pathutil (0.16.2)
   forwardable-extended (~> 2.6)
-public_suffix (5.0.4)
-rake (13.1.0)
+public_suffix (5.0.5)
+rake (13.2.1)
 rb-fsevent (0.11.2)
 rb-inotify (0.10.1)
   ffi (~> 1.0)
 rexml (3.2.6)
 rouge (3.30.0)
 safe_yaml (1.0.5)
-sass-embedded (1.69.7)
-  google-protobuf (~> 3.25)
+sass-embedded (1.63.6)
+  google-protobuf (~> 3.23)
   rake (>= 13.0.0)
 terminal-table (3.0.2)
   unicode-display_width (>= 1.1.1, < 3)
diff --git a/docs/README.md b/docs/README.md
index 414c8dbd8303..363f1c207636 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -36,7 +36,7 @@ You need to have [Ruby 3][ruby] and [Python 3][python] 
installed. Make sure the
 [python]: https://www.python.org/downloads/
 
 ```sh
-$ gem install bundler
+$ gem install bundler -v 2.4.22
 ```
 
 After this all the required Ruby dependencies can be installed from the 
`docs/` directory via Bundler:



(spark) branch master updated: [SPARK-48227][PYTHON][DOC] Document the requirement of seed in protos

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 012d19d8e9b2 [SPARK-48227][PYTHON][DOC] Document the requirement of 
seed in protos
012d19d8e9b2 is described below

commit 012d19d8e9b28f7ce266753bcfff4a76c9510245
Author: Ruifeng Zheng 
AuthorDate: Thu May 9 16:58:44 2024 -0700

[SPARK-48227][PYTHON][DOC] Document the requirement of seed in protos

### What changes were proposed in this pull request?
Document the requirement of seed in protos

### Why are the changes needed?
The seed should be set on the client side.

Documenting this helps avoid cases like https://github.com/apache/spark/pull/46456.
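
A sketch of the intended behavior (assumed, not code from this patch): because 
the client chooses a concrete seed while building the plan, re-executing the 
same sampled DataFrame is deterministic:

```python
from pyspark.sql import SparkSession

# Assumes a Spark Connect session; the behavior is what the proto comment promises.
spark = SparkSession.builder.getOrCreate()

df = spark.range(100)
sampled = df.sample(fraction=0.5)  # client fills in the seed at plan-build time

# The same DataFrame object should sample the same rows on every execution.
assert sampled.collect() == sampled.collect()
```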

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46518 from zhengruifeng/doc_random.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 .../common/src/main/protobuf/spark/connect/relations.proto |  8 ++--
 python/pyspark/sql/connect/plan.py | 10 --
 python/pyspark/sql/connect/proto/relations_pb2.pyi | 10 --
 3 files changed, 18 insertions(+), 10 deletions(-)

diff --git 
a/connector/connect/common/src/main/protobuf/spark/connect/relations.proto 
b/connector/connect/common/src/main/protobuf/spark/connect/relations.proto
index 3882b2e85396..0b3c9d4253e8 100644
--- a/connector/connect/common/src/main/protobuf/spark/connect/relations.proto
+++ b/connector/connect/common/src/main/protobuf/spark/connect/relations.proto
@@ -467,7 +467,9 @@ message Sample {
   // (Optional) Whether to sample with replacement.
   optional bool with_replacement = 4;
 
-  // (Optional) The random seed.
+  // (Required) The random seed.
+  // This filed is required to avoid generate mutable dataframes (see 
SPARK-48184 for details),
+  // however, still keep it 'optional' here for backward compatibility.
   optional int64 seed = 5;
 
   // (Required) Explicitly sort the underlying plan to make the ordering 
deterministic or cache it.
@@ -687,7 +689,9 @@ message StatSampleBy {
   // If a stratum is not specified, we treat its fraction as zero.
   repeated Fraction fractions = 3;
 
-  // (Optional) The random seed.
+  // (Required) The random seed.
+  // This filed is required to avoid generate mutable dataframes (see 
SPARK-48184 for details),
+  // however, still keep it 'optional' here for backward compatibility.
   optional int64 seed = 5;
 
   message Fraction {
diff --git a/python/pyspark/sql/connect/plan.py 
b/python/pyspark/sql/connect/plan.py
index 4ac4946745f5..3d3303fb15c5 100644
--- a/python/pyspark/sql/connect/plan.py
+++ b/python/pyspark/sql/connect/plan.py
@@ -717,7 +717,7 @@ class Sample(LogicalPlan):
 lower_bound: float,
 upper_bound: float,
 with_replacement: bool,
-seed: Optional[int],
+seed: int,
 deterministic_order: bool = False,
 ) -> None:
 super().__init__(child)
@@ -734,8 +734,7 @@ class Sample(LogicalPlan):
 plan.sample.lower_bound = self.lower_bound
 plan.sample.upper_bound = self.upper_bound
 plan.sample.with_replacement = self.with_replacement
-if self.seed is not None:
-plan.sample.seed = self.seed
+plan.sample.seed = self.seed
 plan.sample.deterministic_order = self.deterministic_order
 return plan
 
@@ -1526,7 +1525,7 @@ class StatSampleBy(LogicalPlan):
 child: Optional["LogicalPlan"],
 col: Column,
 fractions: Sequence[Tuple[Column, float]],
-seed: Optional[int],
+seed: int,
 ) -> None:
 super().__init__(child)
 
@@ -1554,8 +1553,7 @@ class StatSampleBy(LogicalPlan):
 fraction.stratum.CopyFrom(k.to_plan(session).literal)
 fraction.fraction = float(v)
 plan.sample_by.fractions.append(fraction)
-if self._seed is not None:
-plan.sample_by.seed = self._seed
+plan.sample_by.seed = self._seed
 return plan
 
 
diff --git a/python/pyspark/sql/connect/proto/relations_pb2.pyi 
b/python/pyspark/sql/connect/proto/relations_pb2.pyi
index 5dfb47da67a9..9b6f4b43544f 100644
--- a/python/pyspark/sql/connect/proto/relations_pb2.pyi
+++ b/python/pyspark/sql/connect/proto/relations_pb2.pyi
@@ -1865,7 +1865,10 @@ class Sample(google.protobuf.message.Message):
 with_replacement: builtins.bool
 """(Optional) Whether to sample with replacement."""
 seed: builtins.int
-"""(Optional) The random seed."""
+"""(Required) The random seed.
+This filed is required to avoid generate mutable dataframes (see 
SPARK-48184 for details),
+however, still keep it 'optional' here for 

(spark) branch master updated: [SPARK-48180][SQL] Improve error when UDTF call with TABLE arg forgets parentheses around multiple PARTITION/ORDER BY exprs

2024-05-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 71f0eda71bc1 [SPARK-48180][SQL] Improve error when UDTF call with 
TABLE arg forgets parentheses around multiple PARTITION/ORDER BY exprs
71f0eda71bc1 is described below

commit 71f0eda71bc169a5245f4412ec0957728025a66c
Author: Daniel Tenedorio 
AuthorDate: Fri May 10 08:44:07 2024 +0900

[SPARK-48180][SQL] Improve error when UDTF call with TABLE arg forgets 
parentheses around multiple PARTITION/ORDER BY exprs

### What changes were proposed in this pull request?

This PR improves the error message when a table-valued function call has a 
TABLE argument with a PARTITION BY or ORDER BY clause with more than one 
associated expression, but forgets parentheses around them.

For example:

```
SELECT * FROM testUDTF(
  TABLE(SELECT 1 AS device_id, 2 AS data_ds)
  WITH SINGLE PARTITION
  ORDER BY device_id, data_ds)
```

This query previously returned an obscure, unrelated error:

```
[UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_TABLE_ARGUMENT] 
Unsupported subquery expression: Table arguments are used in a function where 
they are not supported:
'UnresolvedTableValuedFunction [tvf], [table-argument#338 [], 'data_ds], false
   +- Project [1 AS device_id#336, 2 AS data_ds#337]
      +- OneRowRelation
```

Now it returns a reasonable error:

```
The table function call includes a table argument with an invalid 
partitioning/ordering specification: the ORDER BY clause included multiple 
expressions without parentheses surrounding them; please add parentheses around 
these expressions and then retry the query again. (line 4, pos 2)

== SQL ==

SELECT * FROM testUDTF(
  TABLE(SELECT 1 AS device_id, 2 AS data_ds)
  WITH SINGLE PARTITION
--^^^
  ORDER BY device_id, data_ds)
```
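
For completeness, a hypothetical corrected call: wrapping the multiple ORDER BY 
expressions in parentheses satisfies the grammar (`testUDTF` is the same 
placeholder as above and would need to be registered first):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parenthesizing the multi-expression ORDER BY avoids the new, clearer error.
spark.sql("""
    SELECT * FROM testUDTF(
      TABLE(SELECT 1 AS device_id, 2 AS data_ds)
      WITH SINGLE PARTITION
      ORDER BY (device_id, data_ds))
""").show()
```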

### Why are the changes needed?

Here we improve error messages for common SQL syntax mistakes.

### Does this PR introduce _any_ user-facing change?

Yes, see above.

### How was this patch tested?

This PR adds test coverage.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46451 from dtenedor/udtf-analyzer-bug.

Authored-by: Daniel Tenedorio 
Signed-off-by: Hyukjin Kwon 
---
 .../spark/sql/catalyst/parser/SqlBaseParser.g4 |  2 ++
 .../spark/sql/catalyst/parser/AstBuilder.scala | 14 
 .../sql/catalyst/parser/PlanParserSuite.scala  | 19 +++-
 .../sql/execution/python/PythonUDTFSuite.scala | 26 ++
 4 files changed, 56 insertions(+), 5 deletions(-)

diff --git 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
index 653224c5475f..249f55fa40ac 100644
--- 
a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
+++ 
b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4
@@ -838,9 +838,11 @@ tableArgumentPartitioning
 : ((WITH SINGLE PARTITION)
 | ((PARTITION | DISTRIBUTE) BY
 (((LEFT_PAREN partition+=expression (COMMA partition+=expression)* 
RIGHT_PAREN))
+| (expression (COMMA invalidMultiPartitionExpression=expression)+)
 | partition+=expression)))
   ((ORDER | SORT) BY
 (((LEFT_PAREN sortItem (COMMA sortItem)* RIGHT_PAREN)
+| (sortItem (COMMA invalidMultiSortItem=sortItem)+)
 | sortItem)))?
 ;
 
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
index 7d2355b2f08d..326f1e7684b9 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
@@ -1638,6 +1638,20 @@ class AstBuilder extends DataTypeAstBuilder with 
SQLConfHelper with Logging {
   }
   partitionByExpressions = p.partition.asScala.map(expression).toSeq
   orderByExpressions = p.sortItem.asScala.map(visitSortItem).toSeq
+  def invalidPartitionOrOrderingExpression(clause: String): String = {
+"The table function call includes a table argument with an invalid " +
+  s"partitioning/ordering specification: the $clause clause included 
multiple " +
+  "expressions without parentheses surrounding them; please add 
parentheses around " +

(spark) branch master updated (b47d7853d92f -> e704b9e56b0c)

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from b47d7853d92f [SPARK-48148][CORE] JSON objects should not be modified 
when read as STRING
 add e704b9e56b0c [SPARK-48226][BUILD] Add `spark-ganglia-lgpl` to 
`lint-java` & `spark-ganglia-lgpl` and `jvm-profiler` to `sbt-checkstyle`

No new revisions were added by this update.

Summary of changes:
 dev/lint-java  | 2 +-
 dev/sbt-checkstyle | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)





(spark) branch master updated: [SPARK-48148][CORE] JSON objects should not be modified when read as STRING

2024-05-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b47d7853d92f [SPARK-48148][CORE] JSON objects should not be modified 
when read as STRING
b47d7853d92f is described below

commit b47d7853d92f733791513094af04fc18ec947246
Author: Eric Maynard 
AuthorDate: Fri May 10 08:41:24 2024 +0900

[SPARK-48148][CORE] JSON objects should not be modified when read as STRING

### What changes were proposed in this pull request?

Currently, when reading a JSON like this:
```
{"a": {"b": -999.995}}
```

With the schema:
```
a STRING
```

Spark will yield a result like this:
```
{"b": -1000.0}
```

Other changes such as changes to the input string's whitespace may also 
occur. In some cases, we apply scientific notation to an input floating-point 
number when reading it as STRING.

This applies to reading JSON files (as with `spark.read.json`) as well as 
the SQL expression `from_json`.
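
A rough PySpark sketch of the intended behavior (values reuse the example above; 
it assumes the new exact-string parsing is enabled, the gating conf is in the 
diff and not repeated here):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([('{"a": {"b": -999.995}}',)], "value STRING")

# Extracting field "a" as STRING should now preserve the raw JSON text instead of
# re-serializing it (which previously could yield {"b": -1000.0}).
result = df.select(F.from_json("value", "a STRING").alias("parsed")).select("parsed.a")
result.show(truncate=False)  # expected: {"b": -999.995}
```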

### Why are the changes needed?

Correctness issues may occur if a field is read as a STRING and then later 
parsed (e.g. with `from_json`) after the contents have been modified.

### Does this PR introduce _any_ user-facing change?
Yes, when reading non-string fields from a JSON object using the STRING 
type, we will now extract the field exactly as it appears.

### How was this patch tested?
Added a test in `JsonSuite.scala`

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46408 from eric-maynard/SPARK-48148.

Lead-authored-by: Eric Maynard 
Co-authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .../spark/sql/catalyst/json/JacksonParser.scala| 63 +++---
 .../org/apache/spark/sql/internal/SQLConf.scala|  9 
 .../sql/execution/datasources/json/JsonSuite.scala | 58 
 3 files changed, 122 insertions(+), 8 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
index 5e75ff6f6e1a..b2c302fbbbe3 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
@@ -24,6 +24,7 @@ import scala.collection.mutable.ArrayBuffer
 import scala.util.control.NonFatal
 
 import com.fasterxml.jackson.core._
+import org.apache.hadoop.fs.PositionedReadable
 
 import org.apache.spark.SparkUpgradeException
 import org.apache.spark.internal.Logging
@@ -275,19 +276,63 @@ class JacksonParser(
   }
   }
 
-case _: StringType =>
-  (parser: JsonParser) => parseJsonToken[UTF8String](parser, dataType) {
+case _: StringType => (parser: JsonParser) => {
+  // This must be enabled if we will retrieve the bytes directly from the 
raw content:
+  val includeSourceInLocation = 
JsonParser.Feature.INCLUDE_SOURCE_IN_LOCATION
+  val originalMask = if 
(includeSourceInLocation.enabledIn(parser.getFeatureMask)) {
+1
+  } else {
+0
+  }
+  parser.overrideStdFeatures(includeSourceInLocation.getMask, 
includeSourceInLocation.getMask)
+  val result = parseJsonToken[UTF8String](parser, dataType) {
 case VALUE_STRING =>
   UTF8String.fromString(parser.getText)
 
-case _ =>
+case other =>
   // Note that it always tries to convert the data as string without 
the case of failure.
-  val writer = new ByteArrayOutputStream()
-  Utils.tryWithResource(factory.createGenerator(writer, 
JsonEncoding.UTF8)) {
-generator => generator.copyCurrentStructure(parser)
+  val startLocation = parser.currentTokenLocation()
+  def skipAhead(): Unit = {
+other match {
+  case START_OBJECT =>
+parser.skipChildren()
+  case START_ARRAY =>
+parser.skipChildren()
+  case _ =>
+  // Do nothing in this case; we've already read the token
+}
   }
-  UTF8String.fromBytes(writer.toByteArray)
-  }
+
+  // PositionedReadable
+  startLocation.contentReference().getRawContent match {
+case byteArray: Array[Byte] if exactStringParsing =>
+  skipAhead()
+  val endLocation = parser.currentLocation.getByteOffset
+
+  UTF8String.fromBytes(
+byteArray,
+startLocation.getByteOffset.toInt,
+endLocation.toInt - (startLocation.getByteOffset.toInt))
+case positionedReadable: PositionedReadable if 

(spark) branch branch-3.5 updated (da4c808be7d6 -> dc4911725baa)

2024-05-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


from da4c808be7d6 [SPARK-48197][SQL][TESTS][FOLLOWUP][3.5] Regenerate 
golden files
 add dc4911725baa [SPARK-48089][SS][CONNECT] Fix 3.5 <> 4.0 
StreamingQueryListener compatibility test

No new revisions were added by this update.

Summary of changes:
 .../connect/streaming/test_parity_listener.py  | 89 +++---
 1 file changed, 44 insertions(+), 45 deletions(-)





(spark) branch branch-3.5 updated: [SPARK-48197][SQL][TESTS][FOLLOWUP][3.5] Regenerate golden files

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new da4c808be7d6 [SPARK-48197][SQL][TESTS][FOLLOWUP][3.5] Regenerate 
golden files
da4c808be7d6 is described below

commit da4c808be7d66dc61fdcb3b41254eef77298a72c
Author: Dongjoon Hyun 
AuthorDate: Thu May 9 14:46:01 2024 -0700

[SPARK-48197][SQL][TESTS][FOLLOWUP][3.5] Regenerate golden files

### What changes were proposed in this pull request?

This PR is a follow-up to regenerate golden files for branch-3.5
- #46475

### Why are the changes needed?

To recover branch-3.5 CI.
- https://github.com/apache/spark/actions/runs/9011670853/job/24786397001
```
[info] *** 4 TESTS FAILED ***
[error] Failed: Total 3036, Failed 4, Errors 0, Passed 3032, Ignored 3
[error] Failed tests:
[error] org.apache.spark.sql.SQLQueryTestSuite
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46514 from dongjoon-hyun/SPARK-48197.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../sql-tests/analyzer-results/ansi/higher-order-functions.sql.out   | 1 -
 .../resources/sql-tests/analyzer-results/higher-order-functions.sql.out  | 1 -
 .../test/resources/sql-tests/results/ansi/higher-order-functions.sql.out | 1 -
 .../src/test/resources/sql-tests/results/higher-order-functions.sql.out  | 1 -
 4 files changed, 4 deletions(-)

diff --git 
a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/higher-order-functions.sql.out
 
b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/higher-order-functions.sql.out
index 3fafb9858e5a..8fe6e7097e67 100644
--- 
a/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/higher-order-functions.sql.out
+++ 
b/sql/core/src/test/resources/sql-tests/analyzer-results/ansi/higher-order-functions.sql.out
@@ -40,7 +40,6 @@ select ceil(x -> x) as v
 org.apache.spark.sql.AnalysisException
 {
   "errorClass" : "INVALID_LAMBDA_FUNCTION_CALL.NON_HIGHER_ORDER_FUNCTION",
-  "sqlState" : "42K0D",
   "messageParameters" : {
 "class" : 
"org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$"
   },
diff --git 
a/sql/core/src/test/resources/sql-tests/analyzer-results/higher-order-functions.sql.out
 
b/sql/core/src/test/resources/sql-tests/analyzer-results/higher-order-functions.sql.out
index d9e88ac618aa..d85101986078 100644
--- 
a/sql/core/src/test/resources/sql-tests/analyzer-results/higher-order-functions.sql.out
+++ 
b/sql/core/src/test/resources/sql-tests/analyzer-results/higher-order-functions.sql.out
@@ -40,7 +40,6 @@ select ceil(x -> x) as v
 org.apache.spark.sql.AnalysisException
 {
   "errorClass" : "INVALID_LAMBDA_FUNCTION_CALL.NON_HIGHER_ORDER_FUNCTION",
-  "sqlState" : "42K0D",
   "messageParameters" : {
 "class" : 
"org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$"
   },
diff --git 
a/sql/core/src/test/resources/sql-tests/results/ansi/higher-order-functions.sql.out
 
b/sql/core/src/test/resources/sql-tests/results/ansi/higher-order-functions.sql.out
index eb9c454109f0..dceb370c8388 100644
--- 
a/sql/core/src/test/resources/sql-tests/results/ansi/higher-order-functions.sql.out
+++ 
b/sql/core/src/test/resources/sql-tests/results/ansi/higher-order-functions.sql.out
@@ -40,7 +40,6 @@ struct<>
 org.apache.spark.sql.AnalysisException
 {
   "errorClass" : "INVALID_LAMBDA_FUNCTION_CALL.NON_HIGHER_ORDER_FUNCTION",
-  "sqlState" : "42K0D",
   "messageParameters" : {
 "class" : 
"org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$"
   },
diff --git 
a/sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out 
b/sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out
index eb9c454109f0..dceb370c8388 100644
--- 
a/sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out
+++ 
b/sql/core/src/test/resources/sql-tests/results/higher-order-functions.sql.out
@@ -40,7 +40,6 @@ struct<>
 org.apache.spark.sql.AnalysisException
 {
   "errorClass" : "INVALID_LAMBDA_FUNCTION_CALL.NON_HIGHER_ORDER_FUNCTION",
-  "sqlState" : "42K0D",
   "messageParameters" : {
 "class" : 
"org.apache.spark.sql.catalyst.expressions.CeilExpressionBuilder$"
   },





svn commit: r69065 - /dev/spark/v4.0.0-preview1-rc1-bin/

2024-05-09 Thread wenchen
Author: wenchen
Date: Thu May  9 16:31:11 2024
New Revision: 69065

Log:
Apache Spark v4.0.0-preview1-rc1

Added:
dev/spark/v4.0.0-preview1-rc1-bin/
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz   (with props)
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz   (with 
props)
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.asc
dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.sha512
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz   
(with props)
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.asc

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-hadoop3.tgz.sha512

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz   
(with props)

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.asc

dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1-bin-without-hadoop.tgz.sha512
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz   (with props)
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz.asc
dev/spark/v4.0.0-preview1-rc1-bin/spark-4.0.0-preview1.tgz.sha512

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc
==
--- dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc (added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.asc Thu May  9 
16:31:11 2024
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJHBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmY8+e4THHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/Wv78D/9aNsBANuVpIjYr+XkWYaimRLJ5IT0Z
+qKehjJBuMBDaBMMN3iWconDHBiASQT0FTYGDBeYI72fLFSMKBna5+Lu22+KD/K6h
+V8SZxPSQsAHQABYq9ha++XXyo1Vo+msPQ0pQAblmTrSpsvSWZmC8spzb5GbKYvK5
+kxr4Qt1XnHeGNJNToqGlbl/Hc2Etg5PkPBxMPBWMh7kLknMEscMNUf87JqCIa8LG
+hMid/0lrrevEm8gkuu0ol9Vgz4P+dreKE9eCfmWOXCod04y8tJnVPs83wUOZfmKV
+dHkELaMVwz3fa40QP77gK38K5i22aUgYk6dvhB+OgtatZ5tk0Dxp3AI2OObngEUm
+4cGmQLwcses53vApwkExq427gS8td4sTE2G1D4+hSdEcm8Fj69w4Ado/DlIAHZob
+KLV15qtNOyaIapT4GxBqoeqsw7tnRmxiP8K8UxFcPV/vZC1yQKIIULigPjttZKoW
++REE2N7ZyPvbvgItwjAL8hpCeYEkd7RDa7ofHAv6icC1qSsJZ9gxFM4rJvriI4g2
+tnYEvZduGpBunhlwVb0R3kAF5XoLIZQ5qm6kyWAzioc0gxzYVc3Rd+bXjm+vmopt
+bXHOM6N2lLQwqnWlHsyjGVFugrkkRXZbQbIV6FynXpKaz5YtkUhUMkofz7mOYhBi
++1Z8nZ04B6YLbw==
+=85FX
+-END PGP SIGNATURE-

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512
==
--- dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512 (added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/pyspark-4.0.0.dev1.tar.gz.sha512 Thu May  
9 16:31:11 2024
@@ -0,0 +1 @@
+2509cf6473495b0cd5c132d87f5e1c33593fa7375ca01bcab1483093cea92bdb6ad7afc7c72095376b28fc5acdc71bb323935d17513f33ee5276c6991ff668d1
  pyspark-4.0.0.dev1.tar.gz

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz
==
Binary file - no diff available.

Propchange: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz
--
svn:mime-type = application/octet-stream

Added: dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.asc
==
--- dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.asc 
(added)
+++ dev/spark/v4.0.0-preview1-rc1-bin/pyspark-connect-4.0.0.dev1.tar.gz.asc Thu 
May  9 16:31:11 2024
@@ -0,0 +1,17 @@
+-BEGIN PGP SIGNATURE-
+
+iQJGBAABCgAxFiEETclnbO+ag+mPygJ4TWYghDzYf1oFAmY8+fATHHdlbmNoZW5A
+YXBhY2hlLm9yZwAKCRBNZiCEPNh/WoCMD/iZjkaGTUqt3jkIjWIUzpQo+kLn8//m
+f+hwUtAguXvbMJXwBOz/Q/f+KvGk0tutsbd6rmBB6cHjH4GoZPp1x6iBitFAO47r
+kHy/0xYkb70SPQCWIGQQpRv3g0uxTmpqL9H4YcIvexkV2wXG5VSwGvbSI4596n7l
+x7M3rRmFzrxhcNIYLQdhNuat0mwuJFWe6R7Zk7UYFFishn9dNt8EOYx8vsGAuMP8
+Uy3+7oZQOAGqdQGSL7Ev4Pqve7MrrPgGXaixGukXibi707NCURnHTDcenPfoEEiQ
+Hj83I3G+JrRhtsue/103a/GnHheUgwE8oEkefnUX7qC5tSn4T8lI2KpDBv9AL1pm
+Bv0eXf5X5xEM4wvO7DCgbeEDPLg72jjt9X8zjAYx05HddvTuPjeKEL+Ga6G0ueTz
+HRXHrgd1EFZ1znPZhWiSTmeqZTXdrb6wKTYt8Y6mk1oEGL3b0qE2LNkSED+4l40u
+41MlV3pmZyjRGYZl29XZKf4isKYyjec7UbJSM5ok4zCRF0p8Gvj0EihGS4X6rYpW
+9XxwjViKMIp7DCEcWjWpO6pJ8Ygb2Snh1UTFFgtzSVAoMqUgHnBHejJ4RA4ncHu6

(spark) branch master updated: [SPARK-48216][TESTS] Remove overrides DockerJDBCIntegrationSuite.connectionTimeout to make related tests configurable

2024-05-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e1fb1d7e063a [SPARK-48216][TESTS] Remove overrides 
DockerJDBCIntegrationSuite.connectionTimeout to make related tests configurable
e1fb1d7e063a is described below

commit e1fb1d7e063af7e8eb6e992c800902aff6e19e15
Author: Kent Yao 
AuthorDate: Thu May 9 08:37:07 2024 -0700

[SPARK-48216][TESTS] Remove overrides 
DockerJDBCIntegrationSuite.connectionTimeout to make related tests configurable

### What changes were proposed in this pull request?

This PR removes the per-suite overrides of DockerJDBCIntegrationSuite.connectionTimeout so that the connection timeout of the related tests remains configurable.

### Why are the changes needed?

The database Docker containers sometimes need more time to bootstrap. The timeout should be configurable so that failures like the following can be avoided:

```scala
[info] org.apache.spark.sql.jdbc.DB2IntegrationSuite *** ABORTED *** (3 
minutes, 11 seconds)
[info]   The code passed to eventually never returned normally. Attempted 
96 times over 3.00399815763 minutes. Last failure message: 
[jcc][t4][2030][11211][4.33.31] A communication error occurred during 
operations on the connection's underlying socket, socket input stream,
[info]   or socket output stream.  Error location: Reply.fill() - 
insufficient data (-1).  Message: Insufficient data. ERRORCODE=-4499, 
SQLSTATE=08001. (DockerJDBCIntegrationSuite.scala:215)
[info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
[info]   at 
org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219)
[info]   at 
org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226)
[info]   at 
org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313)
[info]   at 
org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312)
```

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

Passing GA

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #46505 from yaooqinn/SPARK-48216.

Authored-by: Kent Yao 
Signed-off-by: Dongjoon Hyun 
---
 .../test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala| 4 
 .../test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala | 3 ---
 .../test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala | 4 
 .../test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala | 3 ---
 .../org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala| 4 
 .../scala/org/apache/spark/sql/jdbc/v2/MySQLIntegrationSuite.scala| 4 
 .../scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala   | 4 
 7 files changed, 26 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala
index aca174cce194..4ece4d2088f4 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala
@@ -21,8 +21,6 @@ import java.math.BigDecimal
 import java.sql.{Connection, Date, Timestamp}
 import java.util.Properties
 
-import org.scalatest.time.SpanSugar._
-
 import org.apache.spark.sql.{Row, SaveMode}
 import org.apache.spark.sql.catalyst.util.DateTimeTestUtils._
 import org.apache.spark.sql.internal.SQLConf
@@ -41,8 +39,6 @@ import org.apache.spark.tags.DockerTest
 class DB2IntegrationSuite extends DockerJDBCIntegrationSuite {
   override val db = new DB2DatabaseOnDocker
 
-  override val connectionTimeout = timeout(3.minutes)
-
   override def dataPreparation(conn: Connection): Unit = {
 conn.prepareStatement("CREATE TABLE tbl (x INTEGER, y 
VARCHAR(8))").executeUpdate()
 conn.prepareStatement("INSERT INTO tbl VALUES (42,'fred')").executeUpdate()
diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala
index abb683c06495..4899de2b2a14 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala
@@ -24,7 +24,6 @@ import javax.security.auth.login.Configuration
 import com.github.dockerjava.api.model.{AccessMode, Bind, ContainerConfig, 
HostConfig, Volume}
 import org.apache.hadoop.security.{SecurityUtil, UserGroupInformation}
 import 

(spark) branch master updated: [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions (for UTF8_BINARY & LCASE)

2024-05-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 21333f8c1fc0 [SPARK-47409][SQL] Add support for collation for 
StringTrim type of functions/expressions (for UTF8_BINARY & LCASE)
21333f8c1fc0 is described below

commit 21333f8c1fc01756e6708ad6ccf21f585fcb881d
Author: David Milicevic 
AuthorDate: Thu May 9 23:05:20 2024 +0800

[SPARK-47409][SQL] Add support for collation for StringTrim type of 
functions/expressions (for UTF8_BINARY & LCASE)

Recreating the [original PR](https://github.com/apache/spark/pull/45749) because the code has been reorganized in [this PR](https://github.com/apache/spark/pull/45978).

### What changes were proposed in this pull request?
This PR is created to add support for collations to StringTrim family of 
functions/expressions, specifically:
- `StringTrim`
- `StringTrimBoth`
- `StringTrimLeft`
- `StringTrimRight`

Changes:
- `CollationSupport.java`
  - Add new `StringTrim`, `StringTrimLeft` and `StringTrimRight` classes 
with corresponding logic.
  - `CollationAwareUTF8String` - add new `trim`, `trimLeft` and `trimRight` 
methods that actually implement trim logic.
- `UTF8String.java` - expose some of the methods publicly.
- `stringExpressions.scala`
  - Change input types.
  - Change eval and code gen logic.
- `CollationTypeCasts.scala` - add `StringTrim*` expressions to 
`CollationTypeCasts` rules.

### Why are the changes needed?
We are incrementally adding collation support to the built-in string functions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes:
- Users should now be able to use non-default collations in string trim functions; see the sketch below.
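
For illustration only, a minimal spark-shell sketch of the new behavior (the collation name `UTF8_BINARY_LCASE`, the literal values and the expected output are assumptions based on this description, not taken from the patch):

```scala
// Hedged sketch: with a case-insensitive collation the trim characters are
// matched regardless of case, so both 'x' and 'X' should be stripped.
// Assumes an active SparkSession named `spark` (e.g. in spark-shell).
val df = spark.sql(
  "SELECT TRIM(BOTH 'x' FROM 'xXhellOxx' COLLATE UTF8_BINARY_LCASE) AS trimmed")
df.show()
// Expected (assumption): a single row containing "hellO".
```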

### How was this patch tested?
Already existing tests + new unit/e2e tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46206 from davidm-db/string-trim-functions.

Authored-by: David Milicevic 
Signed-off-by: Wenchen Fan 
---
 .../catalyst/util/CollationAwareUTF8String.java| 470 ++
 .../spark/sql/catalyst/util/CollationSupport.java  | 534 -
 .../org/apache/spark/unsafe/types/UTF8String.java  |   2 +-
 .../spark/unsafe/types/CollationSupportSuite.java  | 193 
 .../sql/catalyst/analysis/CollationTypeCasts.scala |   2 +-
 .../catalyst/expressions/stringExpressions.scala   |  53 +-
 .../sql/CollationStringExpressionsSuite.scala  | 161 ++-
 7 files changed, 1054 insertions(+), 361 deletions(-)

diff --git 
a/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
new file mode 100644
index ..ee0d611d7e65
--- /dev/null
+++ 
b/common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationAwareUTF8String.java
@@ -0,0 +1,470 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.util;
+
+import com.ibm.icu.lang.UCharacter;
+import com.ibm.icu.text.BreakIterator;
+import com.ibm.icu.text.StringSearch;
+import com.ibm.icu.util.ULocale;
+
+import org.apache.spark.unsafe.UTF8StringBuilder;
+import org.apache.spark.unsafe.types.UTF8String;
+
+import static org.apache.spark.unsafe.Platform.BYTE_ARRAY_OFFSET;
+import static org.apache.spark.unsafe.Platform.copyMemory;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * Utility class for collation-aware UTF8String operations.
+ */
+public class CollationAwareUTF8String {
+  public static UTF8String replace(final UTF8String src, final UTF8String 
search,
+  final UTF8String replace, final int collationId) {
+// This collation aware implementation is based on existing implementation 
on UTF8String
+if (src.numBytes() == 0 || search.numBytes() == 0) {
+  return src;
+}
+
+StringSearch stringSearch = CollationFactory.getStringSearch(src, search, 
collationId);
+
+// Find the 

(spark) branch master updated: [SPARK-47803][FOLLOWUP] Check nulls when casting nested type to variant

2024-05-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3fd38d4c07f6 [SPARK-47803][FOLLOWUP] Check nulls when casting nested 
type to variant
3fd38d4c07f6 is described below

commit 3fd38d4c07f6c998ec8bb234796f83a6aecfc0d2
Author: Chenhao Li 
AuthorDate: Thu May 9 22:45:10 2024 +0800

[SPARK-47803][FOLLOWUP] Check nulls when casting nested type to variant

### What changes were proposed in this pull request?

This PR adds null checks when accessing a nested element while casting a nested type to variant. They are necessary because the `get` API does not guarantee returning null when the slot is null. For example, `ColumnarArray.get` may return the default value of a primitive type if the slot is null.
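
A minimal sketch of the guard applied by this fix (the helper name is illustrative; the real change lives in `VariantExpressionEvalUtils.buildVariant`):

```scala
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.types.DataType

// Null-safe element access: consult isNullAt before get, because get on a
// null slot of a primitive type may yield the type's default value (e.g. 0)
// instead of null.
def nullSafeGet(data: ArrayData, i: Int, elementType: DataType): Any =
  if (data.isNullAt(i)) null else data.get(i, elementType)
```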

### Why are the changes needed?

It is a bug fix that is necessary for the cast-to-variant expression to work correctly.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Two new unit tests. One directly uses `ColumnarArray` as the input of the 
cast. The other creates a real-world situation where `ColumnarArray` is the 
input of the cast (scan). Both of them would fail without the code change in 
this PR.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46486 from chenhao-db/fix_cast_nested_to_variant.

Authored-by: Chenhao Li 
Signed-off-by: Wenchen Fan 
---
 .../variant/VariantExpressionEvalUtils.scala   |  9 --
 .../apache/spark/sql/VariantEndToEndSuite.scala| 33 --
 2 files changed, 37 insertions(+), 5 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala
index eb235eb854e0..f7f7097173bb 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala
@@ -103,7 +103,8 @@ object VariantExpressionEvalUtils {
 val offsets = new 
java.util.ArrayList[java.lang.Integer](data.numElements())
 for (i <- 0 until data.numElements()) {
   offsets.add(builder.getWritePos - start)
-  buildVariant(builder, data.get(i, elementType), elementType)
+  val element = if (data.isNullAt(i)) null else data.get(i, 
elementType)
+  buildVariant(builder, element, elementType)
 }
 builder.finishWritingArray(start, offsets)
   case MapType(StringType, valueType, _) =>
@@ -116,7 +117,8 @@ object VariantExpressionEvalUtils {
   val key = keys.getUTF8String(i).toString
   val id = builder.addKey(key)
   fields.add(new VariantBuilder.FieldEntry(key, id, 
builder.getWritePos - start))
-  buildVariant(builder, values.get(i, valueType), valueType)
+  val value = if (values.isNullAt(i)) null else values.get(i, 
valueType)
+  buildVariant(builder, value, valueType)
 }
 builder.finishWritingObject(start, fields)
   case StructType(structFields) =>
@@ -127,7 +129,8 @@ object VariantExpressionEvalUtils {
   val key = structFields(i).name
   val id = builder.addKey(key)
   fields.add(new VariantBuilder.FieldEntry(key, id, 
builder.getWritePos - start))
-  buildVariant(builder, data.get(i, structFields(i).dataType), 
structFields(i).dataType)
+  val value = if (data.isNullAt(i)) null else data.get(i, 
structFields(i).dataType)
+  buildVariant(builder, value, structFields(i).dataType)
 }
 builder.finishWritingObject(start, fields)
 }
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala
index 3964bf3aedec..53be9d50d351 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/VariantEndToEndSuite.scala
@@ -16,11 +16,13 @@
  */
 package org.apache.spark.sql
 
-import org.apache.spark.sql.catalyst.expressions.{CreateArray, 
CreateNamedStruct, JsonToStructs, Literal, StructsToJson}
+import org.apache.spark.sql.catalyst.expressions.{Cast, CreateArray, 
CreateNamedStruct, JsonToStructs, Literal, StructsToJson}
 import org.apache.spark.sql.catalyst.expressions.variant.ParseJson
 import org.apache.spark.sql.execution.WholeStageCodegenExec
+import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
 import org.apache.spark.sql.test.SharedSparkSession
-import org.apache.spark.sql.types.VariantType

(spark) branch master updated: [SPARK-48211][SQL] DB2: Read SMALLINT as ShortType

2024-05-09 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 207d675110e6 [SPARK-48211][SQL] DB2: Read SMALLINT as ShortType
207d675110e6 is described below

commit 207d675110e6fa699a434e81296f6f050eb0304b
Author: Kent Yao 
AuthorDate: Thu May 9 17:27:04 2024 +0800

[SPARK-48211][SQL] DB2: Read SMALLINT as ShortType

### What changes were proposed in this pull request?

This PR adds support for reading DB2 SMALLINT as ShortType.
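
A hedged sketch of the effect (the JDBC URL, table and column are hypothetical):

```scala
import java.util.Properties

// Hypothetical DB2 endpoint; the table is assumed to contain a SMALLINT column.
val jdbcUrl = "jdbc:db2://localhost:50000/sampledb"
val df = spark.read.jdbc(jdbcUrl, "tbl_with_smallint", new Properties)

// After this change the SMALLINT column is read as ShortType;
// previously it surfaced as IntegerType.
df.schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))
```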

### Why are the changes needed?

- 15 bits is sufficient
- we write ShortType as SMALLINT
- we read smallint from other builtin jdbc sources as ShortType

### Does this PR introduce _any_ user-facing change?

Yes, we add a migration guide entry for this change.

### How was this patch tested?

changed tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46497 from yaooqinn/SPARK-48211.

Authored-by: Kent Yao 
Signed-off-by: Kent Yao 
---
 .../spark/sql/jdbc/DB2IntegrationSuite.scala   | 69 +-
 docs/sql-migration-guide.md|  1 +
 .../org/apache/spark/sql/internal/SQLConf.scala| 11 
 .../org/apache/spark/sql/jdbc/DB2Dialect.scala |  3 +
 4 files changed, 56 insertions(+), 28 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala
index cedb33d491fb..aca174cce194 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala
@@ -25,6 +25,7 @@ import org.scalatest.time.SpanSugar._
 
 import org.apache.spark.sql.{Row, SaveMode}
 import org.apache.spark.sql.catalyst.util.DateTimeTestUtils._
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types.{BooleanType, ByteType, ShortType, 
StructType}
 import org.apache.spark.tags.DockerTest
 
@@ -77,32 +78,44 @@ class DB2IntegrationSuite extends 
DockerJDBCIntegrationSuite {
   }
 
   test("Numeric types") {
-val df = sqlContext.read.jdbc(jdbcUrl, "numbers", new Properties)
-val rows = df.collect()
-assert(rows.length == 1)
-val types = rows(0).toSeq.map(x => x.getClass.toString)
-assert(types.length == 10)
-assert(types(0).equals("class java.lang.Integer"))
-assert(types(1).equals("class java.lang.Integer"))
-assert(types(2).equals("class java.lang.Long"))
-assert(types(3).equals("class java.math.BigDecimal"))
-assert(types(4).equals("class java.lang.Double"))
-assert(types(5).equals("class java.lang.Double"))
-assert(types(6).equals("class java.lang.Float"))
-assert(types(7).equals("class java.math.BigDecimal"))
-assert(types(8).equals("class java.math.BigDecimal"))
-assert(types(9).equals("class java.math.BigDecimal"))
-assert(rows(0).getInt(0) == 17)
-assert(rows(0).getInt(1) == 7)
-assert(rows(0).getLong(2) == 922337203685477580L)
-val bd = new BigDecimal("123456745.567890123450")
-assert(rows(0).getAs[BigDecimal](3).equals(bd))
-assert(rows(0).getDouble(4) == 42.75)
-assert(rows(0).getDouble(5) == 5.4E-70)
-assert(rows(0).getFloat(6) == 3.4028234663852886e+38)
-assert(rows(0).getDecimal(7) == new BigDecimal("4.299900"))
-assert(rows(0).getDecimal(8) == new 
BigDecimal(".00"))
-assert(rows(0).getDecimal(9) == new 
BigDecimal("1234567891234567.123456789123456789"))
+Seq(true, false).foreach { legacy =>
+  withSQLConf(SQLConf.LEGACY_DB2_TIMESTAMP_MAPPING_ENABLED.key -> 
legacy.toString) {
+val df = sqlContext.read.jdbc(jdbcUrl, "numbers", new Properties)
+val rows = df.collect()
+assert(rows.length == 1)
+val types = rows(0).toSeq.map(x => x.getClass.toString)
+assert(types.length == 10)
+if (legacy) {
+  assert(types(0).equals("class java.lang.Integer"))
+} else {
+  assert(types(0).equals("class java.lang.Short"))
+}
+assert(types(1).equals("class java.lang.Integer"))
+assert(types(2).equals("class java.lang.Long"))
+assert(types(3).equals("class java.math.BigDecimal"))
+assert(types(4).equals("class java.lang.Double"))
+assert(types(5).equals("class java.lang.Double"))
+assert(types(6).equals("class java.lang.Float"))
+assert(types(7).equals("class java.math.BigDecimal"))
+assert(types(8).equals("class java.math.BigDecimal"))
+assert(types(9).equals("class java.math.BigDecimal"))
+if 

(spark) branch master updated (ecca1bf6453e -> 9e62dbad6a75)

2024-05-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from ecca1bf6453e [SPARK-47365][PYTHON] Add toArrow() DataFrame method to 
PySpark
 add 9e62dbad6a75 [SPARK-48212][PYTHON][CONNECT][TESTS] Fully enable 
`PandasUDFParityTests.test_udf_wrong_arg`

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/tests/connect/test_parity_pandas_udf.py | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)





(spark) branch master updated: [SPARK-47365][PYTHON] Add toArrow() DataFrame method to PySpark

2024-05-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ecca1bf6453e [SPARK-47365][PYTHON] Add toArrow() DataFrame method to 
PySpark
ecca1bf6453e is described below

commit ecca1bf6453e5e0042e1b56d4c35fb0b4d0f3121
Author: Ian Cook 
AuthorDate: Thu May 9 17:25:34 2024 +0900

[SPARK-47365][PYTHON] Add toArrow() DataFrame method to PySpark

### What changes were proposed in this pull request?
- Add a PySpark DataFrame method `toArrow()` which returns the contents of 
the DataFrame as a [PyArrow 
Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html), for 
both local Spark and Spark Connect.
- Add a new entry to the **Apache Arrow in PySpark** user guide page 
describing usage of the `toArrow()` method.
- Add  a new option to the method `_collect_as_arrow()` to provide more 
useful output when there are zero records returned. (This keeps the 
implementation of `toArrow()` simpler.)

### Why are the changes needed?
In the Apache Arrow community, we hear from a lot of users who want to 
return the contents of a PySpark DataFrame as a PyArrow Table. Currently the 
only documented way to do this is to return the contents as a pandas DataFrame, 
then use PyArrow (`pa`) to convert that to a PyArrow Table.
```py
pa.Table.from_pandas(df.toPandas())
```
But going through pandas adds significant overhead, which is easily avoided: internally, `toPandas()` already converts the contents of the Spark DataFrame to Arrow format as an intermediate step when `spark.sql.execution.arrow.pyspark.enabled` is `true`.

Currently it is also possible to use the experimental `_collect_as_arrow()` 
method to return the contents of a PySpark DataFrame as a list of PyArrow 
RecordBatches. This PR adds a new non-experimental method `toArrow()` which 
returns the more user-friendly PyArrow Table object.

This PR also adds a new argument `empty_list_if_zero_records` to the experimental method `_collect_as_arrow()` to control what the method returns when the result data has zero rows. If set to `True` (the default), the existing behavior is preserved and the method returns an empty Python list. If set to `False`, the method returns a length-one list containing an empty Arrow RecordBatch which includes the schema. This is used by `toArrow()` which requires the schema [...]

For Spark Connect, there is already a `SparkSession.client.to_table()` 
method that returns a PyArrow table. This PR uses that to expose `toArrow()` 
for Spark Connect.

### Does this PR introduce _any_ user-facing change?

- It adds a DataFrame method `toArrow()` to the PySpark SQL DataFrame API.
- It adds a new argument `empty_list_if_zero_records` to the experimental 
DataFrame method `_collect_as_arrow()` with a default value which preserves the 
method's existing behavior.
- It exposes `toArrow()` for Spark Connect, via the existing 
`SparkSession.client.to_table()` method.
- It does not introduce any other user-facing changes.

### How was this patch tested?
This adds a new test and a new helper function for the test in 
`pyspark/sql/tests/test_arrow.py`.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45481 from ianmcook/SPARK-47365.

Lead-authored-by: Ian Cook 
Co-authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 examples/src/main/python/sql/arrow.py  | 18 
 .../source/reference/pyspark.sql/dataframe.rst |  1 +
 python/docs/source/user_guide/sql/arrow_pandas.rst | 49 +++---
 python/pyspark/sql/classic/dataframe.py|  4 ++
 python/pyspark/sql/connect/dataframe.py|  4 ++
 python/pyspark/sql/dataframe.py| 30 +
 python/pyspark/sql/pandas/conversion.py| 48 +++--
 python/pyspark/sql/tests/test_arrow.py | 35 
 8 files changed, 169 insertions(+), 20 deletions(-)

diff --git a/examples/src/main/python/sql/arrow.py 
b/examples/src/main/python/sql/arrow.py
index 03daf18eadbf..48aee48d929c 100644
--- a/examples/src/main/python/sql/arrow.py
+++ b/examples/src/main/python/sql/arrow.py
@@ -33,6 +33,22 @@ require_minimum_pandas_version()
 require_minimum_pyarrow_version()
 
 
+def dataframe_to_arrow_table_example(spark: SparkSession) -> None:
+import pyarrow as pa  # noqa: F401
+from pyspark.sql.functions import rand
+
+# Create a Spark DataFrame
+df = spark.range(100).drop("id").withColumns({"0": rand(), "1": rand(), 
"2": rand()})
+
+# Convert the Spark DataFrame to a PyArrow Table
+table = df.select("*").toArrow()
+
+print(table.schema)
+# 0: double not null
+   

(spark) branch master updated (34ee0d8414b2 -> 027327d94b34)

2024-05-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 34ee0d8414b2 [SPARK-47421][SQL] Add collation support for URL 
expressions
 add 027327d94b34 [SPARK-47986][CONNECT][PYTHON] Unable to create a new 
session when the default session is closed by the server

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/client/core.py  |  1 +
 python/pyspark/sql/connect/session.py  | 20 
 .../sql/tests/connect/test_connect_session.py  | 16 -
 python/pyspark/sql/tests/connect/test_session.py   | 28 ++
 4 files changed, 59 insertions(+), 6 deletions(-)





(spark) branch master updated (045ec6a166c8 -> 34ee0d8414b2)

2024-05-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 045ec6a166c8 [SPARK-48208][SS] Skip providing memory usage metrics 
from RocksDB if bounded memory usage is enabled
 add 34ee0d8414b2 [SPARK-47421][SQL] Add collation support for URL 
expressions

No new revisions were added by this update.

Summary of changes:
 .../explain-results/function_url_decode.explain|   2 +-
 .../explain-results/function_url_encode.explain|   2 +-
 .../sql/catalyst/expressions/urlExpressions.scala  |  19 ++--
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 103 +
 4 files changed, 115 insertions(+), 11 deletions(-)





(spark) branch master updated: [SPARK-48208][SS] Skip providing memory usage metrics from RocksDB if bounded memory usage is enabled

2024-05-09 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 045ec6a166c8 [SPARK-48208][SS] Skip providing memory usage metrics 
from RocksDB if bounded memory usage is enabled
045ec6a166c8 is described below

commit 045ec6a166c8d2bdf73585fc4160c136e5f2888a
Author: Anish Shrigondekar 
AuthorDate: Thu May 9 17:10:01 2024 +0900

[SPARK-48208][SS] Skip providing memory usage metrics from RocksDB if 
bounded memory usage is enabled

### What changes were proposed in this pull request?
Skip providing memory usage metrics from RocksDB if bounded memory usage is 
enabled

### Why are the changes needed?
Without this change, we report a memory usage figure that is effectively the maximum usage per node, but it is surfaced at a partition level.
For example, if we report this:
```
"allRemovalsTimeMs" : 93,
"commitTimeMs" : 32240,
"memoryUsedBytes" : 15956211724278,
"numRowsDroppedByWatermark" : 0,
"numShufflePartitions" : 200,
"numStateStoreInstances" : 200,
```

We have 200 partitions in this case, so the reported memory usage per partition / state store would be ~78GB. However, each node has 256GB of memory in total and we have 2 such nodes. We have configured our cluster to let RocksDB use 30% of the available memory on each node, which is ~77GB. So the memory being reported here is actually per node rather than per partition, which could be confusing for users; the sketch below works through the arithmetic.
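
A quick back-of-the-envelope sketch of the numbers (the byte count and partition count come from the metrics above; the 256GB and 30% figures from this description):

```scala
// Reported metric and partition count from the example above.
val memoryUsedBytes = 15956211724278L
val numStateStoreInstances = 200

// Naive per-partition figure: ~79.8 GB per state store...
val perStoreGB = memoryUsedBytes.toDouble / numStateStoreInstances / 1e9

// ...which is roughly the whole per-node RocksDB budget (30% of 256 GB ~= 76.8 GB),
// showing the reported value is node-scoped rather than partition-scoped.
val nodeBudgetGB = 256 * 0.30
```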

### Does this PR introduce _any_ user-facing change?
No - only a metrics reporting change

### How was this patch tested?
Added unit tests

```
[info] Run completed in 10 seconds, 878 milliseconds.
[info] Total number of tests run: 24
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 24, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46491 from anishshri-db/task/SPARK-48208.

Authored-by: Anish Shrigondekar 
Signed-off-by: Jungtaek Lim 
---
 .../apache/spark/sql/execution/streaming/state/RocksDB.scala  | 11 ++-
 .../spark/sql/execution/streaming/state/RocksDBSuite.scala| 11 +++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala
index caecf817c12f..151695192281 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala
@@ -777,10 +777,19 @@ class RocksDB(
   .keys.filter(checkInternalColumnFamilies(_)).size
 val numExternalColFamilies = colFamilyNameToHandleMap.keys.size - 
numInternalColFamilies
 
+// if bounded memory usage is enabled, we share the block cache across all 
state providers
+// running on the same node and account the usage to this single cache. In 
this case, its not
+// possible to provide partition level or query level memory usage.
+val memoryUsage = if (conf.boundedMemoryUsage) {
+  0L
+} else {
+  readerMemUsage + memTableMemUsage + blockCacheUsage
+}
+
 RocksDBMetrics(
   numKeysOnLoadedVersion,
   numKeysOnWritingVersion,
-  readerMemUsage + memTableMemUsage + blockCacheUsage,
+  memoryUsage,
   pinnedBlocksMemUsage,
   totalSSTFilesBytes,
   nativeOpsLatencyMicros,
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala
index ab2afa1b8a61..6086fd43846f 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala
@@ -1699,6 +1699,11 @@ class RocksDBSuite extends 
AlsoTestWithChangelogCheckpointingEnabled with Shared
   db.load(0)
   db.put("a", "1")
   db.commit()
+  if (boundedMemoryUsage == "true") {
+assert(db.metricsOpt.get.totalMemUsageBytes === 0)
+  } else {
+assert(db.metricsOpt.get.totalMemUsageBytes > 0)
+  }
   db.getWriteBufferManagerAndCache()
 }
 
@@ -1709,6 +1714,11 @@ class RocksDBSuite extends 
AlsoTestWithChangelogCheckpointingEnabled with Shared
   db.load(0)
   db.put("a", "1")
   db.commit()
+  if (boundedMemoryUsage == "true") {
+assert(db.metricsOpt.get.totalMemUsageBytes === 0)
+  } else {
+   

(spark) branch master updated (a4ab82b8f340 -> 91da4ac25148)

2024-05-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from a4ab82b8f340 [SPARK-48186][SQL] Add support for AbstractMapType
 add 91da4ac25148 [SPARK-47354][SQL] Add collation support for variant 
expressions

No new revisions were added by this update.

Summary of changes:
 .../function_is_variant_null.explain   |   2 +-
 .../explain-results/function_parse_json.explain|   2 +-
 .../function_schema_of_variant.explain |   2 +-
 .../function_schema_of_variant_agg.explain |   2 +-
 .../function_try_parse_json.explain|   2 +-
 .../function_try_variant_get.explain   |   2 +-
 .../explain-results/function_variant_get.explain   |   2 +-
 .../expressions/variant/variantExpressions.scala   |  23 +-
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 293 -
 9 files changed, 312 insertions(+), 18 deletions(-)





(spark) branch master updated (6cc3dc2ef4d2 -> a4ab82b8f340)

2024-05-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 6cc3dc2ef4d2 [SPARK-48169][SPARK-48143][SQL] Revert BadRecordException 
optimizations
 add a4ab82b8f340 [SPARK-48186][SQL] Add support for AbstractMapType

No new revisions were added by this update.

Summary of changes:
 ...stractArrayType.scala => AbstractMapType.scala} | 20 -
 .../spark/sql/catalyst/expressions/ExprUtils.scala |  9 +++---
 .../sql/catalyst/expressions/jsonExpressions.scala |  5 ++--
 .../spark/sql/CollationSQLExpressionsSuite.scala   | 34 ++
 4 files changed, 55 insertions(+), 13 deletions(-)
 copy 
sql/api/src/main/scala/org/apache/spark/sql/internal/types/{AbstractArrayType.scala
 => AbstractMapType.scala} (58%)





(spark) branch master updated: [SPARK-48169][SPARK-48143][SQL] Revert BadRecordException optimizations

2024-05-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6cc3dc2ef4d2 [SPARK-48169][SPARK-48143][SQL] Revert BadRecordException 
optimizations
6cc3dc2ef4d2 is described below

commit 6cc3dc2ef4d2ffbff7ffc400e723b97b462e1bab
Author: Vladimir Golubev 
AuthorDate: Thu May 9 15:35:28 2024 +0800

[SPARK-48169][SPARK-48143][SQL] Revert BadRecordException optimizations

### What changes were proposed in this pull request?
Revert BadRecordException optimizations for UnivocityParser, StaxXmlParser 
and JacksonParser

### Why are the changes needed?
To reduce the blast radius. The optimization will be re-implemented differently later. There were two recent PRs by me:
- https://github.com/apache/spark/pull/46438
- https://github.com/apache/spark/pull/46400

which introduced optimizations to speed up control flow between UnivocityParser, StaxXmlParser and JacksonParser. However, these changes are quite unstable and may break calling code that relies on the concrete type of the exception cause. There may also be Spark plugins/extensions that use that exception for user-facing errors; a hedged sketch of such a caller follows.
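
A hedged sketch of the kind of caller the revert protects (the handler and `parseLine` are hypothetical; only the fact that `BadRecordException` carries a plain `Throwable` cause is taken from the code below):

```scala
import org.apache.spark.SparkUpgradeException
import org.apache.spark.sql.catalyst.util.BadRecordException

// Hypothetical caller that branches on the concrete type of the exception
// cause. If the cause were wrapped in a () => Throwable, as in the reverted
// optimization, a check like this would no longer see the original type.
def handle(parseLine: () => Unit): Unit =
  try {
    parseLine()
  } catch {
    case e: BadRecordException => e.cause match {
      case _: SparkUpgradeException => throw e // surface upgrade errors as-is
      case other => println(s"Skipping malformed record: ${other.getMessage}")
    }
  }
```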

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46478 from vladimirg-db/vladimirg-db/revert-SPARK-48169-SPARK-48143.

Authored-by: Vladimir Golubev 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/csv/UnivocityParser.scala   |  8 
 .../spark/sql/catalyst/json/JacksonParser.scala| 13 ++--
 .../sql/catalyst/util/BadRecordException.scala | 13 +++-
 .../sql/catalyst/util/FailureSafeParser.scala  |  2 +-
 .../spark/sql/catalyst/xml/StaxXmlParser.scala | 23 +++---
 5 files changed, 26 insertions(+), 33 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
index 8d06789a7512..a5158d8a22c6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
@@ -316,17 +316,17 @@ class UnivocityParser(
   throw BadRecordException(
 () => getCurrentInput,
 () => Array.empty,
-() => QueryExecutionErrors.malformedCSVRecordError(""))
+QueryExecutionErrors.malformedCSVRecordError(""))
 }
 
 val currentInput = getCurrentInput
 
-var badRecordException: Option[() => Throwable] = if (tokens.length != 
parsedSchema.length) {
+var badRecordException: Option[Throwable] = if (tokens.length != 
parsedSchema.length) {
   // If the number of tokens doesn't match the schema, we should treat it 
as a malformed record.
   // However, we still have chance to parse some of the tokens. It 
continues to parses the
   // tokens normally and sets null when `ArrayIndexOutOfBoundsException` 
occurs for missing
   // tokens.
-  Some(() => 
QueryExecutionErrors.malformedCSVRecordError(currentInput.toString))
+  Some(QueryExecutionErrors.malformedCSVRecordError(currentInput.toString))
 } else None
 // When the length of the returned tokens is identical to the length of 
the parsed schema,
 // we just need to:
@@ -348,7 +348,7 @@ class UnivocityParser(
   } catch {
 case e: SparkUpgradeException => throw e
 case NonFatal(e) =>
-  badRecordException = badRecordException.orElse(Some(() => e))
+  badRecordException = badRecordException.orElse(Some(e))
   // Use the corresponding DEFAULT value associated with the column, 
if any.
   row.update(i, 
ResolveDefaultColumns.existenceDefaultValues(requiredSchema)(i))
   }
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
index 848c20ee36be..5e75ff6f6e1a 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
@@ -613,7 +613,7 @@ class JacksonParser(
 // JSON parser currently doesn't support partial results for corrupted 
records.
 // For such records, all fields other than the field configured by
 // `columnNameOfCorruptRecord` are set to `null`.
-throw BadRecordException(() => recordLiteral(record), cause = () => e)
+throw BadRecordException(() => recordLiteral(record), () => 
Array.empty, e)
   case e: CharConversionException if options.encoding.isEmpty =>