(spark) branch master updated: [SPARK-47701][SQL][TESTS] Postgres: Add test for Composite and Range types

2024-04-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9a20794b252d [SPARK-47701][SQL][TESTS] Postgres: Add test for 
Composite and Range types
9a20794b252d is described below

commit 9a20794b252d207b6864f656e7fab85007911537
Author: Kent Yao 
AuthorDate: Tue Apr 2 22:43:05 2024 -0700

[SPARK-47701][SQL][TESTS] Postgres: Add test for Composite and Range types

### What changes were proposed in this pull request?

Add tests for Composite and Range types of postgres.
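For context, a minimal sketch of how these PostgreSQL types surface through Spark's JDBC data source, mirroring the tests added below (the connection URL is a placeholder and the PostgreSQL JDBC driver is assumed to be on the classpath):

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
// Placeholder connection string; any reachable PostgreSQL instance works.
val jdbcUrl = "jdbc:postgresql://localhost:5432/postgres?user=postgres"

// A composite value is read back in its PostgreSQL text form, e.g. "(t,1)".
val whole = spark.read.jdbc(jdbcUrl, "complex_table", new Properties)

// Individual fields can be projected on the PostgreSQL side with (col).field syntax.
val fields = spark.read.format("jdbc")
  .option("url", jdbcUrl)
  .option("query", "SELECT (c1).b, (c1).d FROM complex_table")
  .load()

// Range values also arrive as strings, e.g. "[3,7)".
val range = spark.read.format("jdbc")
  .option("url", jdbcUrl)
  .option("query", "SELECT '[3,7)'::int4range")
  .load()
```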

### Why are the changes needed?

test improvements

### Does this PR introduce _any_ user-facing change?

no, test-only

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #45827 from yaooqinn/SPARK-47701.

Authored-by: Kent Yao 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/jdbc/PostgresIntegrationSuite.scala  | 26 ++
 1 file changed, 26 insertions(+)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala
index 9015434cedd0..f70bd8091204 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala
@@ -178,6 +178,15 @@ class PostgresIntegrationSuite extends DockerJDBCIntegrationSuite {
     conn.prepareStatement("CREATE TABLE test_bit_array (c1 bit(1)[], c2 bit(5)[])").executeUpdate()
     conn.prepareStatement("INSERT INTO test_bit_array VALUES (ARRAY[B'1', B'0'], " +
       "ARRAY[B'1', B'00010'])").executeUpdate()
+
+    conn.prepareStatement(
+      """
+        |CREATE TYPE complex AS (
+        |b   bool,
+        |d   double precision
+        |)""".stripMargin).executeUpdate()
+    conn.prepareStatement("CREATE TABLE complex_table (c1 complex)").executeUpdate()
+    conn.prepareStatement("INSERT INTO complex_table VALUES (ROW(true, 1.0))").executeUpdate()
   }
 
   test("Type mapping for various types") {
@@ -516,4 +525,21 @@ class PostgresIntegrationSuite extends DockerJDBCIntegrationSuite {
       },
       errorClass = null)
   }
+
+  test("SPARK-47701: Reading complex type") {
+    val df = spark.read.jdbc(jdbcUrl, "complex_table", new Properties)
+    checkAnswer(df, Row("(t,1)"))
+    val df2 = spark.read.format("jdbc")
+      .option("url", jdbcUrl)
+      .option("query", "SELECT (c1).b, (c1).d FROM complex_table").load()
+    checkAnswer(df2, Row(true, 1.0d))
+  }
+
+  test("SPARK-47701: Range Types") {
+    val df = spark.read.format("jdbc")
+      .option("url", jdbcUrl)
+      .option("query", "SELECT '[3,7)'::int4range")
+      .load()
+    checkAnswer(df, Row("[3,7)"))
+  }
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-45733][PYTHON][TESTS][FOLLOWUP] Skip `pyspark.sql.tests.connect.client.test_client` if not should_test_connect

2024-04-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 360a3f9023d0 [SPARK-45733][PYTHON][TESTS][FOLLOWUP] Skip 
`pyspark.sql.tests.connect.client.test_client` if not should_test_connect
360a3f9023d0 is described below

commit 360a3f9023d08812e3f3c44af9cdac644c5d67b2
Author: Dongjoon Hyun 
AuthorDate: Tue Apr 2 22:30:08 2024 -0700

[SPARK-45733][PYTHON][TESTS][FOLLOWUP] Skip 
`pyspark.sql.tests.connect.client.test_client` if not should_test_connect

### What changes were proposed in this pull request?

This is a follow-up of the following.
- https://github.com/apache/spark/pull/43591

### Why are the changes needed?

This test requires `pandas` which is an optional dependency in Apache Spark.

```
$ python/run-tests --modules=pyspark-connect --parallelism=1 
--python-executables=python3.10  --testnames 
'pyspark.sql.tests.connect.client.test_client'
Running PySpark tests. Output is in 
/Users/dongjoon/APACHE/spark-merge/python/unit-tests.log
Will test against the following Python executables: ['python3.10']
Will test the following Python tests: 
['pyspark.sql.tests.connect.client.test_client']
python3.10 python_implementation is CPython
python3.10 version is: Python 3.10.13
Starting test(python3.10): pyspark.sql.tests.connect.client.test_client 
(temp output: 
/Users/dongjoon/APACHE/spark-merge/python/target/216a8716-3a1f-4cf9-9c7c-63087f29f892/python3.10__pyspark.sql.tests.connect.client.test_client__tydue4ck.log)
Traceback (most recent call last):
  File "/Users/dongjoon/.pyenv/versions/3.10.13/lib/python3.10/runpy.py", 
line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
  File "/Users/dongjoon/.pyenv/versions/3.10.13/lib/python3.10/runpy.py", 
line 86, in _run_code
exec(code, run_globals)
  File 
"/Users/dongjoon/APACHE/spark-merge/python/pyspark/sql/tests/connect/client/test_client.py",
 line 137, in 
class TestPolicy(DefaultPolicy):
NameError: name 'DefaultPolicy' is not defined
```

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Pass the CIs and manually test without `pandas`.
```
$ pip3 uninstall pandas
$ python/run-tests --modules=pyspark-connect --parallelism=1 
--python-executables=python3.10  --testnames 
'pyspark.sql.tests.connect.client.test_client'
Running PySpark tests. Output is in 
/Users/dongjoon/APACHE/spark-merge/python/unit-tests.log
Will test against the following Python executables: ['python3.10']
Will test the following Python tests: 
['pyspark.sql.tests.connect.client.test_client']
python3.10 python_implementation is CPython
python3.10 version is: Python 3.10.13
Starting test(python3.10): pyspark.sql.tests.connect.client.test_client 
(temp output: 
/Users/dongjoon/APACHE/spark-merge/python/target/acf07ed5-938a-4272-87e1-47e3bf8b988e/python3.10__pyspark.sql.tests.connect.client.test_client__sfdosnek.log)
Finished test(python3.10): pyspark.sql.tests.connect.client.test_client 
(0s) ... 13 tests were skipped
Tests passed in 0 seconds

Skipped tests in pyspark.sql.tests.connect.client.test_client with 
python3.10:
  test_basic_flow 
(pyspark.sql.tests.connect.client.test_client.SparkConnectClientReattachTestCase)
 ... skip (0.002s)
  test_fail_and_retry_during_execute 
(pyspark.sql.tests.connect.client.test_client.SparkConnectClientReattachTestCase)
 ... skip (0.000s)
  test_fail_and_retry_during_reattach 
(pyspark.sql.tests.connect.client.test_client.SparkConnectClientReattachTestCase)
 ... skip (0.000s)
  test_fail_during_execute 
(pyspark.sql.tests.connect.client.test_client.SparkConnectClientReattachTestCase)
 ... skip (0.000s)
  test_channel_builder 
(pyspark.sql.tests.connect.client.test_client.SparkConnectClientTestCase) ... 
skip (0.000s)
  test_channel_builder_with_session 
(pyspark.sql.tests.connect.client.test_client.SparkConnectClientTestCase) ... 
skip (0.000s)
  test_interrupt_all 
(pyspark.sql.tests.connect.client.test_client.SparkConnectClientTestCase) ... 
skip (0.000s)
  test_is_closed 
(pyspark.sql.tests.connect.client.test_client.SparkConnectClientTestCase) ... 
skip (0.000s)
  test_properties 
(pyspark.sql.tests.connect.client.test_client.SparkConnectClientTestCase) ... 
skip (0.000s)
  test_retry 
(pyspark.sql.tests.connect.client.test_client.SparkConnectClientTestCase) ... 
skip (0.000s)
  test_retry_client_unit 
(pyspark.sql.tests.connect.client.test_client.SparkConnectClientTestCase) ... 
skip (0.000s)
  test_user_agent_default 
(pyspark.sql.tests.connect.client.test_client.SparkConnectClientTestCase) ... 
skip 

Re: [PR] Add Spark Connect page [spark-website]

2024-04-02 Thread via GitHub


dongjoon-hyun commented on code in PR #511:
URL: https://github.com/apache/spark-website/pull/511#discussion_r1548892591


##
spark-connect/index.md:
##
@@ -0,0 +1,151 @@
+---
+layout: global
+type: "page singular"
+title: Spark Connect
+description: Spark Connect makes remote Spark development easier.
+subproject: Spark Connect
+---
+
+This post explains the Spark Connect architecture, the benefits of Spark 
Connect, and how to upgrade to Spark Connect.
+
+Let’s start by exploring the architecture of Spark Connect at a high level.
+
+## High-level Spark Connect architecture
+
+Spark Connect is a protocol that specifies how a client application can 
communicate with a remote Spark Server.  Clients that implement the Spark 
Connect protocol can connect and make requests to remote Spark Servers, very 
similar to how client applications can connect to databases using a JDBC driver 
- a query `spark.table("some_table").limit(5)` should simply return the result. 
 This architecture gives end users a great developer experience.
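As a rough illustration of this flow, here is a minimal Scala sketch (an assumption-laden example, not part of the quoted page: it presumes Spark 3.5's Connect Scala client, spark-connect-client-jvm, on the classpath and a Connect server listening on the default sc://localhost:15002 endpoint):

```scala
import org.apache.spark.sql.SparkSession

// Connect to a remote Spark Connect server instead of starting a local JVM driver.
val spark = SparkSession.builder()
  .remote("sc://localhost:15002")   // endpoint is an assumption
  .getOrCreate()

// The query below is only an unresolved logical plan on the client;
// it is encoded, sent over gRPC, and executed on the server when show() runs.
val df = spark.table("some_table").limit(5)
df.show()
```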
+
+Here’s how Spark Connect works at a high level:
+
+1. A connection is established between the Client and Spark Server
+2. The Client converts a DataFrame query to an unresolved logical plan
+3. The unresolved logical plan is encoded and sent to the Spark Server
+4. The Spark Server runs the query
+5. The Spark Server sends the results back to the Client
+
+
+
+Let’s go through these steps in more detail to get a better understanding of 
the inner workings of Spark Connect.
+
+**Establishing a connection between the Client and Spark Server**
+
+The network communication for Spark Connect uses the [gRPC 
framework](https://grpc.io/).
+
+gRPC is performant and language agnostic.  Spark Connect uses 
language-agnostic technologies, so it’s portable.
+
+**Converting a DataFrame query to an unresolved logical plan**
+
+The Client parses DataFrame queries and converts them to unresolved logical 
plans.
+
+Suppose you have the following DataFrame query: 
`spark.table("some_table").limit(5)`.
+
+Here’s the unresolved logical plan for the query: 
+
+```
+== Parsed Logical Plan ==
+GlobalLimit 5
++- LocalLimit 5
+   +- SubqueryAlias spark_catalog.default.some_table
+  +- Relation spark_catalog.default.some_table[character#15,franchise#16] 
parquet
+```
+
+The Client is responsible for creating the unresolved logical plan and passing 
it to the Spark Server for execution.
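To inspect that plan from the client side, a small sketch (assuming the session from the earlier snippet; `explain(true)` prints the parsed logical plan shown above together with the analyzed, optimized, and physical plans):

```scala
// Build the query lazily, then ask Spark to print its plans.
val df = spark.table("some_table").limit(5)
df.explain(true)   // includes the "== Parsed Logical Plan ==" section
```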
+
+**Sending the unresolved logical plan to the Spark Server**
+
+The unresolved logical plan must be serialized so it can be sent over a 
network.  Spark Connect uses Protocol Buffers, which are “language-neutral, 
platform-neutral extensible mechanisms for serializing structured data”.
+
+The Client and the Spark Server must be able to communicate with a 
language-neutral format like Protocol Buffers because they might be using 
different programming languages or different software versions.
+
+Now let’s look at how the Spark Server executes the query. 
+
+**Executing the query on the Spark Server**
+
+The Spark Server receives the unresolved logical plan (once the Protocol 
Buffer is deserialized) and executes it just like any other query.
+
+Spark performs many optimizations to an unresolved logical plan before 
executing the query.  All of these optimizations happen on the Spark Server.
+
+Spark Connect lets you leverage Spark’s powerful query optimization 
capabilities, even with Clients that don’t depend on Spark or the JVM.
+
+**Sending the results back to the Client**
+
+The Spark Server sends the results back to the Client after executing the 
query.
+
+The results are sent to the client as Apache Arrow record batches.  A record 
batch includes many rows of data.
+
+The record batch is streamed to the client, which means it is sent in partial 
chunks, not all at once.  Streaming the results from the Spark Server to the 
Client prevents memory issues caused by an excessively large request.
+
+Here’s a recap of how Spark Connect works in image form:
+
+
+
+## Benefits of Spark Connect
+
+Let’s now turn our attention to the benefits of the Spark Connect architecture.
+
+**Spark Connect workloads are easier to maintain**
+
+With the Spark JVM architecture, the client and Spark Driver must run 
identical software versions.  They need the same Java, Scala, and other 
dependency versions.  Suppose you develop a Spark project on your local 
machine, package it as a JAR file, and deploy it in the cloud to run on a 
production dataset.  You need to build the JAR file on your local machine with 
the same dependencies used in the cloud.  If you compile the JAR file with 
Scala 2.13, you must also provision the cluster with a Spark JAR compiled with 
Scala 2.13.
+
+Suppose you are building your JAR with Scala 2.12, and your cloud provider 
releases a new runtime built with Scala 2.13.  For Spark JVM, you need to 
update your project locally, which may be challenging.  For example, when you 
update your project 

Re: [PR] Add Spark Connect page [spark-website]

2024-04-02 Thread via GitHub


dongjoon-hyun commented on code in PR #511:
URL: https://github.com/apache/spark-website/pull/511#discussion_r1548892050


##
spark-connect/index.md:
##

(spark) branch master updated (2c2a2adc3275 -> 344f640b2f35)

2024-04-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 2c2a2adc3275 [SPARK-47655][SS] Integrate timer with Initial State 
handling for state-v2
 add 344f640b2f35 [SPARK-47454][PYTHON][TESTS][FOLLOWUP] Skip 
`test_create_dataframe_from_pandas_with_day_time_interval` if pandas is not 
available

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/tests/test_creation.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (49b7b6b9fe6b -> 2c2a2adc3275)

2024-04-02 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 49b7b6b9fe6b [SPARK-47691][SQL] Postgres: Support multi dimensional 
array on the write side
 add 2c2a2adc3275 [SPARK-47655][SS] Integrate timer with Initial State 
handling for state-v2

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/streaming/StatefulProcessor.scala|   7 +-
 .../streaming/TransformWithStateExec.scala |   3 +-
 .../TransformWithStateInitialStateSuite.scala  | 124 -
 .../sql/streaming/TransformWithStateSuite.scala|  60 ++
 4 files changed, 164 insertions(+), 30 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (55b5ff6f45fd -> 49b7b6b9fe6b)

2024-04-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 55b5ff6f45fd [SPARK-47669][SQL][CONNECT][PYTHON] Add `Column.try_cast`
 add 49b7b6b9fe6b [SPARK-47691][SQL] Postgres: Support multi dimensional 
array on the write side

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/jdbc/PostgresIntegrationSuite.scala  | 23 ++
 .../sql/execution/datasources/jdbc/JdbcUtils.scala | 23 +-
 .../apache/spark/sql/jdbc/PostgresDialect.scala|  2 +-
 3 files changed, 42 insertions(+), 6 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



Re: [PR] Add Spark Connect page [spark-website]

2024-04-02 Thread via GitHub


MrPowers commented on PR #511:
URL: https://github.com/apache/spark-website/pull/511#issuecomment-2033318574

   Thank you for the review @dongjoon-hyun.  This pull request was merged, so I 
will address your comments in another PR.  I really appreciate your feedback.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47669][SQL][CONNECT][PYTHON] Add `Column.try_cast`

2024-04-02 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 55b5ff6f45fd [SPARK-47669][SQL][CONNECT][PYTHON] Add `Column.try_cast`
55b5ff6f45fd is described below

commit 55b5ff6f45fde0048cfd4d8d1a41d6e7f65fd121
Author: Ruifeng Zheng 
AuthorDate: Wed Apr 3 08:15:08 2024 +0800

[SPARK-47669][SQL][CONNECT][PYTHON] Add `Column.try_cast`

### What changes were proposed in this pull request?
Add `try_cast` function in Column APIs

### Why are the changes needed?
for functionality parity

### Does this PR introduce _any_ user-facing change?
yes

```
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame(
...  [(2, "123"), (5, "Bob"), (3, None)], ["age", "name"])
>>> df.select(df.name.try_cast("double")).show()
+-----+
| name|
+-----+
|123.0|
| NULL|
| NULL|
+-----+

```
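For reference, a hedged Scala-side sketch of the same API, using the `Column.try_cast` overloads this commit adds (the local-mode session is only for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((2, "123"), (5, "Bob"), (3, null)).toDF("age", "name")

// try_cast returns NULL instead of failing when the cast cannot succeed.
df.select($"name".try_cast("double")).show()
```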

### How was this patch tested?
new tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #45796 from zhengruifeng/connect_try_cast.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../main/scala/org/apache/spark/sql/Column.scala   |  35 
 .../apache/spark/sql/PlanGenerationTestSuite.scala |   4 +
 .../main/protobuf/spark/connect/expressions.proto  |  10 +++
 .../explain-results/column_try_cast.explain|   2 +
 .../query-tests/queries/column_try_cast.json   |  29 +++
 .../query-tests/queries/column_try_cast.proto.bin  | Bin 0 -> 173 bytes
 .../sql/connect/planner/SparkConnectPlanner.scala  |  18 ++--
 .../docs/source/reference/pyspark.sql/column.rst   |   1 +
 python/pyspark/sql/column.py   |  62 +
 python/pyspark/sql/connect/column.py   |  17 
 python/pyspark/sql/connect/expressions.py  |  14 +++
 .../pyspark/sql/connect/proto/expressions_pb2.py   |  96 +++--
 .../pyspark/sql/connect/proto/expressions_pb2.pyi  |  28 ++
 .../main/scala/org/apache/spark/sql/Column.scala   |  37 
 14 files changed, 301 insertions(+), 52 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala
index 4cb99541ccf0..dec699f4f1a8 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala
@@ -1090,6 +1090,41 @@ class Column private[sql] (@DeveloperApi val expr: 
proto.Expression) extends Log
*/
   def cast(to: String): Column = cast(DataTypeParser.parseDataType(to))
 
+  /**
+   * Casts the column to a different data type and the result is null on 
failure.
+   * {{{
+   *   // Casts colA to IntegerType.
+   *   import org.apache.spark.sql.types.IntegerType
+   *   df.select(df("colA").try_cast(IntegerType))
+   *
+   *   // equivalent to
+   *   df.select(df("colA").try_cast("int"))
+   * }}}
+   *
+   * @group expr_ops
+   * @since 4.0.0
+   */
+  def try_cast(to: DataType): Column = Column { builder =>
+builder.getCastBuilder
+  .setExpr(expr)
+  .setType(DataTypeProtoConverter.toConnectProtoType(to))
+  .setEvalMode(proto.Expression.Cast.EvalMode.EVAL_MODE_TRY)
+  }
+
+  /**
+   * Casts the column to a different data type and the result is null on 
failure.
+   * {{{
+   *   // Casts colA to integer.
+   *   df.select(df("colA").try_cast("int"))
+   * }}}
+   *
+   * @group expr_ops
+   * @since 4.0.0
+   */
+  def try_cast(to: String): Column = {
+try_cast(DataTypeParser.parseDataType(to))
+  }
+
   /**
* Returns a sort expression based on the descending order of the column.
* {{{
diff --git 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
index 46789057ed3c..5fde8b04735b 100644
--- 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
+++ 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
@@ -871,6 +871,10 @@ class PlanGenerationTestSuite
 fn.col("a").cast("long")
   }
 
+  columnTest("try_cast") {
+fn.col("a").try_cast("long")
+  }
+
   orderColumnTest("desc") {
 fn.col("b").desc
   }
diff --git 
a/connector/connect/common/src/main/protobuf/spark/connect/expressions.proto 
b/connector/connect/common/src/main/protobuf/spark/connect/expressions.proto
index c636bf68..726ae5dd1c21 100644
--- 

Re: [PR] Add Spark Connect page [spark-website]

2024-04-02 Thread via GitHub


MrPowers commented on code in PR #511:
URL: https://github.com/apache/spark-website/pull/511#discussion_r1548753924


##
spark-connect/index.md:
##
@@ -0,0 +1,151 @@
+---
+layout: global
+type: "page singular"
+title: Spark Connect
+description: Spark Connect makes remote Spark development easier.
+subproject: Spark Connect
+---
+
+This post explains the Spark Connect architecture, the benefits of Spark 
Connect, and how to upgrade to Spark Connect.

Review Comment:
   No, I didn't copy this from anywhere, but happy to reword it.  Any 
suggestions?  Perhaps "This page" would be better.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



Re: [PR] Add Spark Connect page [spark-website]

2024-04-02 Thread via GitHub


dongjoon-hyun commented on code in PR #511:
URL: https://github.com/apache/spark-website/pull/511#discussion_r1548741671


##
spark-connect/index.md:
##
@@ -0,0 +1,151 @@
+---
+layout: global
+type: "page singular"
+title: Spark Connect
+description: Spark Connect makes remote Spark development easier.
+subproject: Spark Connect
+---
+
+This post explains the Spark Connect architecture, the benefits of Spark 
Connect, and how to upgrade to Spark Connect.

Review Comment:
   `This post` sounds a little weird in Apache Spark website. Is this copied 
from your blog?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



Re: [PR] Add Spark Connect page [spark-website]

2024-04-02 Thread via GitHub


srowen commented on PR #511:
URL: https://github.com/apache/spark-website/pull/511#issuecomment-2033256929

   Merged to asf-site


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



Re: [PR] Add Spark Connect page [spark-website]

2024-04-02 Thread via GitHub


srowen closed pull request #511: Add Spark Connect page
URL: https://github.com/apache/spark-website/pull/511


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47697][INFRA] Add Scala style check for invalid MDC usage

2024-04-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 27fdf96842ea [SPARK-47697][INFRA] Add Scala style check for invalid 
MDC usage
27fdf96842ea is described below

commit 27fdf96842ea5a98ea5835bbf78b33b26db1fd3b
Author: Gengliang Wang 
AuthorDate: Tue Apr 2 15:45:23 2024 -0700

[SPARK-47697][INFRA] Add Scala style check for invalid MDC usage

### What changes were proposed in this pull request?

Add a Scala style check that flags invalid MDC usage such as `s"Task ${MDC(TASK_ID, taskId)} failed"`, which should instead be `log"Task ${MDC(TASK_ID, taskId)} failed"`.
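For illustration, a hedged sketch of what the rule distinguishes (this assumes Spark's internal structured-logging helpers, i.e. the `Logging` trait with its `log` interpolator plus `MDC` and `LogKey` from `org.apache.spark.internal`; it only compiles inside the Spark code base):

```scala
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKey.TASK_ID

class TaskRunner extends Logging {
  def reportFailure(taskId: Long): Unit = {
    // Invalid (now caught by scalastyle): the s-interpolator turns MDC into
    // plain string concatenation, so the structured TASK_ID field is lost.
    // logError(s"Task ${MDC(TASK_ID, taskId)} failed")

    // Valid: the log-interpolator keeps TASK_ID as a structured MDC field.
    logError(log"Task ${MDC(TASK_ID, taskId)} failed")
  }
}
```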

### Why are the changes needed?

This makes development and PR review of the structured logging migration 
easier.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

Manual test, verified it will throw errors on invalid MDC usage.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45823 from gengliangwang/style.

Authored-by: Gengliang Wang 
Signed-off-by: Dongjoon Hyun 
---
 scalastyle-config.xml | 5 +
 1 file changed, 5 insertions(+)

diff --git a/scalastyle-config.xml b/scalastyle-config.xml
index 50bd5a33ccb2..cd5a576c086f 100644
--- a/scalastyle-config.xml
+++ b/scalastyle-config.xml
@@ -150,6 +150,11 @@ This file is divided into 3 sections:
   // scalastyle:on println]]>
   
 
+  
+s".*\$\{MDC\(
+
+  
+
   
 spark(.sqlContext)?.sparkContext.hadoopConfiguration
 

Re: [PR] Add Spark Connect page [spark-website]

2024-04-02 Thread via GitHub


MrPowers commented on PR #511:
URL: https://github.com/apache/spark-website/pull/511#issuecomment-2033168687

   @srowen - I rebased, good call.
   
   I could also break this up into two separate commits if that would help.  
Sorry for sending the massive PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



Re: [PR] Add Spark Connect page [spark-website]

2024-04-02 Thread via GitHub


srowen commented on PR #511:
URL: https://github.com/apache/spark-website/pull/511#issuecomment-2033120138

   For good measure, rebase this? I think this also includes your change that 
was merged months ago at 
https://github.com/apache/spark-website/commit/a6ce63fb9c82dc8f25f42f377b487c0de2aff826
 , so I suspect this branch is stale. It might merge, but would make it easier 
to review, a little bit


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



Re: [PR] Add Spark Connect page [spark-website]

2024-04-02 Thread via GitHub


MrPowers commented on PR #511:
URL: https://github.com/apache/spark-website/pull/511#issuecomment-2033080398

   Here's the main file that should be reviewed: 
https://github.com/apache/spark-website/pull/511/files#diff-65090e623c6aabb88b162b7e17e2a81dc359e84234e9fac9561340f7acd05431


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[PR] Add Spark Connect page [spark-website]

2024-04-02 Thread via GitHub


MrPowers opened a new pull request, #511:
URL: https://github.com/apache/spark-website/pull/511

   This PR adds a Spark Connect page.
   
   Sorry for the huge PR.  This PR adds Spark Connect to the dropdown which 
apparently modifies a bunch of HTML files.  Let me know if this is OK!
   
   Screenshot: https://github.com/apache/spark-website/assets/2722395/e1279758-e216-4123-acbc-1b244d9c63f6
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47699][BUILD] Upgrade `gcs-connector` to 2.2.21 and add a note for 3.0.0

2024-04-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d11d9cf729ab [SPARK-47699][BUILD] Upgrade `gcs-connector` to 2.2.21 
and add a note for 3.0.0
d11d9cf729ab is described below

commit d11d9cf729ab699c68770337d35043ebf58195cf
Author: Dongjoon Hyun 
AuthorDate: Tue Apr 2 13:31:18 2024 -0700

[SPARK-47699][BUILD] Upgrade `gcs-connector` to 2.2.21 and add a note for 
3.0.0

### What changes were proposed in this pull request?

This PR aims to upgrade `gcs-connector` to 2.2.21 and add a note for 3.0.0.

### Why are the changes needed?

This PR aims to upgrade `gcs-connector` to bring the latest bug fixes.

However, due to the following, we stick with 2.2.21.
- https://github.com/GoogleCloudDataproc/hadoop-connectors/issues/1114
  - `gcs-connector` 2.2.21 has shaded Guava 32.1.2-jre.
- 
https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/15c8ee41a15d6735442f36333f1d67792c93b9cf/pom.xml#L100

  - `gcs-connector` 3.0.0 has shaded Guava 31.1-jre.
- 
https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/667bf17291dbaa96a60f06df58c7a528bc4a8f79/pom.xml#L97

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually.
```
$ dev/make-distribution.sh -Phadoop-cloud
$ cd dist
$ export KEYFILE=~/.ssh/apache-spark.json
$ export EMAIL=$(jq -r '.client_email' < $KEYFILE)
$ export PRIVATE_KEY_ID=$(jq -r '.private_key_id' < $KEYFILE)
$ export PRIVATE_KEY="$(jq -r '.private_key' < $KEYFILE)"
$ bin/spark-shell \
-c spark.hadoop.fs.gs.auth.service.account.email=$EMAIL \
-c 
spark.hadoop.fs.gs.auth.service.account.private.key.id=$PRIVATE_KEY_ID \
-c 
spark.hadoop.fs.gs.auth.service.account.private.key="$PRIVATE_KEY"
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0-SNAPSHOT
      /_/

Using Scala version 2.13.13 (OpenJDK 64-Bit Server VM, Java 21.0.2)
Type in expressions to have them evaluated.
Type :help for more information.
{"ts":"2024-04-02T13:08:31.513-0700","level":"WARN","msg":"Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable","logger":"org.apache.hadoop.util.NativeCodeLoader"}
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1712088511841).
Spark session available as 'spark'.

scala> spark.read.text("gs://apache-spark-bucket/README.md").count()
val res0: Long = 124

scala> 
spark.read.orc("examples/src/main/resources/users.orc").write.mode("overwrite").orc("gs://apache-spark-bucket/users.orc")

scala> spark.read.orc("gs://apache-spark-bucket/users.orc").show()
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          NULL|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45824 from dongjoon-hyun/SPARK-47699.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml   | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index a564ec9f044a..c6913ceeff13 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -66,7 +66,7 @@ 
eclipse-collections-api/11.1.0//eclipse-collections-api-11.1.0.jar
 eclipse-collections/11.1.0//eclipse-collections-11.1.0.jar
 esdk-obs-java/3.20.4.2//esdk-obs-java-3.20.4.2.jar
 flatbuffers-java/23.5.26//flatbuffers-java-23.5.26.jar
-gcs-connector/hadoop3-2.2.20/shaded/gcs-connector-hadoop3-2.2.20-shaded.jar
+gcs-connector/hadoop3-2.2.21/shaded/gcs-connector-hadoop3-2.2.21-shaded.jar
 gmetric4j/1.0.10//gmetric4j-1.0.10.jar
 gson/2.2.4//gson-2.2.4.jar
 guava/14.0.1//guava-14.0.1.jar
diff --git a/pom.xml b/pom.xml
index b70d091796a5..ca949a05c81c 100644
--- a/pom.xml
+++ b/pom.xml
@@ -163,7 +163,8 @@
 2.24.6
 
 0.12.8
-hadoop3-2.2.20
+
+hadoop3-2.2.21
 
 4.5.14
 4.4.16


-
To unsubscribe, e-mail: 

(spark) branch master updated (db0975cb2a1c -> e6144e4d75b7)

2024-04-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from db0975cb2a1c [SPARK-47602][CORE] Resource managers: Migrate logError 
with variables to structured logging framework
 add e6144e4d75b7 [SPARK-47695][BUILD] Upgrade AWS SDK v2 to 2.24.6

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-3-hive-2.3 |  2 +-
 hadoop-cloud/pom.xml  | 11 +++
 pom.xml   |  2 +-
 3 files changed, 13 insertions(+), 2 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (8d4e9647c971 -> db0975cb2a1c)

2024-04-02 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 8d4e9647c971 [SPARK-47684][SQL] Postgres: Map length unspecified 
bpchar to StringType
 add db0975cb2a1c [SPARK-47602][CORE] Resource managers: Migrate logError 
with variables to structured logging framework

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/internal/LogKey.scala   | 14 ++-
 .../scala/org/apache/spark/internal/Logging.scala  |  7 ++--
 .../scala/org/apache/spark/util/MDCSuite.scala | 49 ++
 .../apache/spark/util/PatternLoggingSuite.scala|  6 ++-
 .../apache/spark/util/StructuredLoggingSuite.scala | 40 --
 .../cluster/k8s/ExecutorPodsAllocator.scala|  8 ++--
 .../k8s/integrationtest/DepsTestsSuite.scala   |  3 +-
 .../spark/deploy/yarn/ApplicationMaster.scala  | 11 +++--
 .../org/apache/spark/deploy/yarn/Client.scala  |  5 ++-
 .../apache/spark/deploy/yarn/YarnAllocator.scala   |  6 ++-
 .../cluster/YarnClientSchedulerBackend.scala   |  7 ++--
 11 files changed, 133 insertions(+), 23 deletions(-)
 create mode 100644 
common/utils/src/test/scala/org/apache/spark/util/MDCSuite.scala


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (ba98b7ae53f8 -> 8d4e9647c971)

2024-04-02 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from ba98b7ae53f8 [SPARK-47689][SQL] Do not wrap query execution error 
during data writing
 add 8d4e9647c971 [SPARK-47684][SQL] Postgres: Map length unspecified 
bpchar to StringType

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/jdbc/PostgresIntegrationSuite.scala  | 32 +++---
 .../apache/spark/sql/jdbc/PostgresDialect.scala|  4 +++
 2 files changed, 14 insertions(+), 22 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.4 updated: [SPARK-47666][SQL][3.4] Fix NPE when reading mysql bit array as LongType

2024-04-02 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 9f8eb54c7566 [SPARK-47666][SQL][3.4] Fix NPE when reading mysql bit 
array as LongType
9f8eb54c7566 is described below

commit 9f8eb54c7566a2f99396b615d11f57c458e30936
Author: Kent Yao 
AuthorDate: Tue Apr 2 20:48:15 2024 +0800

[SPARK-47666][SQL][3.4] Fix NPE when reading mysql bit array as LongType

### What changes were proposed in this pull request?

This PR fixes NPE when reading mysql bit array as LongType
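To make the failure mode concrete, a standalone sketch of the conversion with the null guard this patch introduces (the helper name is illustrative, not Spark's):

```scala
// MySQL BIT(n) values arrive through JDBC as byte arrays and are folded into
// a Long; ResultSet.getBytes returns null for SQL NULL, which previously
// caused the NPE because the fold ran unconditionally.
def bitArrayToLong(bytes: Array[Byte]): Any = {
  if (bytes == null) {
    null
  } else {
    var ans = 0L
    var j = 0
    while (j < bytes.length) {
      ans = 256 * ans + (255 & bytes(j))
      j += 1
    }
    ans
  }
}
```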

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests
### Was this patch authored or co-authored using generative AI tooling?

no

Closes #45793 from yaooqinn/PR_TOOL_PICK_PR_45790_BRANCH-3.4.

Authored-by: Kent Yao 
Signed-off-by: Kent Yao 
---
 .../apache/spark/sql/jdbc/MySQLIntegrationSuite.scala  | 10 +-
 .../sql/execution/datasources/jdbc/JdbcUtils.scala | 18 ++
 2 files changed, 19 insertions(+), 9 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
index d0fcbfb7aaa8..883bb0ce7bab 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
@@ -61,6 +61,9 @@ class MySQLIntegrationSuite extends 
DockerJDBCIntegrationSuite {
   + "17, 7, 123456789, 123456789012345, 
123456789012345.123456789012345, "
   + "42.75, 1.0002)").executeUpdate()
 
+conn.prepareStatement("INSERT INTO numbers VALUES (null, null, "
+  + "null, null, null, null, null, null, null)").executeUpdate()
+
 conn.prepareStatement("CREATE TABLE dates (d DATE, t TIME, dt DATETIME, ts 
TIMESTAMP, "
   + "yr YEAR)").executeUpdate()
 conn.prepareStatement("INSERT INTO dates VALUES ('1991-11-09', '13:31:24', 
"
@@ -100,7 +103,7 @@ class MySQLIntegrationSuite extends 
DockerJDBCIntegrationSuite {
   test("Numeric types") {
 val df = sqlContext.read.jdbc(jdbcUrl, "numbers", new Properties)
 val rows = df.collect()
-assert(rows.length == 1)
+assert(rows.length == 2)
 val types = rows(0).toSeq.map(x => x.getClass.toString)
 assert(types.length == 9)
 assert(types(0).equals("class java.lang.Boolean"))
@@ -205,6 +208,11 @@ class MySQLIntegrationSuite extends 
DockerJDBCIntegrationSuite {
""".stripMargin.replaceAll("\n", " "))
 assert(sql("select x, y from queryOption").collect.toSet == expectedResult)
   }
+
+  test("SPARK-47666: Check nulls for result set getters") {
+val nulls = spark.read.jdbc(jdbcUrl, "numbers", new 
Properties).tail(1).head
+assert(nulls === Row(null, null, null, null, null, null, null, null, null))
+  }
 }
 
 /**
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
index 5c76c6b1095b..90778c3053dd 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
@@ -426,14 +426,16 @@ object JdbcUtils extends Logging with SQLConfHelper {
 
 case LongType if metadata.contains("binarylong") =>
   (rs: ResultSet, row: InternalRow, pos: Int) =>
-val bytes = rs.getBytes(pos + 1)
-var ans = 0L
-var j = 0
-while (j < bytes.length) {
-  ans = 256 * ans + (255 & bytes(j))
-  j = j + 1
-}
-row.setLong(pos, ans)
+val l = nullSafeConvert[Array[Byte]](rs.getBytes(pos + 1), bytes => {
+  var ans = 0L
+  var j = 0
+  while (j < bytes.length) {
+ans = 256 * ans + (255 & bytes(j))
+j = j + 1
+  }
+  ans
+})
+row.update(pos, l)
 
 case LongType =>
   (rs: ResultSet, row: InternalRow, pos: Int) =>


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47689][SQL] Do not wrap query execution error during data writing

2024-04-02 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ba98b7ae53f8 [SPARK-47689][SQL] Do not wrap query execution error 
during data writing
ba98b7ae53f8 is described below

commit ba98b7ae53f8b106cb0cfd8f0b3fe1a8068c1ce6
Author: Wenchen Fan 
AuthorDate: Tue Apr 2 20:34:32 2024 +0800

[SPARK-47689][SQL] Do not wrap query execution error during data writing

### What changes were proposed in this pull request?

It's quite confusing to report `TASK_WRITE_FAILED` error when the error was 
caused by input query execution. This PR updates the error wrapping code to not 
wrap with `TASK_WRITE_FAILED` if the error was from input query execution.
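As a standalone sketch of the tagging pattern the patch uses (not Spark's actual classes): failures raised while producing input rows are wrapped so the write path can rethrow the original query error instead of reporting `TASK_WRITE_FAILED`:

```scala
// Marker wrapper for errors thrown while *reading* the input iterator.
class QueryFailure(val cause: Throwable) extends Exception(cause)

def tagQueryFailures[T](rows: Iterator[T]): Iterator[T] = new Iterator[T] {
  override def hasNext: Boolean =
    try rows.hasNext catch { case t: Throwable => throw new QueryFailure(t) }
  override def next(): T =
    try rows.next() catch { case t: Throwable => throw new QueryFailure(t) }
}

def writeAll[T](rows: Iterator[T])(write: T => Unit): Unit =
  try {
    tagQueryFailures(rows).foreach(write)
  } catch {
    // Query-side failures surface as-is; only genuine write failures get wrapped.
    case qf: QueryFailure => throw qf.cause
    case other: Throwable => throw new RuntimeException("TASK_WRITE_FAILED", other)
  }
```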

### Why are the changes needed?

better error reporting

### Does this PR introduce _any_ user-facing change?

yes, now people won't see `TASK_WRITE_FAILED` error if the error was from 
input query.

### How was this patch tested?

updated tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #45797 from cloud-fan/write-error.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../execution/datasources/FileFormatWriter.scala   |  29 ++-
 .../apache/spark/sql/CharVarcharTestSuite.scala| 256 -
 .../sql/errors/QueryExecutionAnsiErrorsSuite.scala |  16 +-
 .../org/apache/spark/sql/sources/InsertSuite.scala |  40 ++--
 .../spark/sql/HiveCharVarcharTestSuite.scala   |  20 +-
 5 files changed, 89 insertions(+), 272 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
index 9dbadbd97ec7..1df63aa14b4b 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
@@ -41,7 +41,7 @@ import org.apache.spark.sql.errors.QueryExecutionErrors
 import org.apache.spark.sql.execution.{ProjectExec, SortExec, SparkPlan, 
SQLExecution, UnsafeExternalRowSorter}
 import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec
 import org.apache.spark.sql.internal.SQLConf
-import org.apache.spark.util.{SerializableConfiguration, Utils}
+import org.apache.spark.util.{NextIterator, SerializableConfiguration, Utils}
 import org.apache.spark.util.ArrayImplicits._
 
 
@@ -401,9 +401,10 @@ object FileFormatWriter extends Logging {
   }
 
 try {
+  val queryFailureCapturedIterator = new 
QueryFailureCapturedIterator(iterator)
   Utils.tryWithSafeFinallyAndFailureCallbacks(block = {
 // Execute the task to write rows out and commit the task.
-dataWriter.writeWithIterator(iterator)
+dataWriter.writeWithIterator(queryFailureCapturedIterator)
 dataWriter.commit()
   })(catchBlock = {
 // If there is an error, abort the task
@@ -413,6 +414,8 @@ object FileFormatWriter extends Logging {
 dataWriter.close()
   })
 } catch {
+  case e: QueryFailureDuringWrite =>
+throw e.queryFailure
   case e: FetchFailedException =>
 throw e
   case f: FileAlreadyExistsException if 
SQLConf.get.fastFailFileFormatOutput =>
@@ -452,3 +455,25 @@ object FileFormatWriter extends Logging {
 }
   }
 }
+
+// A exception wrapper to indicate that the error was thrown when executing 
the query, not writing
+// the data
+private class QueryFailureDuringWrite(val queryFailure: Throwable) extends 
Throwable
+
+// An iterator wrapper to rethrow any error from the given iterator with 
`QueryFailureDuringWrite`.
+private class QueryFailureCapturedIterator(data: Iterator[InternalRow])
+  extends NextIterator[InternalRow] {
+
+  override protected def getNext(): InternalRow = try {
+if (data.hasNext) {
+  data.next()
+} else {
+  finished = true
+  null
+}
+  } catch {
+case t: Throwable => throw new QueryFailureDuringWrite(t)
+  }
+
+  override protected def close(): Unit = {}
+}
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/CharVarcharTestSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/CharVarcharTestSuite.scala
index 7e845e69c772..013177425da7 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/CharVarcharTestSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/CharVarcharTestSuite.scala
@@ -17,7 +17,7 @@
 
 package org.apache.spark.sql
 
-import org.apache.spark.{SparkConf, SparkException, SparkRuntimeException}
+import org.apache.spark.{SparkConf, SparkRuntimeException}
 import org.apache.spark.sql.catalyst.expressions.Attribute
 import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
 import 

(spark) branch master updated (03f4e45cd7e9 -> b1b1fde42bd8)

2024-04-02 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 03f4e45cd7e9 [SPARK-47685][SQL] Restore the support for `Stream` type 
in `Dataset#groupBy`
 add b1b1fde42bd8 Revert "[SPARK-47684][SQL] Postgres: Map length 
unspecified bpchar to StringType"

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/jdbc/PostgresIntegrationSuite.scala  | 34 +++---
 .../apache/spark/sql/jdbc/PostgresDialect.scala|  4 ---
 2 files changed, 23 insertions(+), 15 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-47666][SQL][3.5] Fix NPE when reading mysql bit array as LongType

2024-04-02 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 2dda441bec01 [SPARK-47666][SQL][3.5] Fix NPE when reading mysql bit 
array as LongType
2dda441bec01 is described below

commit 2dda441bec018b1386248b54ccfef46defa5a07a
Author: Kent Yao 
AuthorDate: Tue Apr 2 18:16:35 2024 +0800

[SPARK-47666][SQL][3.5] Fix NPE when reading mysql bit array as LongType

### What changes were proposed in this pull request?

This PR fixes NPE when reading mysql bit array as LongType

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests
### Was this patch authored or co-authored using generative AI tooling?

no

Closes #45792 from yaooqinn/PR_TOOL_PICK_PR_45790_BRANCH-3.5.

Authored-by: Kent Yao 
Signed-off-by: Kent Yao 
---
 .../apache/spark/sql/jdbc/MySQLIntegrationSuite.scala  | 10 +-
 .../sql/execution/datasources/jdbc/JdbcUtils.scala | 18 ++
 2 files changed, 19 insertions(+), 9 deletions(-)

diff --git 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
index 68d88fbc552a..bc7302163d9a 100644
--- 
a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
+++ 
b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
@@ -62,6 +62,9 @@ class MySQLIntegrationSuite extends 
DockerJDBCIntegrationSuite {
   + "17, 7, 123456789, 123456789012345, 
123456789012345.123456789012345, "
   + "42.75, 1.0002, -128, 255)").executeUpdate()
 
+conn.prepareStatement("INSERT INTO numbers VALUES (null, null, "
+  + "null, null, null, null, null, null, null, null, 
null)").executeUpdate()
+
 conn.prepareStatement("CREATE TABLE dates (d DATE, t TIME, dt DATETIME, ts 
TIMESTAMP, "
   + "yr YEAR)").executeUpdate()
 conn.prepareStatement("INSERT INTO dates VALUES ('1991-11-09', '13:31:24', 
"
@@ -101,7 +104,7 @@ class MySQLIntegrationSuite extends 
DockerJDBCIntegrationSuite {
   test("Numeric types") {
 val df = sqlContext.read.jdbc(jdbcUrl, "numbers", new Properties)
 val rows = df.collect()
-assert(rows.length == 1)
+assert(rows.length == 2)
 val types = rows(0).toSeq.map(x => x.getClass.toString)
 assert(types.length == 11)
 assert(types(0).equals("class java.lang.Boolean"))
@@ -212,6 +215,11 @@ class MySQLIntegrationSuite extends 
DockerJDBCIntegrationSuite {
""".stripMargin.replaceAll("\n", " "))
 assert(sql("select x, y from queryOption").collect.toSet == expectedResult)
   }
+
+  test("SPARK-47666: Check nulls for result set getters") {
+val nulls = spark.read.jdbc(jdbcUrl, "numbers", new 
Properties).tail(1).head
+assert(nulls === Row(null, null, null, null, null, null, null, null, null, 
null, null))
+  }
 }
 
 /**
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
index 3521a50cd2dd..7b5c4cfc9b6e 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
@@ -430,14 +430,16 @@ object JdbcUtils extends Logging with SQLConfHelper {
 
 case LongType if metadata.contains("binarylong") =>
   (rs: ResultSet, row: InternalRow, pos: Int) =>
-val bytes = rs.getBytes(pos + 1)
-var ans = 0L
-var j = 0
-while (j < bytes.length) {
-  ans = 256 * ans + (255 & bytes(j))
-  j = j + 1
-}
-row.setLong(pos, ans)
+val l = nullSafeConvert[Array[Byte]](rs.getBytes(pos + 1), bytes => {
+  var ans = 0L
+  var j = 0
+  while (j < bytes.length) {
+ans = 256 * ans + (255 & bytes(j))
+j = j + 1
+  }
+  ans
+})
+row.update(pos, l)
 
 case LongType =>
   (rs: ResultSet, row: InternalRow, pos: Int) =>


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (a598f654066d -> 03f4e45cd7e9)

2024-04-02 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from a598f654066d [SPARK-47664][PYTHON][CONNECT][TESTS][FOLLOW-UP] Add more 
tests
 add 03f4e45cd7e9 [SPARK-47685][SQL] Restore the support for `Stream` type 
in `Dataset#groupBy`

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/sql/RelationalGroupedDataset.scala | 4 +++-
 .../test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala | 8 +++-
 2 files changed, 10 insertions(+), 2 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47664][PYTHON][CONNECT][TESTS][FOLLOW-UP] Add more tests

2024-04-02 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a598f654066d [SPARK-47664][PYTHON][CONNECT][TESTS][FOLLOW-UP] Add more 
tests
a598f654066d is described below

commit a598f654066d79bb886c15db082f665eeaa77a55
Author: Ruifeng Zheng 
AuthorDate: Tue Apr 2 16:33:32 2024 +0800

[SPARK-47664][PYTHON][CONNECT][TESTS][FOLLOW-UP] Add more tests

### What changes were proposed in this pull request?
Add more tests

### Why are the changes needed?
for test coverage, to address 
https://github.com/apache/spark/pull/45788#discussion_r1546663788

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
new tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #45809 from zhengruifeng/col_name_val_test.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../sql/tests/connect/test_connect_basic.py| 47 +-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py 
b/python/pyspark/sql/tests/connect/test_connect_basic.py
index 786b9e2896c2..3b8e8165b4bf 100755
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -1219,6 +1219,11 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase):
 self.assert_eq(parse_attr_name("`a"), None)
 self.assert_eq(parse_attr_name("a`"), None)
 
+self.assert_eq(parse_attr_name("`a`.b"), ["a", "b"])
+self.assert_eq(parse_attr_name("`a`.`b`"), ["a", "b"])
+self.assert_eq(parse_attr_name("`a```.b"), ["a`", "b"])
+self.assert_eq(parse_attr_name("`a``.b"), None)
+
 self.assert_eq(parse_attr_name("a.b.c"), ["a", "b", "c"])
 self.assert_eq(parse_attr_name("`a`.`b`.`c`"), ["a", "b", "c"])
 self.assert_eq(parse_attr_name("a.`b`.c"), ["a", "b", "c"])
@@ -1284,7 +1289,6 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase):
 self.assertTrue(verify_col_name("m.`s`.id", cdf.schema))
 self.assertTrue(verify_col_name("`m`.`s`.`id`", cdf.schema))
 self.assertFalse(verify_col_name("m.`s.id`", cdf.schema))
-self.assertFalse(verify_col_name("m.`s.id`", cdf.schema))
 
 self.assertTrue(verify_col_name("a", cdf.schema))
 self.assertTrue(verify_col_name("`a`", cdf.schema))
@@ -1294,6 +1298,47 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase):
 self.assertTrue(verify_col_name("`a`.`v`", cdf.schema))
 self.assertFalse(verify_col_name("`a`.`x`", cdf.schema))
 
+cdf = (
+self.connect.range(10)
+.withColumn("v", CF.lit(123))
+.withColumn("s.s", CF.struct("id", "v"))
+.withColumn("m`", CF.struct("`s.s`", "v"))
+)
+
+# root
+# |-- id: long (nullable = false)
+# |-- v: string (nullable = false)
+# |-- s.s: struct (nullable = false)
+# ||-- id: long (nullable = false)
+# ||-- v: string (nullable = false)
+# |-- m`: struct (nullable = false)
+# ||-- s.s: struct (nullable = false)
+# |||-- id: long (nullable = false)
+# |||-- v: string (nullable = false)
+# ||-- v: string (nullable = false)
+
+self.assertFalse(verify_col_name("s", cdf.schema))
+self.assertFalse(verify_col_name("`s`", cdf.schema))
+self.assertFalse(verify_col_name("s.s", cdf.schema))
+self.assertFalse(verify_col_name("s.`s`", cdf.schema))
+self.assertFalse(verify_col_name("`s`.s", cdf.schema))
+self.assertTrue(verify_col_name("`s.s`", cdf.schema))
+
+self.assertFalse(verify_col_name("m", cdf.schema))
+self.assertFalse(verify_col_name("`m`", cdf.schema))
+self.assertTrue(verify_col_name("`m```", cdf.schema))
+
+self.assertFalse(verify_col_name("`m```.s", cdf.schema))
+self.assertFalse(verify_col_name("`m```.`s`", cdf.schema))
+self.assertFalse(verify_col_name("`m```.s.s", cdf.schema))
+self.assertFalse(verify_col_name("`m```.s.`s`", cdf.schema))
+self.assertTrue(verify_col_name("`m```.`s.s`", cdf.schema))
+
+self.assertFalse(verify_col_name("`m```.s.s.v", cdf.schema))
+self.assertFalse(verify_col_name("`m```.s.`s`.v", cdf.schema))
+self.assertTrue(verify_col_name("`m```.`s.s`.v", cdf.schema))
+self.assertTrue(verify_col_name("`m```.`s.s`.`v`", cdf.schema))
+
 
 if __name__ == "__main__":
 from pyspark.sql.tests.connect.test_connect_basic import *  # noqa: F401
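
The new assertions all hinge on one quoting rule: a multipart column name splits on unquoted dots, and inside a backquoted part a literal backtick is doubled. A REPL-sized Scala sketch of that quoting direction, written purely for illustration rather than taken from the parser under test:

    // Quote a single name part the way the test strings above are written.
    def quotePart(name: String): String =
      if (name.exists(c => c == '.' || c == '`')) "`" + name.replace("`", "``") + "`"
      else name

    assert(quotePart("id") == "id")      // plain names stay bare
    assert(quotePart("s.s") == "`s.s`")  // a dot forces quoting
    assert(quotePart("m`") == "`m```")   // an embedded backtick is doubled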


-
To unsubscribe, e-mail: 

(spark) branch master updated (e2e6e09a24db -> 22771a6a87e7)

2024-04-02 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from e2e6e09a24db [SPARK-47634][SQL] Add legacy support for disabling map 
key normalization
 add 22771a6a87e7 [SPARK-47684][SQL] Postgres: Map length unspecified 
bpchar to StringType

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/jdbc/PostgresIntegrationSuite.scala  | 34 +++---
 .../apache/spark/sql/jdbc/PostgresDialect.scala|  4 +++
 2 files changed, 15 insertions(+), 23 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47634][SQL] Add legacy support for disabling map key normalization

2024-04-02 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e2e6e09a24db [SPARK-47634][SQL] Add legacy support for disabling map 
key normalization
e2e6e09a24db is described below

commit e2e6e09a24dbb828b55ee35f60da22c3df4ac2c5
Author: Stevo Mitric 
AuthorDate: Tue Apr 2 16:28:27 2024 +0800

[SPARK-47634][SQL] Add legacy support for disabling map key normalization

### What changes were proposed in this pull request?
Added a `DISABLE_MAP_KEY_NORMALIZATION` option in `SQLConf` to allow for 
legacy creation of a map without key normalization (keys `0.0` and `-0.0`) in 
`ArrayBasedMapBuilder`.

### Why are the changes needed?
As a legacy fallback option.

### Does this PR introduce _any_ user-facing change?
New `DISABLE_MAP_KEY_NORMALIZATION` config option.

### How was this patch tested?
New UT proposed in this PR

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45760 from stevomitric/stevomitric/normalize-conf.

Authored-by: Stevo Mitric 
Signed-off-by: Wenchen Fan 
---
 docs/sql-migration-guide.md  |  1 +
 .../spark/sql/catalyst/util/ArrayBasedMapBuilder.scala   | 12 +++-
 .../main/scala/org/apache/spark/sql/internal/SQLConf.scala   | 10 ++
 .../spark/sql/catalyst/util/ArrayBasedMapBuilderSuite.scala  | 10 ++
 4 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index b22665487c7b..13d6702c4cf9 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -24,6 +24,7 @@ license: |
 
 ## Upgrading from Spark SQL 3.5 to 4.0
 
+- Since Spark 4.0, the default behaviour when inserting elements in a map is 
changed to first normalize keys -0.0 to 0.0. The affected SQL functions are 
`create_map`, `map_from_arrays`, `map_from_entries`, and `map_concat`. To 
restore the previous behaviour, set 
`spark.sql.legacy.disableMapKeyNormalization` to `true`.
 - Since Spark 4.0, the default value of `spark.sql.maxSinglePartitionBytes` is 
changed from `Long.MaxValue` to `128m`. To restore the previous behavior, set 
`spark.sql.maxSinglePartitionBytes` to `9223372036854775807`(`Long.MaxValue`).
 - Since Spark 4.0, any read of SQL tables takes into consideration the SQL 
configs 
`spark.sql.files.ignoreCorruptFiles`/`spark.sql.files.ignoreMissingFiles` 
instead of the core config 
`spark.files.ignoreCorruptFiles`/`spark.files.ignoreMissingFiles`.
 - Since Spark 4.0, `spark.sql.hive.metastore` drops the support of Hive prior 
to 2.0.0 as they require JDK 8 that Spark does not support anymore. Users 
should migrate to higher versions.
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala
index d13c3c6026a2..a2d41ebf04e1 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala
@@ -53,11 +53,13 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: 
DataType) extends Seria
 
   private val mapKeyDedupPolicy = 
SQLConf.get.getConf(SQLConf.MAP_KEY_DEDUP_POLICY)
 
-  private lazy val keyNormalizer: Any => Any = keyType match {
-case FloatType => NormalizeFloatingNumbers.FLOAT_NORMALIZER
-case DoubleType => NormalizeFloatingNumbers.DOUBLE_NORMALIZER
-case _ => identity
-  }
+  private lazy val keyNormalizer: Any => Any =
+(SQLConf.get.getConf(SQLConf.DISABLE_MAP_KEY_NORMALIZATION), keyType) 
match {
+  case (false, FloatType) => NormalizeFloatingNumbers.FLOAT_NORMALIZER
+  case (false, DoubleType) => NormalizeFloatingNumbers.DOUBLE_NORMALIZER
+  case _ => identity
+}
+
 
   def put(key: Any, value: Any): Unit = {
 if (key == null) {
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index fc7425ce2bea..af4498274620 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -3247,6 +3247,16 @@ object SQLConf {
   .stringConf
   .createWithDefault("")
 
+  val DISABLE_MAP_KEY_NORMALIZATION =
+buildConf("spark.sql.legacy.disableMapKeyNormalization")
+  .internal()
+  .doc("Disables key normalization when creating a map with 
`ArrayBasedMapBuilder`. When " +
+"set to `true` it will prevent key normalization when building a map, 
which will " +
+"allow for values such as `-0.0` and `0.0` to be present as distinct 

(spark) branch master updated: [SPARK-47663][CORE][TESTS] add end to end test for task limiting according to different cpu and gpu configurations

2024-04-02 Thread weichenxu123
This is an automated email from the ASF dual-hosted git repository.

weichenxu123 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new eb9b12692601 [SPARK-47663][CORE][TESTS] add end to end test for task 
limiting according to different cpu and gpu configurations
eb9b12692601 is described below

commit eb9b126926016e0156b1d40b1d7ed33d4705d2bb
Author: Bobby Wang 
AuthorDate: Tue Apr 2 15:30:10 2024 +0800

[SPARK-47663][CORE][TESTS] add end to end test for task limiting according 
to different cpu and gpu configurations

### What changes were proposed in this pull request?
Add an end-to-end unit test to ensure that the number of tasks is
calculated correctly according to the different task CPU amount and task GPU
amount.

### Why are the changes needed?
To increase the test coverage. More details can be found at 
https://github.com/apache/spark/pull/45528#discussion_r1545905575

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
The CI can pass.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45794 from wbo4958/end2end-test.

Authored-by: Bobby Wang 
Signed-off-by: Weichen Xu 
---
 .../CoarseGrainedSchedulerBackendSuite.scala   | 47 ++
 1 file changed, 47 insertions(+)

diff --git 
a/core/src/test/scala/org/apache/spark/scheduler/CoarseGrainedSchedulerBackendSuite.scala
 
b/core/src/test/scala/org/apache/spark/scheduler/CoarseGrainedSchedulerBackendSuite.scala
index 6e94f9abe67b..a75f470deec3 100644
--- 
a/core/src/test/scala/org/apache/spark/scheduler/CoarseGrainedSchedulerBackendSuite.scala
+++ 
b/core/src/test/scala/org/apache/spark/scheduler/CoarseGrainedSchedulerBackendSuite.scala
@@ -179,6 +179,53 @@ class CoarseGrainedSchedulerBackendSuite extends 
SparkFunSuite with LocalSparkCo
 }
   }
 
+  // Every item corresponds to (CPU resources per task, GPU resources per task,
+  // and the GPU addresses assigned to all tasks).
+  Seq(
+(1, 1, Array(Array("0"), Array("1"), Array("2"), Array("3"))),
+(1, 2, Array(Array("0", "1"), Array("2", "3"))),
+(1, 4, Array(Array("0", "1", "2", "3"))),
+(2, 1, Array(Array("0"), Array("1"))),
+(4, 1, Array(Array("0"))),
+(4, 2, Array(Array("0", "1"))),
+(2, 2, Array(Array("0", "1"), Array("2", "3"))),
+(4, 4, Array(Array("0", "1", "2", "3"))),
+(1, 3, Array(Array("0", "1", "2"))),
+(3, 1, Array(Array("0")))
+  ).foreach { case (taskCpus, taskGpus, expectedGpuAddresses) =>
+test(s"SPARK-47663 end to end test validating if task cpus:${taskCpus} and 
" +
+  s"task gpus: ${taskGpus} works") {
+  withTempDir { dir =>
+val discoveryScript = createTempScriptWithExpectedOutput(
+  dir, "gpuDiscoveryScript", """{"name": "gpu","addresses":["0", "1", 
"2", "3"]}""")
+val conf = new SparkConf()
+  .set(CPUS_PER_TASK, taskCpus)
+  .setMaster("local-cluster[1, 4, 1024]")
+  .setAppName("test")
+  .set(WORKER_GPU_ID.amountConf, "4")
+  .set(WORKER_GPU_ID.discoveryScriptConf, discoveryScript)
+  .set(EXECUTOR_GPU_ID.amountConf, "4")
+  .set(TASK_GPU_ID.amountConf, taskGpus.toString)
+
+sc = new SparkContext(conf)
+eventually(timeout(executorUpTimeout)) {
+  // Ensure all executors have been launched.
+  assert(sc.getExecutorIds().length == 1)
+}
+
+val numPartitions = Seq(4 / taskCpus, 4 / taskGpus).min
+val ret = sc.parallelize(1 to 20, numPartitions).mapPartitions { _ =>
+  val tc = TaskContext.get()
+  assert(tc.cpus() == taskCpus)
+  val gpus = tc.resources()("gpu").addresses
+  Iterator.single(gpus)
+}.collect()
+
+assert(ret === expectedGpuAddresses)
+  }
+}
+  }
+
   // Here we just have test for one happy case instead of all cases: other 
cases are covered in
   // FsHistoryProviderSuite.
   test("custom log url for Spark UI is applied") {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-47686][SQL][TESTS] Use `=!=` instead of `!==` in `JoinHintSuite`

2024-04-02 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 00162b82fe3c [SPARK-47686][SQL][TESTS] Use `=!=` instead of `!==` in 
`JoinHintSuite`
00162b82fe3c is described below

commit 00162b82fe3c48e26303394d1a91d026fe8d9b4c
Author: yangjie01 
AuthorDate: Mon Apr 1 23:45:00 2024 -0700

[SPARK-47686][SQL][TESTS] Use `=!=` instead of `!==` in `JoinHintSuite`

### What changes were proposed in this pull request?
This PR uses `=!=` instead of `!==` in `JoinHintSuite`. `!==` has been
deprecated since 2.0.0, and its test already exists in `DeprecatedAPISuite`.

### Why are the changes needed?
Clean up deprecated API usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45812 from LuciferYang/SPARK-47686.

Authored-by: yangjie01 
Signed-off-by: Dongjoon Hyun 
---
 sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala
index 66746263ff90..53e47f428c3a 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala
@@ -695,11 +695,11 @@ class JoinHintSuite extends PlanTest with 
SharedSparkSession with AdaptiveSparkP
 val hintAppender = new LogAppender(s"join hint check for equi-join")
 withLogAppender(hintAppender, level = Some(Level.WARN)) {
   assertBroadcastNLJoin(
-df1.hint("SHUFFLE_HASH").join(df2, $"a1" !== $"b1"), BuildRight)
+df1.hint("SHUFFLE_HASH").join(df2, $"a1" =!= $"b1"), BuildRight)
 }
 withLogAppender(hintAppender, level = Some(Level.WARN)) {
   assertBroadcastNLJoin(
-df1.join(df2.hint("MERGE"), $"a1" !== $"b1"), BuildRight)
+df1.join(df2.hint("MERGE"), $"a1" =!= $"b1"), BuildRight)
 }
 val logs = hintAppender.loggingEvents.map(_.getMessage.getFormattedMessage)
   .filter(_.contains("is not supported in the query:"))
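
Both spellings build the same inequality expression on Column; =!= is simply the non-deprecated one, while !== still compiles but emits a deprecation warning. A minimal example (assumes Spark on the classpath):

    import org.apache.spark.sql.functions.col
    // Equivalent to NOT (a1 = b1), i.e. a non-equi join condition.
    val cond = col("a1") =!= col("b1")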


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org