(spark) branch master updated: [SPARK-47574][INFRA] Introduce Structured Logging Framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 874d033fc61b [SPARK-47574][INFRA] Introduce Structured Logging Framework 874d033fc61b is described below

commit 874d033fc61becb5679db70c804592a0f9cc37ed
Author: Gengliang Wang
AuthorDate: Thu Mar 28 22:58:51 2024 -0700

[SPARK-47574][INFRA] Introduce Structured Logging Framework

### What changes were proposed in this pull request?

Introduce the Structured Logging Framework as per [SPIP: Structured Logging Framework for Apache Spark](https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing).

* The default logging output format will be JSON lines. For example:
```
{
  "ts": "2023-03-12T12:02:46.661-0700",
  "level": "ERROR",
  "msg": "Cannot determine whether executor 289 is alive or not",
  "context": {
    "executor_id": "289"
  },
  "exception": {
    "class": "org.apache.spark.SparkException",
    "msg": "Exception thrown in awaitResult",
    "stackTrace": "..."
  },
  "source": "BlockManagerMasterEndpoint"
}
```
* Introduce a new configuration `spark.log.structuredLogging.enabled` to set the default log4j configuration. It is true by default. Users can disable it to get plain-text log output.
* The change will start with the `logError` method. Example change on the API, from
```
logError(s"Cannot determine whether executor $executorId is alive or not.", e)
```
to
```
logError(log"Cannot determine whether executor ${MDC(EXECUTOR_ID, executorId)} is alive or not.", e)
```

### Why are the changes needed?

To enhance Apache Spark's logging system by implementing structured logging. This transition will change the format of the default log output from plain text to JSON lines, making it more analyzable.

### Does this PR introduce _any_ user-facing change?

Yes, the default log output format will be JSON lines instead of plain text. Users can restore the plain-text output by disabling the configuration `spark.log.structuredLogging.enabled`. If a user has a customized log4j configuration, there are no changes in the log output.

### How was this patch tested?

New unit tests.

### Was this patch authored or co-authored using generative AI tooling?

Yes, some of the code comments are from GitHub Copilot.

Closes #45729 from gengliangwang/LogInterpolator.
Authored-by: Gengliang Wang
Signed-off-by: Gengliang Wang
---
 common/utils/pom.xml | 4 +
 .../resources/org/apache/spark/SparkLayout.json| 38
 .../org/apache/spark/log4j2-defaults.properties| 4 +-
 ...s => log4j2-pattern-layout-defaults.properties} | 0
 .../scala/org/apache/spark/internal/LogKey.scala | 25 +
 .../scala/org/apache/spark/internal/Logging.scala | 105 -
 common/utils/src/test/resources/log4j2.properties | 50 ++
 .../apache/spark/util/PatternLoggingSuite.scala| 58
 .../apache/spark/util/StructuredLoggingSuite.scala | 83
 .../org/apache/spark/deploy/SparkSubmit.scala | 5 +
 .../org/apache/spark/internal/config/package.scala | 10 ++
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 1 +
 docs/core-migration-guide.md | 2 +
 pom.xml| 5 +
 14 files changed, 386 insertions(+), 4 deletions(-)

diff --git a/common/utils/pom.xml b/common/utils/pom.xml index d360e041dd64..1dbf2a769fff 100644 --- a/common/utils/pom.xml +++ b/common/utils/pom.xml @@ -98,6 +98,10 @@ org.apache.logging.log4j log4j-1.2-api + + org.apache.logging.log4j + log4j-layout-template-json + target/scala-${scala.binary.version}/classes

diff --git a/common/utils/src/main/resources/org/apache/spark/SparkLayout.json b/common/utils/src/main/resources/org/apache/spark/SparkLayout.json new file mode 100644 index ..b0d8ea27ffbc --- /dev/null +++ b/common/utils/src/main/resources/org/apache/spark/SparkLayout.json @@ -0,0 +1,38 @@ +{ + "ts": { +"$resolver": "timestamp" + }, + "level": { +"$resolver": "level", +"field": "name" + }, + "msg": { +"$resolver": "message", +"stringified": true + }, + "context": { +"$resolver": "mdc" + }, + "exception": { +"class": { + "$resolver": "exception", + "field": "className" +}, +"msg": { + "$resolver": "exception", + "field": "message", + "stringified": true +}, +"stacktrace": { + "$resolver": "exception", + "field": "stackTrace", + "stackTrace": { +"stringified":
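To make the new API concrete, here is a minimal sketch of a call site after this change, mirroring the example from the commit message above. The class and method are hypothetical, and the import paths are inferred from the files listed in this patch (`LogKey.scala`, `Logging.scala`), so treat them as approximate:

```scala
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKey.EXECUTOR_ID

// Hypothetical call site: a class mixing in Logging attaches structured context
// to a message by wrapping values in MDC(key, value) inside the log"..." interpolator.
class ExecutorMonitor extends Logging {
  def reportUnknownState(executorId: String, e: Throwable): Unit = {
    // Rendered as plain text in "msg" and, with structured logging enabled,
    // also emitted under "context": {"executor_id": "..."} as in the JSON example above.
    logError(log"Cannot determine whether executor ${MDC(EXECUTOR_ID, executorId)} is alive or not.", e)
  }
}
```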
(spark) branch master updated: [SPARK-47629][INFRA] Add `common/variant` and `connector/kinesis-asl` to maven daily test module list
This is an automated email from the ASF dual-hosted git repository. yangjie01 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a8b247e9a50a [SPARK-47629][INFRA] Add `common/variant` and `connector/kinesis-asl` to maven daily test module list a8b247e9a50a is described below

commit a8b247e9a50ae0450360e76bc69b2c6cdf5ea6f8
Author: yangjie01
AuthorDate: Fri Mar 29 13:26:40 2024 +0800

[SPARK-47629][INFRA] Add `common/variant` and `connector/kinesis-asl` to maven daily test module list

### What changes were proposed in this pull request?

This PR adds `common/variant` and `connector/kinesis-asl` to the Maven daily test module list.

### Why are the changes needed?

To synchronize the modules tested in the Maven daily test with the current module list.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Monitor GA after merge

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45754 from LuciferYang/SPARK-47629.

Authored-by: yangjie01
Signed-off-by: yangjie01
---
 .github/workflows/maven_test.yml | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/.github/workflows/maven_test.yml b/.github/workflows/maven_test.yml index 34fa9a8b7768..b01f08a23e47 100644 --- a/.github/workflows/maven_test.yml +++ b/.github/workflows/maven_test.yml @@ -62,7 +62,7 @@ jobs: - hive2.3 modules: - >- - core,launcher,common#unsafe,common#kvstore,common#network-common,common#network-shuffle,common#sketch,common#utils + core,launcher,common#unsafe,common#kvstore,common#network-common,common#network-shuffle,common#sketch,common#utils,common#variant - >- graphx,streaming,hadoop-cloud - >- - >- repl,sql#hive-thriftserver - >- - connector#kafka-0-10,connector#kafka-0-10-sql,connector#kafka-0-10-token-provider,connector#spark-ganglia-lgpl,connector#protobuf,connector#avro + connector#kafka-0-10,connector#kafka-0-10-sql,connector#kafka-0-10-token-provider,connector#spark-ganglia-lgpl,connector#protobuf,connector#avro,connector#kinesis-asl - >- sql#api,sql#catalyst,resource-managers#yarn,resource-managers#kubernetes#core # Here, we split Hive and SQL tests into some of slow ones and the rest of them.
@@ -188,20 +188,21 @@ jobs: export MAVEN_OPTS="-Xss64m -Xmx4g -Xms4g -XX:ReservedCodeCacheSize=128m -Dorg.slf4j.simpleLogger.defaultLogLevel=WARN" export MAVEN_CLI_OPTS="--no-transfer-progress" export JAVA_VERSION=${{ matrix.java }} + export ENABLE_KINESIS_TESTS=0 # Replace with the real module name, for example, connector#kafka-0-10 -> connector/kafka-0-10 export TEST_MODULES=`echo "$MODULES_TO_TEST" | sed -e "s%#%/%g"` - ./build/mvn $MAVEN_CLI_OPTS -DskipTests -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pspark-ganglia-lgpl -Djava.version=${JAVA_VERSION/-ea} clean install + ./build/mvn $MAVEN_CLI_OPTS -DskipTests -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Djava.version=${JAVA_VERSION/-ea} clean install if [[ "$INCLUDED_TAGS" != "" ]]; then -./build/mvn $MAVEN_CLI_OPTS -pl "$TEST_MODULES" -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pspark-ganglia-lgpl -Djava.version=${JAVA_VERSION/-ea} -Dtest.include.tags="$INCLUDED_TAGS" test -fae +./build/mvn $MAVEN_CLI_OPTS -pl "$TEST_MODULES" -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Djava.version=${JAVA_VERSION/-ea} -Dtest.include.tags="$INCLUDED_TAGS" test -fae elif [[ "$MODULES_TO_TEST" == "connect" ]]; then ./build/mvn $MAVEN_CLI_OPTS -Dtest.exclude.tags="$EXCLUDED_TAGS" -Djava.version=${JAVA_VERSION/-ea} -pl connector/connect/client/jvm,connector/connect/common,connector/connect/server test -fae elif [[ "$EXCLUDED_TAGS" != "" ]]; then -./build/mvn $MAVEN_CLI_OPTS -pl "$TEST_MODULES" -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pspark-ganglia-lgpl -Djava.version=${JAVA_VERSION/-ea} -Dtest.exclude.tags="$EXCLUDED_TAGS" test -fae +./build/mvn $MAVEN_CLI_OPTS -pl "$TEST_MODULES" -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Djava.version=${JAVA_VERSION/-ea} -Dtest.exclude.tags="$EXCLUDED_TAGS" test -fae elif [[ "$MODULES_TO_TEST" == *"sql#hive-thriftserver"* ]]; then # To avoid a compilation loop, for the `sql/hive-thriftserver` module,
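For readers puzzling over the `#` characters in the module lists above: they stand in for `/` in nested module paths and are rewritten before being handed to Maven's `-pl` flag, via the `sed -e "s%#%/%g"` shown in the workflow script. A tiny Scala equivalent of that mapping, with illustrative values:

```scala
// "connector#kafka-0-10" -> "connector/kafka-0-10", per the workflow comment above.
val modulesToTest = "common#utils,common#variant,connector#kinesis-asl"
val testModules = modulesToTest.replace('#', '/')
assert(testModules == "common/utils,common/variant,connector/kinesis-asl")
```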
(spark) branch master updated: [SPARK-47568][SS] Fix race condition between maintenance thread and load/commit for snapshot files
This is an automated email from the ASF dual-hosted git repository. kabhwan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 0b844e52b35b [SPARK-47568][SS] Fix race condition between maintenance thread and load/commit for snapshot files 0b844e52b35b is described below

commit 0b844e52b35b0491717ba9f0ae8fe2e0cf45e88d
Author: Bhuwan Sahni
AuthorDate: Fri Mar 29 13:23:15 2024 +0900

[SPARK-47568][SS] Fix race condition between maintenance thread and load/commit for snapshot files

### What changes were proposed in this pull request?

This PR fixes a race condition between the maintenance thread and the task thread when change-log checkpointing is enabled, and ensures all snapshots are valid.

1. The maintenance thread currently relies on the class variable `lastSnapshot` to find the latest checkpoint and uploads it to DFS. This checkpoint can be modified at commit time by the task thread if a new snapshot is created.
2. The task thread was not resetting `lastSnapshot` at load time, which can result in newer snapshots (if an old version is loaded) being considered valid and uploaded to DFS. This results in `VersionIdMismatch` errors.

### Why are the changes needed?

These are logical bugs which can cause `VersionIdMismatch` errors, forcing the user to discard the snapshot and restart the query.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit test cases.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45724 from sahnib/rocks-db-fix.

Authored-by: Bhuwan Sahni
Signed-off-by: Jungtaek Lim
---
 .../sql/execution/streaming/state/RocksDB.scala| 66 ++
 .../streaming/state/RocksDBFileManager.scala | 3 +-
 .../execution/streaming/state/RocksDBSuite.scala | 37
 3 files changed, 82 insertions(+), 24 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala index 1817104a5c22..fcefc1666f3a 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala @@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.streaming.state import java.io.File import java.util.Locale +import java.util.concurrent.TimeUnit import javax.annotation.concurrent.GuardedBy import scala.collection.{mutable, Map} @@ -49,6 +50,7 @@ case object RollbackStore extends RocksDBOpType("rollback_store") case object CloseStore extends RocksDBOpType("close_store") case object ReportStoreMetrics extends RocksDBOpType("report_store_metrics") case object StoreTaskCompletionListener extends RocksDBOpType("store_task_completion_listener") +case object StoreMaintenance extends RocksDBOpType("store_maintenance") /** * Class representing a RocksDB instance that checkpoints version of data to DFS.
@@ -184,19 +186,23 @@ class RocksDB( loadedVersion = latestSnapshotVersion // reset last snapshot version -lastSnapshotVersion = 0L +if (lastSnapshotVersion > latestSnapshotVersion) { + // discard any newer snapshots + lastSnapshotVersion = 0L + latestSnapshot = None +} openDB() numKeysOnWritingVersion = if (!conf.trackTotalNumberOfRows) { - // we don't track the total number of rows - discard the number being track - -1L -} else if (metadata.numKeys < 0) { - // we track the total number of rows, but the snapshot doesn't have tracking number - // need to count keys now - countKeys() -} else { - metadata.numKeys -} +// we don't track the total number of rows - discard the number being track +-1L + } else if (metadata.numKeys < 0) { +// we track the total number of rows, but the snapshot doesn't have tracking number +// need to count keys now +countKeys() + } else { +metadata.numKeys + } if (loadedVersion != version) replayChangelog(version) // After changelog replay the numKeysOnWritingVersion will be updated to // the correct number of keys in the loaded version. @@ -571,16 +577,14 @@ class RocksDB( // background operations. val cp = Checkpoint.create(db) cp.createCheckpoint(checkpointDir.toString) - synchronized { -// if changelog checkpointing is disabled, the snapshot is uploaded synchronously -// inside the uploadSnapshot() called below. -// If
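The essence of the fix is that the snapshot hand-off between the two threads must be atomic. A minimal sketch of that pattern follows; this is not the actual `RocksDB.scala` code, and the names here are hypothetical:

```scala
// The task thread publishes a snapshot under a lock; the maintenance thread takes
// ownership of it in a single atomic step, so a snapshot can never be replaced or
// invalidated while it is being read for upload.
class SnapshotHolder[T](versionOf: T => Long) {
  private var latestSnapshot: Option[T] = None

  // Task thread, commit time: publish the snapshot just created.
  def publish(snapshot: T): Unit = synchronized {
    latestSnapshot = Some(snapshot)
  }

  // Task thread, load time: discard any snapshot newer than the version being loaded,
  // so stale "future" snapshots are never uploaded (the VersionIdMismatch scenario).
  def discardNewerThan(loadedVersion: Long): Unit = synchronized {
    if (latestSnapshot.exists(versionOf(_) > loadedVersion)) {
      latestSnapshot = None
    }
  }

  // Maintenance thread: take the snapshot, if any, and clear it atomically.
  def takeForUpload(): Option[T] = synchronized {
    val snapshot = latestSnapshot
    latestSnapshot = None
    snapshot
  }
}
```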
(spark) branch master updated (a57f94b03e30 -> b3146721552f)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from a57f94b03e30 [SPARK-47638][PS][CONNECT] Skip column name validation in PS
add b3146721552f [SPARK-47511][SQL] Canonicalize With expressions by re-assigning IDs

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/expressions/With.scala | 81 ++--
 .../sql/catalyst/expressions/nullExpressions.scala | 15 ++-
 .../catalyst/optimizer/RewriteWithExpression.scala | 10 +-
 .../org/apache/spark/sql/internal/SQLConf.scala| 8 ++
 .../catalyst/expressions/CanonicalizeSuite.scala | 105 -
 5 files changed, 205 insertions(+), 14 deletions(-)
(spark) branch master updated: [SPARK-47638][PS][CONNECT] Skip column name validation in PS
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a57f94b03e30 [SPARK-47638][PS][CONNECT] Skip column name validation in PS a57f94b03e30 is described below

commit a57f94b03e302e97c1d650a9f64596a82506df2f
Author: Ruifeng Zheng
AuthorDate: Fri Mar 29 10:51:56 2024 +0800

[SPARK-47638][PS][CONNECT] Skip column name validation in PS

### What changes were proposed in this pull request?

Skip column name validation in PS.

### Why are the changes needed?

`scol_for` is an internal method, not exposed to users, so this eager validation seems unnecessary.

When a bad column name is used:
before: `scol_for` immediately fails
after: silent at the `scol_for` call; fails later at analysis (e.g. dtypes/schema) or execution

### Does this PR introduce _any_ user-facing change?

no, test only

### How was this patch tested?

ci

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #45752 from zhengruifeng/test_avoid_validation.

Authored-by: Ruifeng Zheng
Signed-off-by: Ruifeng Zheng
---
 python/pyspark/pandas/utils.py | 5 -
 python/pyspark/sql/connect/dataframe.py | 24 +++-
 2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/python/pyspark/pandas/utils.py b/python/pyspark/pandas/utils.py index 57c1ddbe6ae3..0fe2944bcabe 100644 --- a/python/pyspark/pandas/utils.py +++ b/python/pyspark/pandas/utils.py @@ -608,7 +608,10 @@ def lazy_property(fn: Callable[[Any], Any]) -> property: def scol_for(sdf: PySparkDataFrame, column_name: str) -> Column: """Return Spark Column for the given column name.""" -return sdf["`{}`".format(column_name)] +if is_remote(): +return sdf._col("`{}`".format(column_name)) # type: ignore[operator] +else: +return sdf["`{}`".format(column_name)] def column_labels_level(column_labels: List[Label]) -> int:

diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py index 672ac8b9c25c..b2d0cc5fedca 100644 --- a/python/pyspark/sql/connect/dataframe.py +++ b/python/pyspark/sql/connect/dataframe.py @@ -645,7 +645,7 @@ class DataFrame: if isinstance(col, Column): return col else: -return Column(ColumnReference(col, df._plan._plan_id)) +return df._col(col) return DataFrame( plan.AsOfJoin( @@ -1715,12 +1715,7 @@ class DataFrame: error_class="ATTRIBUTE_NOT_SUPPORTED", message_parameters={"attr_name": name} ) -return Column( -ColumnReference( -unparsed_identifier=name, -plan_id=self._plan._plan_id, -) -) +return self._col(name) __getattr__.__doc__ = PySparkDataFrame.__getattr__.__doc__ @@ -1756,12 +1751,7 @@ class DataFrame: if item not in self.columns: self.select(item).isLocal() -return Column( -ColumnReference( -unparsed_identifier=item, -plan_id=self._plan._plan_id, -) -) +return self._col(item) elif isinstance(item, Column): return self.filter(item) elif isinstance(item, (list, tuple)): @@ -1774,6 +1764,14 @@ class DataFrame: message_parameters={"arg_name": "item", "arg_type": type(item).__name__}, ) +def _col(self, name: str) -> Column: +return Column( +ColumnReference( +unparsed_identifier=name, +plan_id=self._plan._plan_id, +) +) + def __dir__(self) -> List[str]: attrs = set(super().__dir__()) attrs.update(self.columns)
(spark) branch master updated (fde7a7f3d06b -> 41cbd56ba725)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from fde7a7f3d06b [SPARK-47546][SQL][FOLLOWUP] Improve validation when reading Variant from Parquet using non-vectorized reader
add 41cbd56ba725 [SPARK-47525][SQL] Support subquery correlation joining on map attributes

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/expressions/DynamicPruning.scala | 7 +
 .../FunctionTableSubqueryArgumentExpression.scala | 2 +
 .../spark/sql/catalyst/expressions/subquery.scala | 9 +
 .../spark/sql/catalyst/optimizer/Optimizer.scala | 1 +
 .../PullOutNestedDataOuterRefExpressions.scala | 136
 .../org/apache/spark/sql/internal/SQLConf.scala| 9 +
 .../subquery/subquery-nested-data.sql.out | 350 +
 .../inputs/subquery/subquery-nested-data.sql | 50 +++
 .../results/subquery/subquery-nested-data.sql.out | 314 ++
 .../scala/org/apache/spark/sql/SubquerySuite.scala | 49 +--
 10 files changed, 909 insertions(+), 18 deletions(-)
 create mode 100644 sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PullOutNestedDataOuterRefExpressions.scala
 create mode 100644 sql/core/src/test/resources/sql-tests/analyzer-results/subquery/subquery-nested-data.sql.out
 create mode 100644 sql/core/src/test/resources/sql-tests/inputs/subquery/subquery-nested-data.sql
 create mode 100644 sql/core/src/test/resources/sql-tests/results/subquery/subquery-nested-data.sql.out
(spark) branch master updated (6cf5f9508f38 -> fde7a7f3d06b)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 6cf5f9508f38 [SPARK-47623][PYTHON][CONNECT][TESTS] Use `QuietTest` in parity tests
add fde7a7f3d06b [SPARK-47546][SQL][FOLLOWUP] Improve validation when reading Variant from Parquet using non-vectorized reader

No new revisions were added by this update.

Summary of changes:
 .../datasources/parquet/ParquetRowConverter.scala | 38 +-
 .../scala/org/apache/spark/sql/VariantSuite.scala | 21 +++-
 2 files changed, 43 insertions(+), 16 deletions(-)
(spark) branch master updated (009b50e87c05 -> 6cf5f9508f38)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 009b50e87c05 [SPARK-47631][SQL] Remove unused `SQLConf.parquetOutputCommitterClass` method
add 6cf5f9508f38 [SPARK-47623][PYTHON][CONNECT][TESTS] Use `QuietTest` in parity tests

No new revisions were added by this update.

Summary of changes:
 .../pyspark/sql/tests/connect/test_parity_arrow.py | 9 -
 .../connect/test_parity_arrow_cogrouped_map.py | 6 ++
 .../tests/connect/test_parity_arrow_grouped_map.py | 6 ++
 .../sql/tests/connect/test_parity_arrow_map.py | 3 +--
 .../sql/tests/connect/test_parity_collection.py| 3 ---
 .../connect/test_parity_pandas_cogrouped_map.py| 13 ++--
 .../connect/test_parity_pandas_grouped_map.py | 23 --
 .../sql/tests/connect/test_parity_pandas_map.py| 20 +--
 .../connect/test_parity_pandas_udf_grouped_agg.py | 5 +
 .../tests/connect/test_parity_pandas_udf_scalar.py | 17 +---
 .../tests/connect/test_parity_pandas_udf_window.py | 2 +-
 .../pyspark/sql/tests/connect/test_parity_udf.py | 12 ---
 .../sql/tests/pandas/test_pandas_cogrouped_map.py | 15 +++---
 .../sql/tests/pandas/test_pandas_grouped_map.py| 21 ++--
 python/pyspark/sql/tests/pandas/test_pandas_map.py | 16 +++
 python/pyspark/sql/tests/pandas/test_pandas_udf.py | 3 +--
 .../tests/pandas/test_pandas_udf_grouped_agg.py| 6 +++---
 .../sql/tests/pandas/test_pandas_udf_scalar.py | 18 -
 .../sql/tests/pandas/test_pandas_udf_window.py | 4 ++--
 python/pyspark/sql/tests/test_arrow.py | 13 ++--
 .../pyspark/sql/tests/test_arrow_cogrouped_map.py | 15 +-
 python/pyspark/sql/tests/test_arrow_grouped_map.py | 15 +-
 python/pyspark/sql/tests/test_arrow_map.py | 3 +--
 python/pyspark/sql/tests/test_collection.py| 5 ++---
 python/pyspark/sql/tests/test_creation.py | 3 +--
 python/pyspark/sql/tests/test_udf.py | 10 +-
 python/pyspark/testing/connectutils.py | 6 ++
 python/pyspark/testing/utils.py| 5 +
 28 files changed, 91 insertions(+), 186 deletions(-)
(spark) branch master updated (7f1eadcd9cbb -> 009b50e87c05)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 7f1eadcd9cbb [SPARK-47366][PYTHON] Add pyspark and dataframe parse_json aliases
add 009b50e87c05 [SPARK-47631][SQL] Remove unused `SQLConf.parquetOutputCommitterClass` method

No new revisions were added by this update.

Summary of changes:
 sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 2 --
 1 file changed, 2 deletions(-)
(spark) branch master updated (ca0001345d0b -> 7f1eadcd9cbb)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from ca0001345d0b [SPARK-47635][K8S] Use Java `21` instead of `21-jre` image in K8s Dockerfile
add 7f1eadcd9cbb [SPARK-47366][PYTHON] Add pyspark and dataframe parse_json aliases

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/sql/functions.scala | 11
 .../source/reference/pyspark.sql/functions.rst | 1 +
 python/pyspark/sql/connect/functions/builtin.py| 7 ++
 python/pyspark/sql/functions/builtin.py| 29 ++
 python/pyspark/sql/tests/test_functions.py | 10
 .../scala/org/apache/spark/sql/functions.scala | 10
 .../scala/org/apache/spark/sql/VariantSuite.scala | 15 ++-
 7 files changed, 82 insertions(+), 1 deletion(-)
(spark) branch branch-3.5 updated: [SPARK-47636][K8S][3.5] Use Java `17` instead of `17-jre` image in K8s Dockerfile
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new edae8edc3a45 [SPARK-47636][K8S][3.5] Use Java `17` instead of `17-jre` image in K8s Dockerfile edae8edc3a45 is described below

commit edae8edc3a45a860a9402018cb44760266515154
Author: Dongjoon Hyun
AuthorDate: Thu Mar 28 16:34:25 2024 -0700

[SPARK-47636][K8S][3.5] Use Java `17` instead of `17-jre` image in K8s Dockerfile

### What changes were proposed in this pull request?

This PR aims to use Java `17` instead of `17-jre` in the K8s Dockerfile.

### Why are the changes needed?

Since Apache Spark 3.5.0, SPARK-44153 started to use `jmap` like the following. https://github.com/apache/spark/blob/c832e2ac1d04668c77493577662c639785808657/core/src/main/scala/org/apache/spark/util/Utils.scala#L2030

```
$ docker run -it --rm eclipse-temurin:17-jre jmap
/__cacert_entrypoint.sh: line 30: exec: jmap: not found
```

```
$ docker run -it --rm eclipse-temurin:17 jmap | head -n2
Usage: jmap -clstats
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs and manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45762 from dongjoon-hyun/SPARK-47636.

Authored-by: Dongjoon Hyun
Signed-off-by: Dongjoon Hyun
---
 .../kubernetes/docker/src/main/dockerfiles/spark/Dockerfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile index 88304c87a79c..22d8f1550128 100644 --- a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile +++ b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile @@ -14,7 +14,7 @@ # See the License for the specific language governing permissions and # limitations under the License. # -ARG java_image_tag=17-jre +ARG java_image_tag=17 FROM eclipse-temurin:${java_image_tag}
(spark) branch master updated (c832e2ac1d04 -> ca0001345d0b)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from c832e2ac1d04 [SPARK-47492][SQL] Widen whitespace rules in lexer
add ca0001345d0b [SPARK-47635][K8S] Use Java `21` instead of `21-jre` image in K8s Dockerfile

No new revisions were added by this update.

Summary of changes:
 .../kubernetes/docker/src/main/dockerfiles/spark/Dockerfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
(spark) branch master updated: [SPARK-47492][SQL] Widen whitespace rules in lexer
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c832e2ac1d04 [SPARK-47492][SQL] Widen whitespace rules in lexer c832e2ac1d04 is described below

commit c832e2ac1d04668c77493577662c639785808657
Author: Serge Rielau
AuthorDate: Thu Mar 28 15:51:32 2024 -0700

[SPARK-47492][SQL] Widen whitespace rules in lexer

### What changes were proposed in this pull request?

In this PR we extend the lexer's understanding of whitespace (what separates tokens) from the ASCII space, carriage return, line feed, and tab to the various Unicode flavors of "space", such as "narrow" and "wide".

### Why are the changes needed?

SQL statements are frequently copy-pasted from various sources. Many of these sources are "rich text" and based on Unicode. When doing so, it is inevitable that non-ASCII whitespace characters are copied. Today this often results in incomprehensible syntax errors: incomprehensible because the error message prints the "bad" whitespace just like an ASCII whitespace, so the user stands little chance of finding the root cause unless they use editor options to highlight non-ASCII whitespace or, by sheer luck, happen to remove it. So in this PR we acknowledge the reality and stop "discriminating" against non-ASCII whitespace.

### Does this PR introduce _any_ user-facing change?

Queries that used to fail with a syntax error now succeed.

### How was this patch tested?

Added a new set of unit tests in SparkSqlParserSuite.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45620 from srielau/SPARK-47492-Widen-whitespace-rules-in-lexer.

Lead-authored-by: Serge Rielau
Co-authored-by: Serge Rielau
Signed-off-by: Gengliang Wang
---
 .../spark/sql/catalyst/parser/SqlBaseLexer.g4 | 2 +-
 .../spark/sql/execution/SparkSqlParserSuite.scala | 80 ++
 2 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 index 7c376e226850..f5565f0a63fb 100644 --- a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 +++ b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 @@ -554,7 +554,7 @@ BRACKETED_COMMENT ; WS -: [ \r\n\t]+ -> channel(HIDDEN) +: [ \t\n\f\r\u000B\u00A0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u2028\u202F\u205F\u3000]+ -> channel(HIDDEN) ; // Catch-all for anything we can't recognize.
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala index c3768afa90f1..f60df77b7e9b 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala @@ -800,4 +800,84 @@ class SparkSqlParserSuite extends AnalysisTest with SharedSparkSession { start = 0, stop = 63)) } + + test("verify whitespace handling - standard whitespace") { +parser.parsePlan("SELECT 1") // ASCII space +parser.parsePlan("SELECT\r1") // ASCII carriage return +parser.parsePlan("SELECT\n1") // ASCII line feed +parser.parsePlan("SELECT\t1") // ASCII tab +parser.parsePlan("SELECT\u000B1") // ASCII vertical tab +parser.parsePlan("SELECT\f1") // ASCII form feed + } + + // Need to switch off scala style for Unicode characters + // scalastyle:off + test("verify whitespace handling - Unicode no-break space") { +parser.parsePlan("SELECT\u00A01") // Unicode no-break space + } + + test("verify whitespace handling - Unicode ogham space mark") { +parser.parsePlan("SELECT\u16801") // Unicode ogham space mark + } + + test("verify whitespace handling - Unicode en quad") { +parser.parsePlan("SELECT\u20001") // Unicode en quad + } + + test("verify whitespace handling - Unicode em quad") { +parser.parsePlan("SELECT\u20011") // Unicode em quad + } + + test("verify whitespace handling - Unicode en space") { +parser.parsePlan("SELECT\u20021") // Unicode en space + } + + test("verify whitespace handling - Unicode em space") { +parser.parsePlan("SELECT\u20031") // Unicode em space + } + + test("verify whitespace handling - Unicode three-per-em space") { +parser.parsePlan("SELECT\u20041") // Unicode three-per-em space + } + + test("verify whitespace handling - Unicode four-per-em space") { +parser.parsePlan("SELECT\u20051") // Unicode four-per-em space + } + + test("verify whitespace
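The user-visible effect, assuming a running `spark` session (e.g. in `spark-shell`); the statements mirror the new unit tests:

```scala
// Before this patch the lexer only treated the ASCII characters in [ \r\n\t] as
// token separators, so a pasted no-break space produced a confusing syntax error.
// With the widened WS rule these now parse like ordinary whitespace:
spark.sql("SELECT\u00A01").show() // no-break space (U+00A0)
spark.sql("SELECT\u30001").show() // ideographic space (U+3000)
```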
(spark) branch master updated (1623b2d513d2 -> d8dc0c3e5e8a)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 1623b2d513d2 [SPARK-47630][BUILD] Upgrade `zstd-jni` to 1.5.6-1
add d8dc0c3e5e8a [SPARK-47632][BUILD] Ban `com.amazonaws:aws-java-sdk-bundle` dependency

No new revisions were added by this update.

Summary of changes:
 pom.xml | 1 +
 1 file changed, 1 insertion(+)
(spark) branch master updated (2e4f2b0d307f -> 1623b2d513d2)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 2e4f2b0d307f [SPARK-47475][CORE][K8S] Support `spark.kubernetes.jars.avoidDownloadSchemes` for K8s Cluster Mode
add 1623b2d513d2 [SPARK-47630][BUILD] Upgrade `zstd-jni` to 1.5.6-1

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml | 6 +-
 2 files changed, 6 insertions(+), 2 deletions(-)
(spark) branch master updated: [SPARK-47475][CORE][K8S] Support `spark.kubernetes.jars.avoidDownloadSchemes` for K8s Cluster Mode
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2e4f2b0d307f [SPARK-47475][CORE][K8S] Support `spark.kubernetes.jars.avoidDownloadSchemes` for K8s Cluster Mode 2e4f2b0d307f is described below

commit 2e4f2b0d307fa00121de77f01826c190527ebf3d
Author: jiale_tan
AuthorDate: Thu Mar 28 08:52:27 2024 -0700

[SPARK-47475][CORE][K8S] Support `spark.kubernetes.jars.avoidDownloadSchemes` for K8s Cluster Mode

### What changes were proposed in this pull request?

During spark-submit, for the K8s cluster-mode driver, instead of always downloading jars and serving them to executors, avoid the download if the URL scheme matches `spark.kubernetes.jars.avoidDownloadSchemes` in the configuration.

### Why are the changes needed?

For the K8s cluster-mode driver, `SparkSubmit` downloads all the jars in `spark.jars` to the driver, and the jar URLs in `spark.jars` are then replaced by driver-local paths. Later, when the driver starts the `SparkContext`, it copies all the `spark.jars` to `spark.app.initial.jar.urls`, starts a file server, and replaces the driver-local paths in `spark.app.initial.jar.urls` with file-server URLs. When the executors start, they will download those driver local jars b [...]

When jars are big and the Spark application requests a lot of executors, the executors' massive concurrent download of the jars from the driver causes network saturation. In this case, the executors' jar download will time out, causing executors to be terminated. From the user's point of view, the application is trapped in a loop of massive executor loss and re-provisioning, but never gets as many live executors as requested, which leads to SLA breaches or sometimes failure.

So instead of letting the driver download the jars and then serve them to executors, if we keep the driver from downloading the jars and leave the URLs in `spark.jars` as they were, the executors will download the jars directly from the URLs provided by the user. This avoids the driver download bottleneck mentioned above, especially when the jar URLs use scalable storage schemes like s3 or hdfs.

Meanwhile, there are cases where the jar URLs use schemes less scalable than the driver file server (e.g. http, ftp), or where the jars are small or the executor count is small; the user may still want to fall back to the current solution and use the driver file server to serve the jars. So making the driver's jar downloading and serving optional by scheme (a similar idea to `FORCE_DOWNLOAD_SCHEMES` in YARN) is a good approach for the solution.

### Does this PR introduce _any_ user-facing change?

A configuration `spark.kubernetes.jars.avoidDownloadSchemes` is added.

### How was this patch tested?

- Unit tests added
- Tested with an application running on AWS EKS submitted with a 1GB jar on s3.
  - Before the fix, the application could not scale to 1k live executors.
  - After the fix, the application had no problem scaling beyond 12k live executors.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45715 from leletan/allow_k8s_executor_to_download_remote_jar.
Authored-by: jiale_tan Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/deploy/SparkSubmit.scala | 28 ++- .../org/apache/spark/internal/config/package.scala | 12 +++ .../org/apache/spark/deploy/SparkSubmitSuite.scala | 42 ++ docs/running-on-kubernetes.md | 12 +++ 4 files changed, 86 insertions(+), 8 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala index c8cbedd9ea36..c60fbe537cbd 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala @@ -401,16 +401,23 @@ private[spark] class SparkSubmit extends Logging { // SPARK-33782 : This downloads all the files , jars , archiveFiles and pyfiles to current // working directory // SPARK-43540: add current working directory into driver classpath +// SPARK-47475: make download to driver optional so executors may fetch resource from remote +// url directly to avoid overwhelming driver network when resource is big and executor count +// is high val workingDirectory = "." childClasspath += workingDirectory -def downloadResourcesToCurrentDirectory(uris: String, isArchive: Boolean = false): -String = { +def downloadResourcesToCurrentDirectory( +uris: String, +isArchive: Boolean = false, +
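A sketch of how a user might opt large remote jars out of the driver download path. The configuration key is the one added by this patch; the scheme list and jar URL are illustrative:

```scala
import org.apache.spark.SparkConf

// Hypothetical cluster-mode settings: jars with an s3a:// URL are fetched by each
// executor directly, instead of being routed through the driver's file server,
// avoiding the driver network bottleneck described above.
val conf = new SparkConf()
  .set("spark.kubernetes.jars.avoidDownloadSchemes", "s3a")
  .set("spark.jars", "s3a://my-bucket/big-app-assembly.jar")
```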
(spark) branch master updated: [MINOR][CORE] Replace `get+getOrElse` with `getOrElse` with default value in `StreamingQueryException`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 32e73dd55416 [MINOR][CORE] Replace `get+getOrElse` with `getOrElse` with default value in `StreamingQueryException` 32e73dd55416 is described below

commit 32e73dd55416b3a0a81ea6b6635e6fedde378842
Author: yangjie01
AuthorDate: Thu Mar 28 07:37:01 2024 -0700

[MINOR][CORE] Replace `get+getOrElse` with `getOrElse` with default value in `StreamingQueryException`

### What changes were proposed in this pull request?

This PR replaces `get + getOrElse` with `getOrElse` with a default value in `StreamingQueryException`.

### Why are the changes needed?

Simplify code

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45753 from LuciferYang/minor-getOrElse.

Authored-by: yangjie01
Signed-off-by: Dongjoon Hyun
---
 .../org/apache/spark/sql/streaming/StreamingQueryException.scala| 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/common/utils/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryException.scala b/common/utils/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryException.scala index 800a6dcfda8d..259f4330224c 100644 --- a/common/utils/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryException.scala +++ b/common/utils/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryException.scala @@ -48,11 +48,11 @@ class StreamingQueryException private[sql]( errorClass: String, messageParameters: Map[String, String]) = { this( - messageParameters.get("queryDebugString").getOrElse(""), + messageParameters.getOrElse("queryDebugString", ""), message, cause, - messageParameters.get("startOffset").getOrElse(""), - messageParameters.get("endOffset").getOrElse(""), + messageParameters.getOrElse("startOffset", ""), + messageParameters.getOrElse("endOffset", ""), errorClass, messageParameters) }
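For context, the idiom being replaced, shown on a plain `Map` with illustrative values:

```scala
val params = Map("startOffset" -> "0")

// Before: a lookup followed by a chained default.
val before = params.get("endOffset").getOrElse("")

// After: the single-call equivalent used in this patch.
val after = params.getOrElse("endOffset", "")

assert(before == after) // both yield "" when the key is absent
```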
(spark) branch master updated (8c4d6764674f -> 4b58a631fea9)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 8c4d6764674f [SPARK-47559][SQL] Codegen Support for variant `parse_json`
add 4b58a631fea9 [SPARK-47628][SQL] Fix Postgres bit array issue 'Cannot cast to boolean'

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/jdbc/PostgresIntegrationSuite.scala | 10 ++
 .../sql/execution/datasources/jdbc/JdbcUtils.scala| 16
 .../org/apache/spark/sql/jdbc/PostgresDialect.scala | 19 ++-
 3 files changed, 36 insertions(+), 9 deletions(-)
(spark) branch master updated: [SPARK-47559][SQL] Codegen Support for variant `parse_json`
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8c4d6764674f [SPARK-47559][SQL] Codegen Support for variant `parse_json` 8c4d6764674f is described below commit 8c4d6764674fcddf30245dfc25ef825eabba0ace Author: panbingkun AuthorDate: Thu Mar 28 20:34:42 2024 +0800 [SPARK-47559][SQL] Codegen Support for variant `parse_json` ### What changes were proposed in this pull request? The PR adds Codegen Support for `parse_json`. ### Why are the changes needed? Improve codegen coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Add new UT. - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45714 from panbingkun/ParseJson_CodeGenerator. Authored-by: panbingkun Signed-off-by: Wenchen Fan --- ...ions.scala => VariantExpressionEvalUtils.scala} | 35 ++ .../expressions/variant/variantExpressions.scala | 34 ++--- .../variant/VariantExpressionEvalUtilsSuite.scala | 125 +++ .../variant/VariantExpressionSuite.scala | 137 + .../apache/spark/sql/VariantEndToEndSuite.scala| 84 + 5 files changed, 229 insertions(+), 186 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala similarity index 56% copy from sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala copy to sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala index cab61d2b12c2..74fae91f98a6 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala @@ -19,35 +19,17 @@ package org.apache.spark.sql.catalyst.expressions.variant import scala.util.control.NonFatal -import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback import org.apache.spark.sql.catalyst.util.BadRecordException import org.apache.spark.sql.errors.QueryExecutionErrors -import org.apache.spark.sql.types._ import org.apache.spark.types.variant.{VariantBuilder, VariantSizeLimitException, VariantUtil} -import org.apache.spark.unsafe.types._ +import org.apache.spark.unsafe.types.{UTF8String, VariantVal} -// scalastyle:off line.size.limit -@ExpressionDescription( - usage = "_FUNC_(jsonStr) - Parse a JSON string as an Variant value. Throw an exception when the string is not valid JSON value.", - examples = """ -Examples: - > SELECT _FUNC_('{"a":1,"b":0.8}'); - {"a":1,"b":0.8} - """, - since = "4.0.0", - group = "variant_funcs" -) -// scalastyle:on line.size.limit -case class ParseJson(child: Expression) extends UnaryExpression - with NullIntolerant with ExpectsInputTypes with CodegenFallback { - override def inputTypes: Seq[AbstractDataType] = StringType :: Nil - - override def dataType: DataType = VariantType - - override def prettyName: String = "parse_json" +/** + * A utility class for constructing variant expressions. 
+ */ +object VariantExpressionEvalUtils { - protected override def nullSafeEval(input: Any): Any = { + def parseJson(input: UTF8String): VariantVal = { try { val v = VariantBuilder.parseJson(input.toString) new VariantVal(v.getValue, v.getMetadata) @@ -56,10 +38,7 @@ case class ParseJson(child: Expression) extends UnaryExpression throw QueryExecutionErrors.variantSizeLimitError(VariantUtil.SIZE_LIMIT, "parse_json") case NonFatal(e) => throw QueryExecutionErrors.malformedRecordsDetectedInRecordParsingError( -input.toString, BadRecordException(() => input.asInstanceOf[UTF8String], cause = e)) + input.toString, BadRecordException(() => input, cause = e)) } } - - override protected def withNewChildInternal(newChild: Expression): ParseJson = -copy(child = newChild) } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala index cab61d2b12c2..00708f863e81 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala @@ -17,15 +17,9 @@ package
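From a user's perspective the expression is unchanged; only its evaluation path gains codegen. Usage as documented in the `ExpressionDescription` above, assuming a session where the variant type is available:

```scala
// Parsing a JSON string yields a VARIANT value that prints back as the same JSON,
// per the example in the expression's documentation: {"a":1,"b":0.8}.
spark.sql("""SELECT parse_json('{"a":1,"b":0.8}')""").show()
```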
(spark) branch master updated: [SPARK-47621][PYTHON][DOCS] Refine docstring of `try_sum`, `try_avg`, `avg`, `sum`, `mean`
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b594c4edb383 [SPARK-47621][PYTHON][DOCS] Refine docstring of `try_sum`, `try_avg`, `avg`, `sum`, `mean` b594c4edb383 is described below commit b594c4edb38364139adc3934b14284d9ed9c7d46 Author: Hyukjin Kwon AuthorDate: Thu Mar 28 20:24:16 2024 +0800 [SPARK-47621][PYTHON][DOCS] Refine docstring of `try_sum`, `try_avg`, `avg`, `sum`, `mean` ### What changes were proposed in this pull request? This PR refines docstring of `try_sum`, `try_avg`, `avg`, `sum`, `mean` with more descriptive examples. ### Why are the changes needed? For better API reference documentation. ### Does this PR introduce _any_ user-facing change? Yes, it fixes user-facing documentation. ### How was this patch tested? Manually tested. GitHub Actions should verify them. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45745 from HyukjinKwon/SPARK-47621. Lead-authored-by: Hyukjin Kwon Co-authored-by: Hyukjin Kwon Signed-off-by: Ruifeng Zheng --- python/pyspark/sql/functions/builtin.py | 149 1 file changed, 130 insertions(+), 19 deletions(-) diff --git a/python/pyspark/sql/functions/builtin.py b/python/pyspark/sql/functions/builtin.py index 59167ad9e736..386d28cca0c0 100644 --- a/python/pyspark/sql/functions/builtin.py +++ b/python/pyspark/sql/functions/builtin.py @@ -528,15 +528,45 @@ def try_avg(col: "ColumnOrName") -> Column: Examples +Example 1: Calculating the average age + >>> import pyspark.sql.functions as sf ->>> spark.createDataFrame( -... [(1982, 15), (1990, 2)], ["birth", "age"] -... ).select(sf.try_avg("age")).show() +>>> df = spark.createDataFrame([(1982, 15), (1990, 2)], ["birth", "age"]) +>>> df.select(sf.try_avg("age")).show() ++ |try_avg(age)| ++ | 8.5| ++ + +Example 2: Calculating the average age with None + +>>> import pyspark.sql.functions as sf +>>> df = spark.createDataFrame([(1982, None), (1990, 2), (2000, 4)], ["birth", "age"]) +>>> df.select(sf.try_avg("age")).show() +++ +|try_avg(age)| +++ +| 3.0| +++ + +Example 3: Overflow results in NULL when ANSI mode is on + +>>> from decimal import Decimal +>>> import pyspark.sql.functions as sf +>>> origin = spark.conf.get("spark.sql.ansi.enabled") +>>> spark.conf.set("spark.sql.ansi.enabled", "true") +>>> try: +... df = spark.createDataFrame( +... [(Decimal("1" * 38),), (Decimal(0),)], "number DECIMAL(38, 0)") +... df.select(sf.try_avg(df.number)).show() +... finally: +... 
spark.conf.set("spark.sql.ansi.enabled", origin) ++---+ +|try_avg(number)| ++---+ +| NULL| ++---+ """ return _invoke_function_over_columns("try_avg", col) @@ -720,13 +750,55 @@ def try_sum(col: "ColumnOrName") -> Column: Examples ->>> import pyspark.sql.functions as sf ->>> spark.range(10).select(sf.try_sum("id")).show() +Example 1: Calculating the sum of values in a column + +>>> from pyspark.sql import functions as sf +>>> df = spark.range(10) +>>> df.select(sf.try_sum(df["id"])).show() +---+ |try_sum(id)| +---+ | 45| +---+ + +Example 2: Using a plus expression together to calculate the sum + +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([(1, 2), (3, 4)], ["A", "B"]) +>>> df.select(sf.try_sum(sf.col("A") + sf.col("B"))).show() +++ +|try_sum((A + B))| +++ +| 10| +++ + +Example 3: Calculating the summation of ages with None + +>>> import pyspark.sql.functions as sf +>>> df = spark.createDataFrame([(1982, None), (1990, 2), (2000, 4)], ["birth", "age"]) +>>> df.select(sf.try_sum("age")).show() +++ +|try_sum(age)| +++ +| 6| +++ + +Example 4: Overflow results in NULL when ANSI mode is on + +>>> from decimal import Decimal +>>> import pyspark.sql.functions as sf +>>> origin = spark.conf.get("spark.sql.ansi.enabled") +>>> spark.conf.set("spark.sql.ansi.enabled", "true") +>>> try: +... df = spark.createDataFrame([(Decimal("1" * 38),)] * 10, "number DECIMAL(38, 0)") +... df.select(sf.try_sum(df.number)).show() +... finally: +... spark.conf.set("spark.sql.ansi.enabled",
(spark) branch master updated: [MINOR][PYTHON][TESTS] Remove redundant parity tests
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8a6e5bb7a292 [MINOR][PYTHON][TESTS] Remove redundant parity tests 8a6e5bb7a292 is described below

commit 8a6e5bb7a2929ded80cce33f117a1c693b4ca531
Author: Ruifeng Zheng
AuthorDate: Thu Mar 28 20:05:36 2024 +0800

[MINOR][PYTHON][TESTS] Remove redundant parity tests

### What changes were proposed in this pull request?

Remove redundant parity tests

### Why are the changes needed?

code clean up

checked all parity tests, no other similar cases

### Does this PR introduce _any_ user-facing change?

no, test-only

### How was this patch tested?

ci

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #45746 from zhengruifeng/parity_test_cleanup.

Authored-by: Ruifeng Zheng
Signed-off-by: Ruifeng Zheng
---
 python/pyspark/sql/tests/connect/test_parity_dataframe.py | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/python/pyspark/sql/tests/connect/test_parity_dataframe.py b/python/pyspark/sql/tests/connect/test_parity_dataframe.py index 9293dd71c60d..343f485553a9 100644 --- a/python/pyspark/sql/tests/connect/test_parity_dataframe.py +++ b/python/pyspark/sql/tests/connect/test_parity_dataframe.py @@ -26,17 +26,10 @@ class DataFrameParityTests(DataFrameTestsMixin, ReusedConnectTestCase): df = self.spark.createDataFrame(data=[{"foo": "bar"}, {"foo": "baz"}]) super().check_help_command(df) -# Spark Connect throws `IllegalArgumentException` when calling `collect` instead of `sample`. -def test_sample(self): -super().test_sample() @unittest.skip("Spark Connect does not support RDD but the tests depend on them.") def test_toDF_with_schema_string(self): super().test_toDF_with_schema_string() -def test_toDF_with_string(self): -super().test_toDF_with_string() if __name__ == "__main__": import unittest
(spark) branch master updated: [SPARK-47614][CORE][DOC] Update some outdated comments about `JavaModuleOptions`
This is an automated email from the ASF dual-hosted git repository. yao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 299a6597950b [SPARK-47614][CORE][DOC] Update some outdated comments about `JavaModuleOptions` 299a6597950b is described below

commit 299a6597950b33382d2eab274b95ba8dbbef1d53
Author: panbingkun
AuthorDate: Thu Mar 28 19:28:27 2024 +0800

[SPARK-47614][CORE][DOC] Update some outdated comments about `JavaModuleOptions`

### What changes were proposed in this pull request?

This PR aims to update some outdated comments about `JavaModuleOptions`.

### Why are the changes needed?

As more and more JVM runtime options are added to the class `JavaModuleOptions`, some doc comments about `JavaModuleOptions` have become outdated. Let's update them.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45735 from panbingkun/SPARK-47614.

Authored-by: panbingkun
Signed-off-by: Kent Yao
---
 core/src/main/scala/org/apache/spark/SparkContext.scala | 2 +-
 .../main/java/org/apache/spark/launcher/JavaModuleOptions.java| 8 +++-
 .../java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java | 2 +-
 3 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala index f8f0107ed139..7595488cecee 100644 --- a/core/src/main/scala/org/apache/spark/SparkContext.scala +++ b/core/src/main/scala/org/apache/spark/SparkContext.scala @@ -3291,7 +3291,7 @@ object SparkContext extends Logging { } /** - * SPARK-36796: This is a helper function to supplement `--add-opens` options to + * SPARK-36796: This is a helper function to supplement some JVM runtime options to * `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions`. */ private def supplementJavaModuleOptions(conf: SparkConf): Unit = {

diff --git a/launcher/src/main/java/org/apache/spark/launcher/JavaModuleOptions.java b/launcher/src/main/java/org/apache/spark/launcher/JavaModuleOptions.java index 3a8fa6c42d47..dc5840185d62 100644 --- a/launcher/src/main/java/org/apache/spark/launcher/JavaModuleOptions.java +++ b/launcher/src/main/java/org/apache/spark/launcher/JavaModuleOptions.java @@ -18,7 +18,7 @@ package org.apache.spark.launcher; /** - * This helper class is used to place the all `--add-opens` options + * This helper class is used to place some JVM runtime options(eg: `--add-opens`) * required by Spark when using Java 17. `DEFAULT_MODULE_OPTIONS` has added * `-XX:+IgnoreUnrecognizedVMOptions` to be robust. * @@ -46,16 +46,14 @@ public class JavaModuleOptions { "-Dio.netty.tryReflectionSetAccessible=true"}; /** - * Returns the default Java options related to `--add-opens' and - * `-XX:+IgnoreUnrecognizedVMOptions` used by Spark. + * Returns the default JVM runtime options used by Spark. */ public static String defaultModuleOptions() { return String.join(" ", DEFAULT_MODULE_OPTIONS); } /** - * Returns the default Java option array related to `--add-opens' and - * `-XX:+IgnoreUnrecognizedVMOptions` used by Spark. + * Returns the default JVM runtime option array used by Spark.
*/ public static String[] defaultModuleOptionArray() { return DEFAULT_MODULE_OPTIONS; diff --git a/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java b/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java index d884f7e474c0..9b53711ebaea 100644 --- a/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java +++ b/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java @@ -309,7 +309,7 @@ class SparkSubmitCommandBuilder extends AbstractCommandBuilder { config.get(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH)); } -// SPARK-36796: Always add default `--add-opens` to submit command +// SPARK-36796: Always add some JVM runtime default options to submit command addOptionString(cmd, JavaModuleOptions.defaultModuleOptions()); addOptionString(cmd, "-Dderby.connection.requireAuthentication=false"); cmd.add("org.apache.spark.deploy.SparkSubmit");