(spark) branch master updated: [SPARK-47574][INFRA] Introduce Structured Logging Framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 874d033fc61b [SPARK-47574][INFRA] Introduce Structured Logging Framework 874d033fc61b is described below

commit 874d033fc61becb5679db70c804592a0f9cc37ed
Author: Gengliang Wang
AuthorDate: Thu Mar 28 22:58:51 2024 -0700

[SPARK-47574][INFRA] Introduce Structured Logging Framework

### What changes were proposed in this pull request?

Introduce the Structured Logging Framework as per [SPIP: Structured Logging Framework for Apache Spark](https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing).

* The default logging output format will be JSON lines. For example:
```
{
  "ts": "2023-03-12T12:02:46.661-0700",
  "level": "ERROR",
  "msg": "Cannot determine whether executor 289 is alive or not",
  "context": {
    "executor_id": "289"
  },
  "exception": {
    "class": "org.apache.spark.SparkException",
    "msg": "Exception thrown in awaitResult",
    "stackTrace": "..."
  },
  "source": "BlockManagerMasterEndpoint"
}
```
* Introduce a new configuration `spark.log.structuredLogging.enabled` to set the default log4j configuration. It is true by default. Users can disable it to get plain-text log output.
* The change will start with the `logError` method. Example change on the API, from
```
logError(s"Cannot determine whether executor $executorId is alive or not.", e)
```
to
```
logError(log"Cannot determine whether executor ${MDC(EXECUTOR_ID, executorId)} is alive or not.", e)
```

### Why are the changes needed?

To enhance Apache Spark's logging system by implementing structured logging. This transition will change the format of the default log output from plain text to JSON lines, making it more analyzable.

### Does this PR introduce _any_ user-facing change?

Yes, the default log output format will be JSON lines instead of plain text. Users can restore the plain-text output by disabling the configuration `spark.log.structuredLogging.enabled`. If a user has a customized log4j configuration, there are no changes in the log output.

### How was this patch tested?

New unit tests.

### Was this patch authored or co-authored using generative AI tooling?

Yes, some of the code comments are from GitHub Copilot.

Closes #45729 from gengliangwang/LogInterpolator.
Authored-by: Gengliang Wang
Signed-off-by: Gengliang Wang
---
 common/utils/pom.xml | 4 +
 .../resources/org/apache/spark/SparkLayout.json| 38
 .../org/apache/spark/log4j2-defaults.properties| 4 +-
 ...s => log4j2-pattern-layout-defaults.properties} | 0
 .../scala/org/apache/spark/internal/LogKey.scala | 25 +
 .../scala/org/apache/spark/internal/Logging.scala | 105 -
 common/utils/src/test/resources/log4j2.properties | 50 ++
 .../apache/spark/util/PatternLoggingSuite.scala| 58
 .../apache/spark/util/StructuredLoggingSuite.scala | 83
 .../org/apache/spark/deploy/SparkSubmit.scala | 5 +
 .../org/apache/spark/internal/config/package.scala | 10 ++
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 1 +
 docs/core-migration-guide.md | 2 +
 pom.xml| 5 +
 14 files changed, 386 insertions(+), 4 deletions(-)

diff --git a/common/utils/pom.xml b/common/utils/pom.xml index d360e041dd64..1dbf2a769fff 100644 --- a/common/utils/pom.xml +++ b/common/utils/pom.xml @@ -98,6 +98,10 @@ org.apache.logging.log4j log4j-1.2-api + + org.apache.logging.log4j + log4j-layout-template-json + target/scala-${scala.binary.version}/classes

diff --git a/common/utils/src/main/resources/org/apache/spark/SparkLayout.json b/common/utils/src/main/resources/org/apache/spark/SparkLayout.json new file mode 100644 index ..b0d8ea27ffbc --- /dev/null +++ b/common/utils/src/main/resources/org/apache/spark/SparkLayout.json @@ -0,0 +1,38 @@ +{ + "ts": { +"$resolver": "timestamp" + }, + "level": { +"$resolver": "level", +"field": "name" + }, + "msg": { +"$resolver": "message", +"stringified": true + }, + "context": { +"$resolver": "mdc" + }, + "exception": { +"class": { + "$resolver": "exception", + "field": "className" +}, +"msg": { + "$resolver": "exception", + "field": "message", + "stringified": true +}, +"stacktrace": { + "$resolver": "exception", + "field": "stackTrace", + "stackTrace": { +"stringified":
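To make the new API concrete, here is a minimal sketch of a call site after this change, mirroring the example from the commit message above. The class and method are hypothetical, and the import paths are inferred from the files listed in this patch (`LogKey.scala`, `Logging.scala`), so treat them as approximate:

```scala
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKey.EXECUTOR_ID

// Hypothetical call site: a class mixing in Logging attaches structured context
// to a message by wrapping values in MDC(key, value) inside the log"..." interpolator.
class ExecutorMonitor extends Logging {
  def reportUnknownState(executorId: String, e: Throwable): Unit = {
    // Rendered as plain text in "msg" and, with structured logging enabled,
    // also emitted under "context": {"executor_id": "..."} as in the JSON example above.
    logError(log"Cannot determine whether executor ${MDC(EXECUTOR_ID, executorId)} is alive or not.", e)
  }
}
```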
(spark) branch master updated: [SPARK-47629][INFRA] Add `common/variant` and `connector/kinesis-asl` to maven daily test module list
This is an automated email from the ASF dual-hosted git repository. yangjie01 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a8b247e9a50a [SPARK-47629][INFRA] Add `common/variant` and `connector/kinesis-asl` to maven daily test module list a8b247e9a50a is described below

commit a8b247e9a50ae0450360e76bc69b2c6cdf5ea6f8
Author: yangjie01
AuthorDate: Fri Mar 29 13:26:40 2024 +0800

[SPARK-47629][INFRA] Add `common/variant` and `connector/kinesis-asl` to maven daily test module list

### What changes were proposed in this pull request?

This PR adds `common/variant` and `connector/kinesis-asl` to the Maven daily test module list.

### Why are the changes needed?

To synchronize the modules tested in the Maven daily test with the current module list.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Monitor GA after merge

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45754 from LuciferYang/SPARK-47629.

Authored-by: yangjie01
Signed-off-by: yangjie01
---
 .github/workflows/maven_test.yml | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/.github/workflows/maven_test.yml b/.github/workflows/maven_test.yml index 34fa9a8b7768..b01f08a23e47 100644 --- a/.github/workflows/maven_test.yml +++ b/.github/workflows/maven_test.yml @@ -62,7 +62,7 @@ jobs: - hive2.3 modules: - >- - core,launcher,common#unsafe,common#kvstore,common#network-common,common#network-shuffle,common#sketch,common#utils + core,launcher,common#unsafe,common#kvstore,common#network-common,common#network-shuffle,common#sketch,common#utils,common#variant - >- graphx,streaming,hadoop-cloud - >- - >- repl,sql#hive-thriftserver - >- - connector#kafka-0-10,connector#kafka-0-10-sql,connector#kafka-0-10-token-provider,connector#spark-ganglia-lgpl,connector#protobuf,connector#avro + connector#kafka-0-10,connector#kafka-0-10-sql,connector#kafka-0-10-token-provider,connector#spark-ganglia-lgpl,connector#protobuf,connector#avro,connector#kinesis-asl - >- sql#api,sql#catalyst,resource-managers#yarn,resource-managers#kubernetes#core # Here, we split Hive and SQL tests into some of slow ones and the rest of them.
@@ -188,20 +188,21 @@ jobs: export MAVEN_OPTS="-Xss64m -Xmx4g -Xms4g -XX:ReservedCodeCacheSize=128m -Dorg.slf4j.simpleLogger.defaultLogLevel=WARN" export MAVEN_CLI_OPTS="--no-transfer-progress" export JAVA_VERSION=${{ matrix.java }} + export ENABLE_KINESIS_TESTS=0 # Replace with the real module name, for example, connector#kafka-0-10 -> connector/kafka-0-10 export TEST_MODULES=`echo "$MODULES_TO_TEST" | sed -e "s%#%/%g"` - ./build/mvn $MAVEN_CLI_OPTS -DskipTests -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pspark-ganglia-lgpl -Djava.version=${JAVA_VERSION/-ea} clean install + ./build/mvn $MAVEN_CLI_OPTS -DskipTests -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Djava.version=${JAVA_VERSION/-ea} clean install if [[ "$INCLUDED_TAGS" != "" ]]; then -./build/mvn $MAVEN_CLI_OPTS -pl "$TEST_MODULES" -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pspark-ganglia-lgpl -Djava.version=${JAVA_VERSION/-ea} -Dtest.include.tags="$INCLUDED_TAGS" test -fae +./build/mvn $MAVEN_CLI_OPTS -pl "$TEST_MODULES" -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Djava.version=${JAVA_VERSION/-ea} -Dtest.include.tags="$INCLUDED_TAGS" test -fae elif [[ "$MODULES_TO_TEST" == "connect" ]]; then ./build/mvn $MAVEN_CLI_OPTS -Dtest.exclude.tags="$EXCLUDED_TAGS" -Djava.version=${JAVA_VERSION/-ea} -pl connector/connect/client/jvm,connector/connect/common,connector/connect/server test -fae elif [[ "$EXCLUDED_TAGS" != "" ]]; then -./build/mvn $MAVEN_CLI_OPTS -pl "$TEST_MODULES" -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pspark-ganglia-lgpl -Djava.version=${JAVA_VERSION/-ea} -Dtest.exclude.tags="$EXCLUDED_TAGS" test -fae +./build/mvn $MAVEN_CLI_OPTS -pl "$TEST_MODULES" -Pyarn -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Djava.version=${JAVA_VERSION/-ea} -Dtest.exclude.tags="$EXCLUDED_TAGS" test -fae elif [[ "$MODULES_TO_TEST" == *"sql#hive-thriftserver"* ]]; then # To avoid a compilation loop, for the `sql/hive-thriftserver` module,
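For readers puzzling over the `#` characters in the module lists above: they stand in for `/` in nested module paths and are rewritten before being handed to Maven's `-pl` flag, via the `sed -e "s%#%/%g"` shown in the workflow script. A tiny Scala equivalent of that mapping, with illustrative values:

```scala
// "connector#kafka-0-10" -> "connector/kafka-0-10", per the workflow comment above.
val modulesToTest = "common#utils,common#variant,connector#kinesis-asl"
val testModules = modulesToTest.replace('#', '/')
assert(testModules == "common/utils,common/variant,connector/kinesis-asl")
```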
(spark) branch master updated: [SPARK-47568][SS] Fix race condition between maintenance thread and load/commit for snapshot files
This is an automated email from the ASF dual-hosted git repository. kabhwan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 0b844e52b35b [SPARK-47568][SS] Fix race condition between maintenance thread and load/commit for snapshot files 0b844e52b35b is described below

commit 0b844e52b35b0491717ba9f0ae8fe2e0cf45e88d
Author: Bhuwan Sahni
AuthorDate: Fri Mar 29 13:23:15 2024 +0900

[SPARK-47568][SS] Fix race condition between maintenance thread and load/commit for snapshot files

### What changes were proposed in this pull request?

This PR fixes a race condition between the maintenance thread and the task thread when change-log checkpointing is enabled, and ensures all snapshots are valid.

1. The maintenance thread currently relies on the class variable `lastSnapshot` to find the latest checkpoint and uploads it to DFS. This checkpoint can be modified at commit time by the task thread if a new snapshot is created.
2. The task thread was not resetting `lastSnapshot` at load time, which can result in newer snapshots (if an old version is loaded) being considered valid and uploaded to DFS. This results in `VersionIdMismatch` errors.

### Why are the changes needed?

These are logical bugs which can cause `VersionIdMismatch` errors, forcing the user to discard the snapshot and restart the query.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit test cases.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45724 from sahnib/rocks-db-fix.

Authored-by: Bhuwan Sahni
Signed-off-by: Jungtaek Lim
---
 .../sql/execution/streaming/state/RocksDB.scala| 66 ++
 .../streaming/state/RocksDBFileManager.scala | 3 +-
 .../execution/streaming/state/RocksDBSuite.scala | 37
 3 files changed, 82 insertions(+), 24 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala index 1817104a5c22..fcefc1666f3a 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala @@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.streaming.state import java.io.File import java.util.Locale +import java.util.concurrent.TimeUnit import javax.annotation.concurrent.GuardedBy import scala.collection.{mutable, Map} @@ -49,6 +50,7 @@ case object RollbackStore extends RocksDBOpType("rollback_store") case object CloseStore extends RocksDBOpType("close_store") case object ReportStoreMetrics extends RocksDBOpType("report_store_metrics") case object StoreTaskCompletionListener extends RocksDBOpType("store_task_completion_listener") +case object StoreMaintenance extends RocksDBOpType("store_maintenance") /** * Class representing a RocksDB instance that checkpoints version of data to DFS.
@@ -184,19 +186,23 @@ class RocksDB( loadedVersion = latestSnapshotVersion // reset last snapshot version -lastSnapshotVersion = 0L +if (lastSnapshotVersion > latestSnapshotVersion) { + // discard any newer snapshots + lastSnapshotVersion = 0L + latestSnapshot = None +} openDB() numKeysOnWritingVersion = if (!conf.trackTotalNumberOfRows) { - // we don't track the total number of rows - discard the number being track - -1L -} else if (metadata.numKeys < 0) { - // we track the total number of rows, but the snapshot doesn't have tracking number - // need to count keys now - countKeys() -} else { - metadata.numKeys -} +// we don't track the total number of rows - discard the number being track +-1L + } else if (metadata.numKeys < 0) { +// we track the total number of rows, but the snapshot doesn't have tracking number +// need to count keys now +countKeys() + } else { +metadata.numKeys + } if (loadedVersion != version) replayChangelog(version) // After changelog replay the numKeysOnWritingVersion will be updated to // the correct number of keys in the loaded version. @@ -571,16 +577,14 @@ class RocksDB( // background operations. val cp = Checkpoint.create(db) cp.createCheckpoint(checkpointDir.toString) - synchronized { -// if changelog checkpointing is disabled, the snapshot is uploaded synchronously -// inside the uploadSnapshot() called below. -// If
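The essence of the fix is that the snapshot hand-off between the two threads must be atomic. A minimal sketch of that pattern follows; this is not the actual `RocksDB.scala` code, and the names here are hypothetical:

```scala
// The task thread publishes a snapshot under a lock; the maintenance thread takes
// ownership of it in a single atomic step, so a snapshot can never be replaced or
// invalidated while it is being read for upload.
class SnapshotHolder[T](versionOf: T => Long) {
  private var latestSnapshot: Option[T] = None

  // Task thread, commit time: publish the snapshot just created.
  def publish(snapshot: T): Unit = synchronized {
    latestSnapshot = Some(snapshot)
  }

  // Task thread, load time: discard any snapshot newer than the version being loaded,
  // so stale "future" snapshots are never uploaded (the VersionIdMismatch scenario).
  def discardNewerThan(loadedVersion: Long): Unit = synchronized {
    if (latestSnapshot.exists(versionOf(_) > loadedVersion)) {
      latestSnapshot = None
    }
  }

  // Maintenance thread: take the snapshot, if any, and clear it atomically.
  def takeForUpload(): Option[T] = synchronized {
    val snapshot = latestSnapshot
    latestSnapshot = None
    snapshot
  }
}
```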
(spark) branch master updated (a57f94b03e30 -> b3146721552f)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from a57f94b03e30 [SPARK-47638][PS][CONNECT] Skip column name validation in PS
add b3146721552f [SPARK-47511][SQL] Canonicalize With expressions by re-assigning IDs

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/expressions/With.scala | 81 ++--
 .../sql/catalyst/expressions/nullExpressions.scala | 15 ++-
 .../catalyst/optimizer/RewriteWithExpression.scala | 10 +-
 .../org/apache/spark/sql/internal/SQLConf.scala| 8 ++
 .../catalyst/expressions/CanonicalizeSuite.scala | 105 -
 5 files changed, 205 insertions(+), 14 deletions(-)
(spark) branch master updated: [SPARK-47638][PS][CONNECT] Skip column name validation in PS
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a57f94b03e30 [SPARK-47638][PS][CONNECT] Skip column name validation in PS a57f94b03e30 is described below

commit a57f94b03e302e97c1d650a9f64596a82506df2f
Author: Ruifeng Zheng
AuthorDate: Fri Mar 29 10:51:56 2024 +0800

[SPARK-47638][PS][CONNECT] Skip column name validation in PS

### What changes were proposed in this pull request?

Skip column name validation in PS.

### Why are the changes needed?

`scol_for` is an internal method, not exposed to users, so this eager validation seems unnecessary.

When a bad column name is used:
before: `scol_for` immediately fails
after: silent at the `scol_for` call; fails later at analysis (e.g. dtypes/schema) or execution

### Does this PR introduce _any_ user-facing change?

no, test only

### How was this patch tested?

ci

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #45752 from zhengruifeng/test_avoid_validation.

Authored-by: Ruifeng Zheng
Signed-off-by: Ruifeng Zheng
---
 python/pyspark/pandas/utils.py | 5 -
 python/pyspark/sql/connect/dataframe.py | 24 +++-
 2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/python/pyspark/pandas/utils.py b/python/pyspark/pandas/utils.py index 57c1ddbe6ae3..0fe2944bcabe 100644 --- a/python/pyspark/pandas/utils.py +++ b/python/pyspark/pandas/utils.py @@ -608,7 +608,10 @@ def lazy_property(fn: Callable[[Any], Any]) -> property: def scol_for(sdf: PySparkDataFrame, column_name: str) -> Column: """Return Spark Column for the given column name.""" -return sdf["`{}`".format(column_name)] +if is_remote(): +return sdf._col("`{}`".format(column_name)) # type: ignore[operator] +else: +return sdf["`{}`".format(column_name)] def column_labels_level(column_labels: List[Label]) -> int:

diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py index 672ac8b9c25c..b2d0cc5fedca 100644 --- a/python/pyspark/sql/connect/dataframe.py +++ b/python/pyspark/sql/connect/dataframe.py @@ -645,7 +645,7 @@ class DataFrame: if isinstance(col, Column): return col else: -return Column(ColumnReference(col, df._plan._plan_id)) +return df._col(col) return DataFrame( plan.AsOfJoin( @@ -1715,12 +1715,7 @@ class DataFrame: error_class="ATTRIBUTE_NOT_SUPPORTED", message_parameters={"attr_name": name} ) -return Column( -ColumnReference( -unparsed_identifier=name, -plan_id=self._plan._plan_id, -) -) +return self._col(name) __getattr__.__doc__ = PySparkDataFrame.__getattr__.__doc__ @@ -1756,12 +1751,7 @@ class DataFrame: if item not in self.columns: self.select(item).isLocal() -return Column( -ColumnReference( -unparsed_identifier=item, -plan_id=self._plan._plan_id, -) -) +return self._col(item) elif isinstance(item, Column): return self.filter(item) elif isinstance(item, (list, tuple)): @@ -1774,6 +1764,14 @@ class DataFrame: message_parameters={"arg_name": "item", "arg_type": type(item).__name__}, ) +def _col(self, name: str) -> Column: +return Column( +ColumnReference( +unparsed_identifier=name, +plan_id=self._plan._plan_id, +) +) + def __dir__(self) -> List[str]: attrs = set(super().__dir__()) attrs.update(self.columns)
(spark) branch master updated (fde7a7f3d06b -> 41cbd56ba725)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from fde7a7f3d06b [SPARK-47546][SQL][FOLLOWUP] Improve validation when reading Variant from Parquet using non-vectorized reader
add 41cbd56ba725 [SPARK-47525][SQL] Support subquery correlation joining on map attributes

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/expressions/DynamicPruning.scala | 7 +
 .../FunctionTableSubqueryArgumentExpression.scala | 2 +
 .../spark/sql/catalyst/expressions/subquery.scala | 9 +
 .../spark/sql/catalyst/optimizer/Optimizer.scala | 1 +
 .../PullOutNestedDataOuterRefExpressions.scala | 136
 .../org/apache/spark/sql/internal/SQLConf.scala| 9 +
 .../subquery/subquery-nested-data.sql.out | 350 +
 .../inputs/subquery/subquery-nested-data.sql | 50 +++
 .../results/subquery/subquery-nested-data.sql.out | 314 ++
 .../scala/org/apache/spark/sql/SubquerySuite.scala | 49 +--
 10 files changed, 909 insertions(+), 18 deletions(-)
 create mode 100644 sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PullOutNestedDataOuterRefExpressions.scala
 create mode 100644 sql/core/src/test/resources/sql-tests/analyzer-results/subquery/subquery-nested-data.sql.out
 create mode 100644 sql/core/src/test/resources/sql-tests/inputs/subquery/subquery-nested-data.sql
 create mode 100644 sql/core/src/test/resources/sql-tests/results/subquery/subquery-nested-data.sql.out
(spark) branch master updated (6cf5f9508f38 -> fde7a7f3d06b)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 6cf5f9508f38 [SPARK-47623][PYTHON][CONNECT][TESTS] Use `QuietTest` in parity tests
add fde7a7f3d06b [SPARK-47546][SQL][FOLLOWUP] Improve validation when reading Variant from Parquet using non-vectorized reader

No new revisions were added by this update.

Summary of changes:
 .../datasources/parquet/ParquetRowConverter.scala | 38 +-
 .../scala/org/apache/spark/sql/VariantSuite.scala | 21 +++-
 2 files changed, 43 insertions(+), 16 deletions(-)
(spark) branch master updated (009b50e87c05 -> 6cf5f9508f38)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 009b50e87c05 [SPARK-47631][SQL] Remove unused `SQLConf.parquetOutputCommitterClass` method
add 6cf5f9508f38 [SPARK-47623][PYTHON][CONNECT][TESTS] Use `QuietTest` in parity tests

No new revisions were added by this update.

Summary of changes:
 .../pyspark/sql/tests/connect/test_parity_arrow.py | 9 -
 .../connect/test_parity_arrow_cogrouped_map.py | 6 ++
 .../tests/connect/test_parity_arrow_grouped_map.py | 6 ++
 .../sql/tests/connect/test_parity_arrow_map.py | 3 +--
 .../sql/tests/connect/test_parity_collection.py| 3 ---
 .../connect/test_parity_pandas_cogrouped_map.py| 13 ++--
 .../connect/test_parity_pandas_grouped_map.py | 23 --
 .../sql/tests/connect/test_parity_pandas_map.py| 20 +--
 .../connect/test_parity_pandas_udf_grouped_agg.py | 5 +
 .../tests/connect/test_parity_pandas_udf_scalar.py | 17 +---
 .../tests/connect/test_parity_pandas_udf_window.py | 2 +-
 .../pyspark/sql/tests/connect/test_parity_udf.py | 12 ---
 .../sql/tests/pandas/test_pandas_cogrouped_map.py | 15 +++---
 .../sql/tests/pandas/test_pandas_grouped_map.py| 21 ++--
 python/pyspark/sql/tests/pandas/test_pandas_map.py | 16 +++
 python/pyspark/sql/tests/pandas/test_pandas_udf.py | 3 +--
 .../tests/pandas/test_pandas_udf_grouped_agg.py| 6 +++---
 .../sql/tests/pandas/test_pandas_udf_scalar.py | 18 -
 .../sql/tests/pandas/test_pandas_udf_window.py | 4 ++--
 python/pyspark/sql/tests/test_arrow.py | 13 ++--
 .../pyspark/sql/tests/test_arrow_cogrouped_map.py | 15 +-
 python/pyspark/sql/tests/test_arrow_grouped_map.py | 15 +-
 python/pyspark/sql/tests/test_arrow_map.py | 3 +--
 python/pyspark/sql/tests/test_collection.py| 5 ++---
 python/pyspark/sql/tests/test_creation.py | 3 +--
 python/pyspark/sql/tests/test_udf.py | 10 +-
 python/pyspark/testing/connectutils.py | 6 ++
 python/pyspark/testing/utils.py| 5 +
 28 files changed, 91 insertions(+), 186 deletions(-)
(spark) branch master updated (7f1eadcd9cbb -> 009b50e87c05)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 7f1eadcd9cbb [SPARK-47366][PYTHON] Add pyspark and dataframe parse_json aliases
add 009b50e87c05 [SPARK-47631][SQL] Remove unused `SQLConf.parquetOutputCommitterClass` method

No new revisions were added by this update.

Summary of changes:
 sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 2 --
 1 file changed, 2 deletions(-)
(spark) branch master updated (ca0001345d0b -> 7f1eadcd9cbb)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from ca0001345d0b [SPARK-47635][K8S] Use Java `21` instead of `21-jre` image in K8s Dockerfile
add 7f1eadcd9cbb [SPARK-47366][PYTHON] Add pyspark and dataframe parse_json aliases

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/sql/functions.scala | 11
 .../source/reference/pyspark.sql/functions.rst | 1 +
 python/pyspark/sql/connect/functions/builtin.py| 7 ++
 python/pyspark/sql/functions/builtin.py| 29 ++
 python/pyspark/sql/tests/test_functions.py | 10
 .../scala/org/apache/spark/sql/functions.scala | 10
 .../scala/org/apache/spark/sql/VariantSuite.scala | 15 ++-
 7 files changed, 82 insertions(+), 1 deletion(-)
(spark) branch branch-3.5 updated: [SPARK-47636][K8S][3.5] Use Java `17` instead of `17-jre` image in K8s Dockerfile
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new edae8edc3a45 [SPARK-47636][K8S][3.5] Use Java `17` instead of `17-jre` image in K8s Dockerfile edae8edc3a45 is described below

commit edae8edc3a45a860a9402018cb44760266515154
Author: Dongjoon Hyun
AuthorDate: Thu Mar 28 16:34:25 2024 -0700

[SPARK-47636][K8S][3.5] Use Java `17` instead of `17-jre` image in K8s Dockerfile

### What changes were proposed in this pull request?

This PR aims to use Java `17` instead of `17-jre` in the K8s Dockerfile.

### Why are the changes needed?

Since Apache Spark 3.5.0, SPARK-44153 started to use `jmap` like the following. https://github.com/apache/spark/blob/c832e2ac1d04668c77493577662c639785808657/core/src/main/scala/org/apache/spark/util/Utils.scala#L2030

```
$ docker run -it --rm eclipse-temurin:17-jre jmap
/__cacert_entrypoint.sh: line 30: exec: jmap: not found
```

```
$ docker run -it --rm eclipse-temurin:17 jmap | head -n2
Usage: jmap -clstats
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs and manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45762 from dongjoon-hyun/SPARK-47636.

Authored-by: Dongjoon Hyun
Signed-off-by: Dongjoon Hyun
---
 .../kubernetes/docker/src/main/dockerfiles/spark/Dockerfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile index 88304c87a79c..22d8f1550128 100644 --- a/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile +++ b/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile @@ -14,7 +14,7 @@ # See the License for the specific language governing permissions and # limitations under the License. # -ARG java_image_tag=17-jre +ARG java_image_tag=17 FROM eclipse-temurin:${java_image_tag}
(spark) branch master updated (c832e2ac1d04 -> ca0001345d0b)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from c832e2ac1d04 [SPARK-47492][SQL] Widen whitespace rules in lexer
add ca0001345d0b [SPARK-47635][K8S] Use Java `21` instead of `21-jre` image in K8s Dockerfile

No new revisions were added by this update.

Summary of changes:
 .../kubernetes/docker/src/main/dockerfiles/spark/Dockerfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
(spark) branch master updated: [SPARK-47492][SQL] Widen whitespace rules in lexer
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c832e2ac1d04 [SPARK-47492][SQL] Widen whitespace rules in lexer c832e2ac1d04 is described below

commit c832e2ac1d04668c77493577662c639785808657
Author: Serge Rielau
AuthorDate: Thu Mar 28 15:51:32 2024 -0700

[SPARK-47492][SQL] Widen whitespace rules in lexer

### What changes were proposed in this pull request?

In this PR we extend the lexer's understanding of whitespace (what separates tokens) from the ASCII space, carriage return, line feed, and tab to the various Unicode flavors of "space", such as "narrow" and "wide".

### Why are the changes needed?

SQL statements are frequently copy-pasted from various sources. Many of these sources are "rich text" and based on Unicode. When doing so, it is inevitable that non-ASCII whitespace characters are copied. Today this often results in incomprehensible syntax errors: incomprehensible because the error message prints the "bad" whitespace just like an ASCII whitespace, so the user stands little chance of finding the root cause unless they use editor options to highlight non-ASCII whitespace or, by sheer luck, happen to remove it. So in this PR we acknowledge the reality and stop "discriminating" against non-ASCII whitespace.

### Does this PR introduce _any_ user-facing change?

Queries that used to fail with a syntax error now succeed.

### How was this patch tested?

Added a new set of unit tests in SparkSqlParserSuite.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45620 from srielau/SPARK-47492-Widen-whitespace-rules-in-lexer.

Lead-authored-by: Serge Rielau
Co-authored-by: Serge Rielau
Signed-off-by: Gengliang Wang
---
 .../spark/sql/catalyst/parser/SqlBaseLexer.g4 | 2 +-
 .../spark/sql/execution/SparkSqlParserSuite.scala | 80 ++
 2 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 index 7c376e226850..f5565f0a63fb 100644 --- a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 +++ b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 @@ -554,7 +554,7 @@ BRACKETED_COMMENT ; WS -: [ \r\n\t]+ -> channel(HIDDEN) +: [ \t\n\f\r\u000B\u00A0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u2028\u202F\u205F\u3000]+ -> channel(HIDDEN) ; // Catch-all for anything we can't recognize.
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala index c3768afa90f1..f60df77b7e9b 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala @@ -800,4 +800,84 @@ class SparkSqlParserSuite extends AnalysisTest with SharedSparkSession { start = 0, stop = 63)) } + + test("verify whitespace handling - standard whitespace") { +parser.parsePlan("SELECT 1") // ASCII space +parser.parsePlan("SELECT\r1") // ASCII carriage return +parser.parsePlan("SELECT\n1") // ASCII line feed +parser.parsePlan("SELECT\t1") // ASCII tab +parser.parsePlan("SELECT\u000B1") // ASCII vertical tab +parser.parsePlan("SELECT\f1") // ASCII form feed + } + + // Need to switch off scala style for Unicode characters + // scalastyle:off + test("verify whitespace handling - Unicode no-break space") { +parser.parsePlan("SELECT\u00A01") // Unicode no-break space + } + + test("verify whitespace handling - Unicode ogham space mark") { +parser.parsePlan("SELECT\u16801") // Unicode ogham space mark + } + + test("verify whitespace handling - Unicode en quad") { +parser.parsePlan("SELECT\u20001") // Unicode en quad + } + + test("verify whitespace handling - Unicode em quad") { +parser.parsePlan("SELECT\u20011") // Unicode em quad + } + + test("verify whitespace handling - Unicode en space") { +parser.parsePlan("SELECT\u20021") // Unicode en space + } + + test("verify whitespace handling - Unicode em space") { +parser.parsePlan("SELECT\u20031") // Unicode em space + } + + test("verify whitespace handling - Unicode three-per-em space") { +parser.parsePlan("SELECT\u20041") // Unicode three-per-em space + } + + test("verify whitespace handling - Unicode four-per-em space") { +parser.parsePlan("SELECT\u20051") // Unicode four-per-em space + } + + test("verify whitespace
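The user-visible effect, assuming a running `spark` session (e.g. in `spark-shell`); the statements mirror the new unit tests:

```scala
// Before this patch the lexer only treated the ASCII characters in [ \r\n\t] as
// token separators, so a pasted no-break space produced a confusing syntax error.
// With the widened WS rule these now parse like ordinary whitespace:
spark.sql("SELECT\u00A01").show() // no-break space (U+00A0)
spark.sql("SELECT\u30001").show() // ideographic space (U+3000)
```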
(spark) branch master updated (1623b2d513d2 -> d8dc0c3e5e8a)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 1623b2d513d2 [SPARK-47630][BUILD] Upgrade `zstd-jni` to 1.5.6-1
add d8dc0c3e5e8a [SPARK-47632][BUILD] Ban `com.amazonaws:aws-java-sdk-bundle` dependency

No new revisions were added by this update.

Summary of changes:
 pom.xml | 1 +
 1 file changed, 1 insertion(+)
(spark) branch master updated (2e4f2b0d307f -> 1623b2d513d2)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 2e4f2b0d307f [SPARK-47475][CORE][K8S] Support `spark.kubernetes.jars.avoidDownloadSchemes` for K8s Cluster Mode
add 1623b2d513d2 [SPARK-47630][BUILD] Upgrade `zstd-jni` to 1.5.6-1

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml | 6 +-
 2 files changed, 6 insertions(+), 2 deletions(-)
(spark) branch master updated: [SPARK-47475][CORE][K8S] Support `spark.kubernetes.jars.avoidDownloadSchemes` for K8s Cluster Mode
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2e4f2b0d307f [SPARK-47475][CORE][K8S] Support `spark.kubernetes.jars.avoidDownloadSchemes` for K8s Cluster Mode 2e4f2b0d307f is described below

commit 2e4f2b0d307fa00121de77f01826c190527ebf3d
Author: jiale_tan
AuthorDate: Thu Mar 28 08:52:27 2024 -0700

[SPARK-47475][CORE][K8S] Support `spark.kubernetes.jars.avoidDownloadSchemes` for K8s Cluster Mode

### What changes were proposed in this pull request?

During spark-submit, for the K8s cluster-mode driver, instead of always downloading jars and serving them to executors, avoid the download if the URL scheme matches `spark.kubernetes.jars.avoidDownloadSchemes` in the configuration.

### Why are the changes needed?

For the K8s cluster-mode driver, `SparkSubmit` downloads all the jars in `spark.jars` to the driver, and the jar URLs in `spark.jars` are then replaced by driver-local paths. Later, when the driver starts the `SparkContext`, it copies all the `spark.jars` to `spark.app.initial.jar.urls`, starts a file server, and replaces the driver-local paths in `spark.app.initial.jar.urls` with file-server URLs. When the executors start, they will download those driver local jars b [...]

When jars are big and the Spark application requests a lot of executors, the executors' massive concurrent download of the jars from the driver causes network saturation. In this case, the executors' jar download will time out, causing executors to be terminated. From the user's point of view, the application is trapped in a loop of massive executor loss and re-provisioning, but never gets as many live executors as requested, which leads to SLA breaches or sometimes failure.

So instead of letting the driver download the jars and then serve them to executors, if we keep the driver from downloading the jars and leave the URLs in `spark.jars` as they were, the executors will download the jars directly from the URLs provided by the user. This avoids the driver download bottleneck mentioned above, especially when the jar URLs use scalable storage schemes like s3 or hdfs.

Meanwhile, there are cases where the jar URLs use schemes less scalable than the driver file server (e.g. http, ftp), or where the jars are small or the executor count is small; the user may still want to fall back to the current solution and use the driver file server to serve the jars. So making the driver's jar downloading and serving optional by scheme (a similar idea to `FORCE_DOWNLOAD_SCHEMES` in YARN) is a good approach for the solution.

### Does this PR introduce _any_ user-facing change?

A configuration `spark.kubernetes.jars.avoidDownloadSchemes` is added.

### How was this patch tested?

- Unit tests added
- Tested with an application running on AWS EKS submitted with a 1GB jar on s3.
  - Before the fix, the application could not scale to 1k live executors.
  - After the fix, the application had no problem scaling beyond 12k live executors.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45715 from leletan/allow_k8s_executor_to_download_remote_jar.
Authored-by: jiale_tan Signed-off-by: Dongjoon Hyun --- .../org/apache/spark/deploy/SparkSubmit.scala | 28 ++- .../org/apache/spark/internal/config/package.scala | 12 +++ .../org/apache/spark/deploy/SparkSubmitSuite.scala | 42 ++ docs/running-on-kubernetes.md | 12 +++ 4 files changed, 86 insertions(+), 8 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala index c8cbedd9ea36..c60fbe537cbd 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala @@ -401,16 +401,23 @@ private[spark] class SparkSubmit extends Logging { // SPARK-33782 : This downloads all the files , jars , archiveFiles and pyfiles to current // working directory // SPARK-43540: add current working directory into driver classpath +// SPARK-47475: make download to driver optional so executors may fetch resource from remote +// url directly to avoid overwhelming driver network when resource is big and executor count +// is high val workingDirectory = "." childClasspath += workingDirectory -def downloadResourcesToCurrentDirectory(uris: String, isArchive: Boolean = false): -String = { +def downloadResourcesToCurrentDirectory( +uris: String, +isArchive: Boolean = false, +
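A sketch of how a user might opt large remote jars out of the driver download path. The configuration key is the one added by this patch; the scheme list and jar URL are illustrative:

```scala
import org.apache.spark.SparkConf

// Hypothetical cluster-mode settings: jars with an s3a:// URL are fetched by each
// executor directly, instead of being routed through the driver's file server,
// avoiding the driver network bottleneck described above.
val conf = new SparkConf()
  .set("spark.kubernetes.jars.avoidDownloadSchemes", "s3a")
  .set("spark.jars", "s3a://my-bucket/big-app-assembly.jar")
```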
(spark) branch master updated: [MINOR][CORE] Replace `get+getOrElse` with `getOrElse` with default value in `StreamingQueryException`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 32e73dd55416 [MINOR][CORE] Replace `get+getOrElse` with `getOrElse` with default value in `StreamingQueryException` 32e73dd55416 is described below

commit 32e73dd55416b3a0a81ea6b6635e6fedde378842
Author: yangjie01
AuthorDate: Thu Mar 28 07:37:01 2024 -0700

[MINOR][CORE] Replace `get+getOrElse` with `getOrElse` with default value in `StreamingQueryException`

### What changes were proposed in this pull request?

This PR replaces `get + getOrElse` with `getOrElse` with a default value in `StreamingQueryException`.

### Why are the changes needed?

Simplify code

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45753 from LuciferYang/minor-getOrElse.

Authored-by: yangjie01
Signed-off-by: Dongjoon Hyun
---
 .../org/apache/spark/sql/streaming/StreamingQueryException.scala| 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/common/utils/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryException.scala b/common/utils/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryException.scala index 800a6dcfda8d..259f4330224c 100644 --- a/common/utils/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryException.scala +++ b/common/utils/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryException.scala @@ -48,11 +48,11 @@ class StreamingQueryException private[sql]( errorClass: String, messageParameters: Map[String, String]) = { this( - messageParameters.get("queryDebugString").getOrElse(""), + messageParameters.getOrElse("queryDebugString", ""), message, cause, - messageParameters.get("startOffset").getOrElse(""), - messageParameters.get("endOffset").getOrElse(""), + messageParameters.getOrElse("startOffset", ""), + messageParameters.getOrElse("endOffset", ""), errorClass, messageParameters) }
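For context, the idiom being replaced, shown on a plain `Map` with illustrative values:

```scala
val params = Map("startOffset" -> "0")

// Before: a lookup followed by a chained default.
val before = params.get("endOffset").getOrElse("")

// After: the single-call equivalent used in this patch.
val after = params.getOrElse("endOffset", "")

assert(before == after) // both yield "" when the key is absent
```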
(spark) branch master updated (8c4d6764674f -> 4b58a631fea9)
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 8c4d6764674f [SPARK-47559][SQL] Codegen Support for variant `parse_json`
add 4b58a631fea9 [SPARK-47628][SQL] Fix Postgres bit array issue 'Cannot cast to boolean'

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/jdbc/PostgresIntegrationSuite.scala | 10 ++
 .../sql/execution/datasources/jdbc/JdbcUtils.scala| 16
 .../org/apache/spark/sql/jdbc/PostgresDialect.scala | 19 ++-
 3 files changed, 36 insertions(+), 9 deletions(-)
(spark) branch master updated: [SPARK-47559][SQL] Codegen Support for variant `parse_json`
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8c4d6764674f [SPARK-47559][SQL] Codegen Support for variant `parse_json` 8c4d6764674f is described below commit 8c4d6764674fcddf30245dfc25ef825eabba0ace Author: panbingkun AuthorDate: Thu Mar 28 20:34:42 2024 +0800 [SPARK-47559][SQL] Codegen Support for variant `parse_json` ### What changes were proposed in this pull request? The PR adds Codegen Support for `parse_json`. ### Why are the changes needed? Improve codegen coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Add new UT. - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45714 from panbingkun/ParseJson_CodeGenerator. Authored-by: panbingkun Signed-off-by: Wenchen Fan --- ...ions.scala => VariantExpressionEvalUtils.scala} | 35 ++ .../expressions/variant/variantExpressions.scala | 34 ++--- .../variant/VariantExpressionEvalUtilsSuite.scala | 125 +++ .../variant/VariantExpressionSuite.scala | 137 + .../apache/spark/sql/VariantEndToEndSuite.scala| 84 + 5 files changed, 229 insertions(+), 186 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala similarity index 56% copy from sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala copy to sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala index cab61d2b12c2..74fae91f98a6 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/VariantExpressionEvalUtils.scala @@ -19,35 +19,17 @@ package org.apache.spark.sql.catalyst.expressions.variant import scala.util.control.NonFatal -import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback import org.apache.spark.sql.catalyst.util.BadRecordException import org.apache.spark.sql.errors.QueryExecutionErrors -import org.apache.spark.sql.types._ import org.apache.spark.types.variant.{VariantBuilder, VariantSizeLimitException, VariantUtil} -import org.apache.spark.unsafe.types._ +import org.apache.spark.unsafe.types.{UTF8String, VariantVal} -// scalastyle:off line.size.limit -@ExpressionDescription( - usage = "_FUNC_(jsonStr) - Parse a JSON string as an Variant value. Throw an exception when the string is not valid JSON value.", - examples = """ -Examples: - > SELECT _FUNC_('{"a":1,"b":0.8}'); - {"a":1,"b":0.8} - """, - since = "4.0.0", - group = "variant_funcs" -) -// scalastyle:on line.size.limit -case class ParseJson(child: Expression) extends UnaryExpression - with NullIntolerant with ExpectsInputTypes with CodegenFallback { - override def inputTypes: Seq[AbstractDataType] = StringType :: Nil - - override def dataType: DataType = VariantType - - override def prettyName: String = "parse_json" +/** + * A utility class for constructing variant expressions. 
+ */ +object VariantExpressionEvalUtils { - protected override def nullSafeEval(input: Any): Any = { + def parseJson(input: UTF8String): VariantVal = { try { val v = VariantBuilder.parseJson(input.toString) new VariantVal(v.getValue, v.getMetadata) @@ -56,10 +38,7 @@ case class ParseJson(child: Expression) extends UnaryExpression throw QueryExecutionErrors.variantSizeLimitError(VariantUtil.SIZE_LIMIT, "parse_json") case NonFatal(e) => throw QueryExecutionErrors.malformedRecordsDetectedInRecordParsingError( -input.toString, BadRecordException(() => input.asInstanceOf[UTF8String], cause = e)) + input.toString, BadRecordException(() => input, cause = e)) } } - - override protected def withNewChildInternal(newChild: Expression): ParseJson = -copy(child = newChild) } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala index cab61d2b12c2..00708f863e81 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala @@ -17,15 +17,9 @@ package
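From a user's perspective the expression is unchanged; only its evaluation path gains codegen. Usage as documented in the `ExpressionDescription` above, assuming a session where the variant type is available:

```scala
// Parsing a JSON string yields a VARIANT value that prints back as the same JSON,
// per the example in the expression's documentation: {"a":1,"b":0.8}.
spark.sql("""SELECT parse_json('{"a":1,"b":0.8}')""").show()
```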
(spark) branch master updated: [SPARK-47621][PYTHON][DOCS] Refine docstring of `try_sum`, `try_avg`, `avg`, `sum`, `mean`
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b594c4edb383 [SPARK-47621][PYTHON][DOCS] Refine docstring of `try_sum`, `try_avg`, `avg`, `sum`, `mean` b594c4edb383 is described below commit b594c4edb38364139adc3934b14284d9ed9c7d46 Author: Hyukjin Kwon AuthorDate: Thu Mar 28 20:24:16 2024 +0800 [SPARK-47621][PYTHON][DOCS] Refine docstring of `try_sum`, `try_avg`, `avg`, `sum`, `mean` ### What changes were proposed in this pull request? This PR refines docstring of `try_sum`, `try_avg`, `avg`, `sum`, `mean` with more descriptive examples. ### Why are the changes needed? For better API reference documentation. ### Does this PR introduce _any_ user-facing change? Yes, it fixes user-facing documentation. ### How was this patch tested? Manually tested. GitHub Actions should verify them. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45745 from HyukjinKwon/SPARK-47621. Lead-authored-by: Hyukjin Kwon Co-authored-by: Hyukjin Kwon Signed-off-by: Ruifeng Zheng --- python/pyspark/sql/functions/builtin.py | 149 1 file changed, 130 insertions(+), 19 deletions(-) diff --git a/python/pyspark/sql/functions/builtin.py b/python/pyspark/sql/functions/builtin.py index 59167ad9e736..386d28cca0c0 100644 --- a/python/pyspark/sql/functions/builtin.py +++ b/python/pyspark/sql/functions/builtin.py @@ -528,15 +528,45 @@ def try_avg(col: "ColumnOrName") -> Column: Examples +Example 1: Calculating the average age + >>> import pyspark.sql.functions as sf ->>> spark.createDataFrame( -... [(1982, 15), (1990, 2)], ["birth", "age"] -... ).select(sf.try_avg("age")).show() +>>> df = spark.createDataFrame([(1982, 15), (1990, 2)], ["birth", "age"]) +>>> df.select(sf.try_avg("age")).show() ++ |try_avg(age)| ++ | 8.5| ++ + +Example 2: Calculating the average age with None + +>>> import pyspark.sql.functions as sf +>>> df = spark.createDataFrame([(1982, None), (1990, 2), (2000, 4)], ["birth", "age"]) +>>> df.select(sf.try_avg("age")).show() +++ +|try_avg(age)| +++ +| 3.0| +++ + +Example 3: Overflow results in NULL when ANSI mode is on + +>>> from decimal import Decimal +>>> import pyspark.sql.functions as sf +>>> origin = spark.conf.get("spark.sql.ansi.enabled") +>>> spark.conf.set("spark.sql.ansi.enabled", "true") +>>> try: +... df = spark.createDataFrame( +... [(Decimal("1" * 38),), (Decimal(0),)], "number DECIMAL(38, 0)") +... df.select(sf.try_avg(df.number)).show() +... finally: +... 
spark.conf.set("spark.sql.ansi.enabled", origin) ++---+ +|try_avg(number)| ++---+ +| NULL| ++---+ """ return _invoke_function_over_columns("try_avg", col) @@ -720,13 +750,55 @@ def try_sum(col: "ColumnOrName") -> Column: Examples ->>> import pyspark.sql.functions as sf ->>> spark.range(10).select(sf.try_sum("id")).show() +Example 1: Calculating the sum of values in a column + +>>> from pyspark.sql import functions as sf +>>> df = spark.range(10) +>>> df.select(sf.try_sum(df["id"])).show() +---+ |try_sum(id)| +---+ | 45| +---+ + +Example 2: Using a plus expression together to calculate the sum + +>>> from pyspark.sql import functions as sf +>>> df = spark.createDataFrame([(1, 2), (3, 4)], ["A", "B"]) +>>> df.select(sf.try_sum(sf.col("A") + sf.col("B"))).show() +++ +|try_sum((A + B))| +++ +| 10| +++ + +Example 3: Calculating the summation of ages with None + +>>> import pyspark.sql.functions as sf +>>> df = spark.createDataFrame([(1982, None), (1990, 2), (2000, 4)], ["birth", "age"]) +>>> df.select(sf.try_sum("age")).show() +++ +|try_sum(age)| +++ +| 6| +++ + +Example 4: Overflow results in NULL when ANSI mode is on + +>>> from decimal import Decimal +>>> import pyspark.sql.functions as sf +>>> origin = spark.conf.get("spark.sql.ansi.enabled") +>>> spark.conf.set("spark.sql.ansi.enabled", "true") +>>> try: +... df = spark.createDataFrame([(Decimal("1" * 38),)] * 10, "number DECIMAL(38, 0)") +... df.select(sf.try_sum(df.number)).show() +... finally: +... spark.conf.set("spark.sql.ansi.enabled",
(spark) branch master updated: [MINOR][PYTHON][TESTS] Remove redundant parity tests
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8a6e5bb7a292 [MINOR][PYTHON][TESTS] Remove redundant parity tests 8a6e5bb7a292 is described below

commit 8a6e5bb7a2929ded80cce33f117a1c693b4ca531
Author: Ruifeng Zheng
AuthorDate: Thu Mar 28 20:05:36 2024 +0800

[MINOR][PYTHON][TESTS] Remove redundant parity tests

### What changes were proposed in this pull request?

Remove redundant parity tests

### Why are the changes needed?

code clean up

checked all parity tests, no other similar cases

### Does this PR introduce _any_ user-facing change?

no, test-only

### How was this patch tested?

ci

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #45746 from zhengruifeng/parity_test_cleanup.

Authored-by: Ruifeng Zheng
Signed-off-by: Ruifeng Zheng
---
 python/pyspark/sql/tests/connect/test_parity_dataframe.py | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/python/pyspark/sql/tests/connect/test_parity_dataframe.py b/python/pyspark/sql/tests/connect/test_parity_dataframe.py index 9293dd71c60d..343f485553a9 100644 --- a/python/pyspark/sql/tests/connect/test_parity_dataframe.py +++ b/python/pyspark/sql/tests/connect/test_parity_dataframe.py @@ -26,17 +26,10 @@ class DataFrameParityTests(DataFrameTestsMixin, ReusedConnectTestCase): df = self.spark.createDataFrame(data=[{"foo": "bar"}, {"foo": "baz"}]) super().check_help_command(df) -# Spark Connect throws `IllegalArgumentException` when calling `collect` instead of `sample`. -def test_sample(self): -super().test_sample() @unittest.skip("Spark Connect does not support RDD but the tests depend on them.") def test_toDF_with_schema_string(self): super().test_toDF_with_schema_string() -def test_toDF_with_string(self): -super().test_toDF_with_string() if __name__ == "__main__": import unittest
(spark) branch master updated: [SPARK-47614][CORE][DOC] Update some outdated comments about `JavaModuleOptions`
This is an automated email from the ASF dual-hosted git repository. yao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 299a6597950b [SPARK-47614][CORE][DOC] Update some outdated comments about `JavaModuleOptions` 299a6597950b is described below

commit 299a6597950b33382d2eab274b95ba8dbbef1d53
Author: panbingkun
AuthorDate: Thu Mar 28 19:28:27 2024 +0800

[SPARK-47614][CORE][DOC] Update some outdated comments about `JavaModuleOptions`

### What changes were proposed in this pull request?

This PR aims to update some outdated comments about `JavaModuleOptions`.

### Why are the changes needed?

As more and more JVM runtime options are added to the class `JavaModuleOptions`, some doc comments about `JavaModuleOptions` have become outdated. Let's update them.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45735 from panbingkun/SPARK-47614.

Authored-by: panbingkun
Signed-off-by: Kent Yao
---
 core/src/main/scala/org/apache/spark/SparkContext.scala | 2 +-
 .../main/java/org/apache/spark/launcher/JavaModuleOptions.java| 8 +++-
 .../java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java | 2 +-
 3 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala index f8f0107ed139..7595488cecee 100644 --- a/core/src/main/scala/org/apache/spark/SparkContext.scala +++ b/core/src/main/scala/org/apache/spark/SparkContext.scala @@ -3291,7 +3291,7 @@ object SparkContext extends Logging { } /** - * SPARK-36796: This is a helper function to supplement `--add-opens` options to + * SPARK-36796: This is a helper function to supplement some JVM runtime options to * `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions`. */ private def supplementJavaModuleOptions(conf: SparkConf): Unit = {

diff --git a/launcher/src/main/java/org/apache/spark/launcher/JavaModuleOptions.java b/launcher/src/main/java/org/apache/spark/launcher/JavaModuleOptions.java index 3a8fa6c42d47..dc5840185d62 100644 --- a/launcher/src/main/java/org/apache/spark/launcher/JavaModuleOptions.java +++ b/launcher/src/main/java/org/apache/spark/launcher/JavaModuleOptions.java @@ -18,7 +18,7 @@ package org.apache.spark.launcher; /** - * This helper class is used to place the all `--add-opens` options + * This helper class is used to place some JVM runtime options(eg: `--add-opens`) * required by Spark when using Java 17. `DEFAULT_MODULE_OPTIONS` has added * `-XX:+IgnoreUnrecognizedVMOptions` to be robust. * @@ -46,16 +46,14 @@ public class JavaModuleOptions { "-Dio.netty.tryReflectionSetAccessible=true"}; /** - * Returns the default Java options related to `--add-opens' and - * `-XX:+IgnoreUnrecognizedVMOptions` used by Spark. + * Returns the default JVM runtime options used by Spark. */ public static String defaultModuleOptions() { return String.join(" ", DEFAULT_MODULE_OPTIONS); } /** - * Returns the default Java option array related to `--add-opens' and - * `-XX:+IgnoreUnrecognizedVMOptions` used by Spark. + * Returns the default JVM runtime option array used by Spark.
*/ public static String[] defaultModuleOptionArray() { return DEFAULT_MODULE_OPTIONS; diff --git a/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java b/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java index d884f7e474c0..9b53711ebaea 100644 --- a/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java +++ b/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java @@ -309,7 +309,7 @@ class SparkSubmitCommandBuilder extends AbstractCommandBuilder { config.get(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH)); } -// SPARK-36796: Always add default `--add-opens` to submit command +// SPARK-36796: Always add some JVM runtime default options to submit command addOptionString(cmd, JavaModuleOptions.defaultModuleOptions()); addOptionString(cmd, "-Dderby.connection.requireAuthentication=false"); cmd.add("org.apache.spark.deploy.SparkSubmit");