(spark) branch master updated: [SPARK-45962][SQL] Remove `treatEmptyValuesAsNulls` and use `nullValue` option instead in XML

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2814293b289 [SPARK-45962][SQL] Remove `treatEmptyValuesAsNulls` and use `nullValue` option instead in XML
2814293b289 is described below

commit 2814293b28967ba5f6fe819bee55a70c065f6c66
Author: Shujing Yang 
AuthorDate: Thu Nov 16 23:27:17 2023 -0800

[SPARK-45962][SQL] Remove `treatEmptyValuesAsNulls` and use `nullValue` option instead in XML

### What changes were proposed in this pull request?

Remove treatEmptyValuesAsNulls and use nullValue option instead in XML

### Why are the changes needed?

Today, we offer two available options to handle null values. To enhance 
user clarity and simplify usage, we propose consolidating these into a single 
option. We recommend retaining the nullValue option due to its broader semantic 
scope.
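
For illustration, a minimal usage sketch of the surviving option (the input path and `rowTag` value below are placeholders, not taken from this PR):

```
// Hedged sketch: with treatEmptyValuesAsNulls removed, empty XML values are
// mapped to SQL NULL by setting nullValue to the empty string.
val df = spark.read
  .format("xml")
  .option("rowTag", "record")   // placeholder row element name
  .option("nullValue", "")      // empty values are read back as null
  .load("/path/to/input.xml")
```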

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43852 from shujingyang-db/treatEmptyValue.

Authored-by: Shujing Yang 
Signed-off-by: Dongjoon Hyun 
---
 .../org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala   | 13 ++---
 .../apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala  |  2 +-
 .../org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala  |  2 +-
 .../org/apache/spark/sql/catalyst/xml/XmlOptions.scala  |  2 --
 .../spark/sql/execution/datasources/xml/XmlSuite.scala  | 12 ++--
 5 files changed, 14 insertions(+), 17 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
index b39b2e63526..dcf02bac1de 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
@@ -183,8 +183,8 @@ class StaxXmlParser(
 (parser.peek, dataType) match {
   case (_: StartElement, dt: DataType) => convertComplicatedType(dt, attributes)
   case (_: EndElement, _: StringType) =>
-// Empty. It's null if these are explicitly treated as null, or "" is the null value
-if (options.treatEmptyValuesAsNulls || options.nullValue == "") {
+// Empty. It's null if "" is the null value
+if (options.nullValue == "") {
   null
 } else {
   UTF8String.fromString("")
@@ -224,7 +224,8 @@ class StaxXmlParser(
 parser.peek match {
   case _: StartElement => convertComplicatedType(dataType, attributes)
   case _: EndElement if data.isEmpty => null
-  case _: EndElement if options.treatEmptyValuesAsNulls => null
+  // treat empty values as null
+  case _: EndElement if options.nullValue == "" => null
   case _: EndElement => convertTo(data, dataType)
   case _ => convertField(parser, dataType, attributes)
 }
@@ -444,8 +445,7 @@ class StaxXmlParser(
   private def castTo(
   datum: String,
   castType: DataType): Any = {
-if ((datum == options.nullValue) ||
-  (options.treatEmptyValuesAsNulls && datum == "")) {
+if (datum == options.nullValue || datum == null) {
   null
 } else {
   castType match {
@@ -493,8 +493,7 @@ class StaxXmlParser(
 } else {
   datum
 }
-if ((value == options.nullValue) ||
-  (options.treatEmptyValuesAsNulls && value == "")) {
+if (value == options.nullValue || value == null) {
   null
 } else {
   dataType match {
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala
index d3b90564a75..654b78906bd 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala
@@ -96,7 +96,7 @@ object StaxXmlParserUtils {
   attributes.map { attr =>
 val key = options.attributePrefix + getName(attr.getName, options)
 val value = attr.getValue match {
-  case v if options.treatEmptyValuesAsNulls && v.trim.isEmpty => null
+  case v if (options.nullValue == "") && v.trim.isEmpty => null
   case v => v
 }
 key -> value
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala
index 53439879772..14470aa5fac 100644
--- 

(spark) branch master updated: [SPARK-45963][SQL][DOCS] Restore documentation for DSv2 API

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a7147c8e047 [SPARK-45963][SQL][DOCS] Restore documentation for DSv2 API
a7147c8e047 is described below

commit a7147c8e04711a552009d513d900d29fcb258315
Author: Hyukjin Kwon 
AuthorDate: Thu Nov 16 22:50:43 2023 -0800

[SPARK-45963][SQL][DOCS] Restore documentation for DSv2 API

### What changes were proposed in this pull request?

This PR restores the DSv2 documentation. https://github.com/apache/spark/pull/38392 mistakenly added `org/apache/spark/sql/connect` as a private package prefix, and that prefix also matches `org/apache/spark/sql/connector`.
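
The root cause is plain substring matching on canonical paths; a small illustrative sketch (the file path below is made up):

```
// Illustrative only: filtering on the prefix "org/apache/spark/sql/connect"
// also matches DSv2 sources under ".../sql/connector", so they vanished from Unidoc.
val dsv2Source = "sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Table.java"
dsv2Source.contains("org/apache/spark/sql/connect")   // true  -> wrongly excluded
dsv2Source.contains("org/apache/spark/sql/connect/")  // false -> kept, as intended
```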

### Why are the changes needed?

For end users to read DSv2 documentation.

### Does this PR introduce _any_ user-facing change?

Yes, it restores the DSv2 API documentation that used to be available at https://spark.apache.org/docs/3.3.0/api/scala/org/apache/spark/sql/connector/catalog/index.html

### How was this patch tested?

Manually tested via:

```
SKIP_PYTHONDOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 bundle exec jekyll build
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43855 from HyukjinKwon/connector-docs.

Authored-by: Hyukjin Kwon 
Signed-off-by: Dongjoon Hyun 
---
 project/SparkBuild.scala  | 2 +-
 .../apache/spark/sql/connector/catalog/SupportsMetadataColumns.java   | 4 ++--
 .../org/apache/spark/sql/connector/expressions/expressions.scala  | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index d76af6a06cf..b15bba0474c 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -1361,7 +1361,7 @@ object Unidoc {
       .map(_.filterNot(_.getCanonicalPath.contains("org/apache/spark/util/io")))
       .map(_.filterNot(_.getCanonicalPath.contains("org/apache/spark/util/kvstore")))
       .map(_.filterNot(_.getCanonicalPath.contains("org/apache/spark/sql/catalyst")))
-      .map(_.filterNot(_.getCanonicalPath.contains("org/apache/spark/sql/connect")))
+      .map(_.filterNot(_.getCanonicalPath.contains("org/apache/spark/sql/connect/")))
       .map(_.filterNot(_.getCanonicalPath.contains("org/apache/spark/sql/execution")))
       .map(_.filterNot(_.getCanonicalPath.contains("org/apache/spark/sql/internal")))
       .map(_.filterNot(_.getCanonicalPath.contains("org/apache/spark/sql/hive")))
diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java
index 894184dbcc8..e42424268b4 100644
--- a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java
+++ b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java
@@ -58,8 +58,8 @@ public interface SupportsMetadataColumns extends Table {
    * Determines how this data source handles name conflicts between metadata and data columns.
    *
    * If true, spark will automatically rename the metadata column to resolve the conflict. End users
-   * can reliably select metadata columns (renamed or not) with {@link Dataset.metadataColumn}, and
-   * internal code can use {@link MetadataAttributeWithLogicalName} to extract the logical name from
+   * can reliably select metadata columns (renamed or not) with {@code Dataset.metadataColumn}, and
+   * internal code can use {@code MetadataAttributeWithLogicalName} to extract the logical name from
    * a metadata attribute.
    *
    * If false, the data column will hide the metadata column. It is recommended that Table
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/expressions/expressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/expressions/expressions.scala
index 6fabb43a895..fc41d5a98e4 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/expressions/expressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/expressions/expressions.scala
@@ -156,7 +156,7 @@ private[sql] object BucketTransform {
 }
 
 /**
- * This class represents a transform for [[ClusterBySpec]]. This is used to bundle
+ * This class represents a transform for `ClusterBySpec`. This is used to bundle
  * ClusterBySpec in CreateTable's partitioning transforms to pass it down to analyzer.
  */
 final case class ClusterByTransform(



(spark) branch master updated: [SPARK-45966][DOCS][PS] Add missing methods for API reference

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a5fe85fc116 [SPARK-45966][DOCS][PS] Add missing methods for API reference
a5fe85fc116 is described below

commit a5fe85fc11658c0212256f654e349c6ea9e18736
Author: Haejoon Lee 
AuthorDate: Thu Nov 16 22:46:31 2023 -0800

[SPARK-45966][DOCS][PS] Add missing methods for API reference

### What changes were proposed in this pull request?

This PR proposes to add missing methods for API reference.

### Why are the changes needed?

For a better API reference, we should reflect the actual API status in the documentation.

### Does this PR introduce _any_ user-facing change?

No API changes, but user-facing documentation will be improved.

### How was this patch tested?

The existing CI, especially the documentation build, should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43860 from itholic/SPARK-45966.

Authored-by: Haejoon Lee 
Signed-off-by: Dongjoon Hyun 
---
 python/docs/source/reference/pyspark.pandas/indexing.rst | 6 ++
 python/docs/source/reference/pyspark.pandas/series.rst   | 1 +
 2 files changed, 7 insertions(+)

diff --git a/python/docs/source/reference/pyspark.pandas/indexing.rst b/python/docs/source/reference/pyspark.pandas/indexing.rst
index 71584892ca3..7ec4387bb67 100644
--- a/python/docs/source/reference/pyspark.pandas/indexing.rst
+++ b/python/docs/source/reference/pyspark.pandas/indexing.rst
@@ -105,7 +105,9 @@ Missing Values
Index.fillna
Index.dropna
Index.isna
+   Index.isnull
Index.notna
+   Index.notnull
 
 Conversion
 ~~
@@ -190,6 +192,10 @@ Categorical components
CategoricalIndex.as_ordered
CategoricalIndex.as_unordered
CategoricalIndex.map
+   CategoricalIndex.equals
+   CategoricalIndex.max
+   CategoricalIndex.min
+   CategoricalIndex.tolist
 
 .. _api.multiindex:
 
diff --git a/python/docs/source/reference/pyspark.pandas/series.rst b/python/docs/source/reference/pyspark.pandas/series.rst
index eb4a499c054..01fb5aa87fb 100644
--- a/python/docs/source/reference/pyspark.pandas/series.rst
+++ b/python/docs/source/reference/pyspark.pandas/series.rst
@@ -214,6 +214,7 @@ Missing data handling
 
Series.backfill
Series.bfill
+   Series.ffill
Series.isna
Series.isnull
Series.notna



(spark) branch master updated: [SPARK-45968][INFRA] Upgrade github docker action to latest version

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0c1ba5c1e64 [SPARK-45968][INFRA] Upgrade github docker action to latest version
0c1ba5c1e64 is described below

commit 0c1ba5c1e64acfa6ddd97891d6b75ecb934fbcbb
Author: panbingkun 
AuthorDate: Thu Nov 16 22:44:08 2023 -0800

[SPARK-45968][INFRA] Upgrade github docker action to latest version

### What changes were proposed in this pull request?
This PR aims to upgrade the GitHub Docker actions to their latest versions, including:
- `docker/login-action` from `v2` to `v3`
- `docker/setup-qemu-action` from `v2` to `v3`
- `docker/setup-buildx-action` from `v2` to `v3`
- `docker/build-push-action` from `v3` to `v5`

### Why are the changes needed?
- `docker/login-action` v3 release notes:
   https://github.com/docker/login-action/releases/tag/v3.0.0

- `docker/setup-qemu-action` v3 release notes:
   https://github.com/docker/setup-qemu-action/releases/tag/v3.0.0

- `docker/setup-buildx-action` v3 release notes:
   https://github.com/docker/setup-buildx-action/releases/tag/v3.0.0

- `docker/build-push-action` v5 release notes:
   https://github.com/docker/build-push-action/releases/tag/v5.0.0
- `docker/build-push-action` v4 release notes:
   https://github.com/docker/build-push-action/releases/tag/v4.0.0

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43862 from panbingkun/docker_action_upgrade.

Authored-by: panbingkun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml   | 8 
 .github/workflows/build_infra_images_cache.yml | 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 95ce051f32f..e38bff3d563 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -296,7 +296,7 @@ jobs:
   packages: write
 steps:
   - name: Login to GitHub Container Registry
-uses: docker/login-action@v2
+uses: docker/login-action@v3
 with:
   registry: ghcr.io
   username: ${{ github.actor }}
@@ -316,12 +316,12 @@ jobs:
   git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' merge --no-commit --progress --squash FETCH_HEAD
   git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' commit -m "Merged commit" --allow-empty
   - name: Set up QEMU
-uses: docker/setup-qemu-action@v2
+uses: docker/setup-qemu-action@v3
   - name: Set up Docker Buildx
-uses: docker/setup-buildx-action@v2
+uses: docker/setup-buildx-action@v3
   - name: Build and push
 id: docker_build
-uses: docker/build-push-action@v3
+uses: docker/build-push-action@v5
 with:
   context: ./dev/infra/
   push: true
diff --git a/.github/workflows/build_infra_images_cache.yml b/.github/workflows/build_infra_images_cache.yml
index 3e025883084..49b2e2e80d9 100644
--- a/.github/workflows/build_infra_images_cache.yml
+++ b/.github/workflows/build_infra_images_cache.yml
@@ -40,18 +40,18 @@ jobs:
   - name: Checkout Spark repository
 uses: actions/checkout@v4
   - name: Set up QEMU
-uses: docker/setup-qemu-action@v2
+uses: docker/setup-qemu-action@v3
   - name: Set up Docker Buildx
-uses: docker/setup-buildx-action@v2
+uses: docker/setup-buildx-action@v3
   - name: Login to DockerHub
-uses: docker/login-action@v2
+uses: docker/login-action@v3
 with:
   registry: ghcr.io
   username: ${{ github.actor }}
   password: ${{ secrets.GITHUB_TOKEN }}
   - name: Build and push
 id: docker_build
-uses: docker/build-push-action@v3
+uses: docker/build-push-action@v5
 with:
   context: ./dev/infra/
   push: true



(spark) branch master updated: [MINOR][SQL] Remove unimplemented instances in XML Data Source

2023-11-16 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6d25af4a4c8 [MINOR][SQL] Remove unimplemented instances in XML Data Source
6d25af4a4c8 is described below

commit 6d25af4a4c819bc3a05c2fe9b8bf92e0a5629dcd
Author: Hyukjin Kwon 
AuthorDate: Fri Nov 17 14:44:01 2023 +0900

[MINOR][SQL] Remove unimplemented instances in XML Data Source

### What changes were proposed in this pull request?

This PR removes unimplemented instances in XML Data Source. They are 
presumably copied from JSON/CSV Data Source but they are not implemented yet.

### Why are the changes needed?

They are unreachable code.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing CI in this PR should test them out.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43857 from HyukjinKwon/xml-cleanup.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .../apache/spark/sql/catalyst/xml/StaxXmlGenerator.scala|  7 ---
 .../org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala   | 13 +
 .../org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala  | 10 --
 .../spark/sql/execution/datasources/xml/XmlFileFormat.scala |  5 +
 4 files changed, 2 insertions(+), 33 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlGenerator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlGenerator.scala
index ae3a64d865c..c8333758229 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlGenerator.scala
@@ -48,13 +48,6 @@ class StaxXmlGenerator(
 legacyFormat = FAST_DATE_FORMAT,
 isParsing = false)
 
-  private val timestampNTZFormatter = TimestampFormatter(
-options.timestampNTZFormatInWrite,
-options.zoneId,
-legacyFormat = FAST_DATE_FORMAT,
-isParsing = false,
-forTimestampNTZ = true)
-
   private val dateFormatter = DateFormatter(
 options.dateFormatInWrite,
 options.locale,
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
index b3174b70441..b39b2e63526 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
@@ -40,16 +40,12 @@ import org.apache.spark.sql.catalyst.util.LegacyDateFormats.FAST_DATE_FORMAT
 import org.apache.spark.sql.catalyst.xml.StaxXmlParser.convertStream
 import org.apache.spark.sql.errors.QueryExecutionErrors
 import org.apache.spark.sql.internal.SQLConf
-import org.apache.spark.sql.sources.Filter
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.UTF8String
 
 class StaxXmlParser(
 schema: StructType,
-val options: XmlOptions,
-filters: Seq[Filter] = Seq.empty) extends Logging {
-
-  private val factory = options.buildXmlFactory()
+val options: XmlOptions) extends Logging {
 
   private lazy val timestampFormatter = TimestampFormatter(
 options.timestampFormatInRead,
@@ -58,13 +54,6 @@ class StaxXmlParser(
 legacyFormat = FAST_DATE_FORMAT,
 isParsing = true)
 
-  private lazy val timestampNTZFormatter = TimestampFormatter(
-options.timestampNTZFormatInRead,
-options.zoneId,
-legacyFormat = FAST_DATE_FORMAT,
-isParsing = true,
-forTimestampNTZ = true)
-
   private lazy val dateFormatter = DateFormatter(
 options.dateFormatInRead,
 options.locale,
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala
index eeb5a9de4ed..53439879772 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala
@@ -31,7 +31,6 @@ import scala.util.control.NonFatal
 
 import org.apache.spark.internal.Logging
 import org.apache.spark.rdd.RDD
-import org.apache.spark.sql.catalyst.expressions.ExprUtils
 import org.apache.spark.sql.catalyst.util.{DateFormatter, PermissiveMode, TimestampFormatter}
 import org.apache.spark.sql.catalyst.util.LegacyDateFormats.FAST_DATE_FORMAT
 import org.apache.spark.sql.types._
@@ -40,8 +39,6 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
 extends Serializable
 with Logging {
 
-  private val decimalParser = ExprUtils.getDecimalParser(options.locale)
-
   private val 

(spark) branch master updated: [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order

2023-11-16 Thread mridulm80
This is an automated email from the ASF dual-hosted git repository.

mridulm80 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7c146c925b3 [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order
7c146c925b3 is described below

commit 7c146c925b363fc67eedc7411068f24dd780b583
Author: Alessandro Bellina 
AuthorDate: Thu Nov 16 21:06:06 2023 -0600

[SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order

### What changes were proposed in this pull request?
As reported here https://issues.apache.org/jira/browse/SPARK-45762, 
`ShuffleManager` instances defined in a user jar cannot be used in all cases, 
unless specified in the `extraClassPath`. We would like to avoid adding extra 
configurations if this instance is already included in a jar passed via 
`--jars`.

Proposed changes:

Refactor code so we initialize the `ShuffleManager` later, after jars have 
been localized. This is especially necessary in the executor, where we would 
need to move this initialization until after the `replClassLoader` is updated 
with jars passed in `--jars`.

Before this change, the `ShuffleManager` is instantiated at `SparkEnv` 
creation. Having to instantiate the `ShuffleManager` this early doesn't work, 
because user jars have not been localized in all scenarios, and we will fail to 
load the `ShuffleManager` defined in `--jars`. We propose moving the 
`ShuffleManager` instantiation to `SparkContext` on the driver, and `Executor`.
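
For illustration, a hedged sketch of how a shuffle manager shipped only via `--jars` would be selected after this change; the class name and jar path below are placeholders, and the only real Spark settings used are `spark.jars` and `spark.shuffle.manager`:

```
// Hypothetical example: the ShuffleManager class lives in a jar passed at submit
// time, not on extraClassPath. With the new startup order it is instantiated only
// after the jar has been localized and added to the classloader.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("custom-shuffle-demo")
  .config("spark.jars", "/path/to/my-shuffle-plugin.jar")           // placeholder jar
  .config("spark.shuffle.manager", "com.example.MyShuffleManager")  // placeholder class
  .getOrCreate()
```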

### Why are the changes needed?
This is not a new API but a change of startup order. The changes are needed to improve the user experience by reducing extra configuration, regardless of how a Spark application is launched.

### Does this PR introduce _any_ user-facing change?
Yes, but it's backwards compatible. Users no longer need to specify a 
`ShuffleManager` jar in `extraClassPath`, but they are able to if they desire.

This change is not binary compatible with Spark 3.5.0 (see MIMA comments 
below). I have added a rule to MimaExcludes to handle it 
https://github.com/apache/spark/pull/43627/commits/970bff4edc6ba14d8de78aa175415e204d6a627b

### How was this patch tested?
Added a unit test showing that a test `ShuffleManager` is available after 
`--jars` are passed, but not without (using local-cluster mode).

Tested manually with standalone mode, local-cluster mode, yarn client and 
cluster mode, k8s.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43627 from abellina/shuffle_manager_initialization_order.

Authored-by: Alessandro Bellina 
Signed-off-by: Mridul Muralidharan gmail.com>
---
 .../main/scala/org/apache/spark/SparkContext.scala |  1 +
 .../src/main/scala/org/apache/spark/SparkEnv.scala | 38 ++-
 .../scala/org/apache/spark/executor/Executor.scala | 13 +++-
 .../org/apache/spark/shuffle/ShuffleManager.scala  | 26 +++-
 .../org/apache/spark/storage/BlockManager.scala| 14 +++-
 .../spark/storage/BlockManagerMasterEndpoint.scala |  9 ++-
 .../org/apache/spark/deploy/SparkSubmitSuite.scala | 77 ++
 .../apache/spark/deploy/SparkSubmitTestUtils.scala |  6 +-
 project/MimaExcludes.scala |  4 +-
 9 files changed, 160 insertions(+), 28 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala
index 73dcaffa6ce..ed00baa01d6 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -577,6 +577,7 @@ class SparkContext(config: SparkConf) extends Logging {
 
 // Initialize any plugins before the task scheduler is initialized.
 _plugins = PluginContainer(this, _resources.asJava)
+_env.initializeShuffleManager()
 
 // Create and start the scheduler
 val (sched, ts) = SparkContext.createTaskScheduler(this, master)
diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala b/core/src/main/scala/org/apache/spark/SparkEnv.scala
index 3277f86e367..94a4debd026 100644
--- a/core/src/main/scala/org/apache/spark/SparkEnv.scala
+++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala
@@ -18,13 +18,13 @@
 package org.apache.spark
 
 import java.io.File
-import java.util.Locale
 
 import scala.collection.concurrent
 import scala.collection.mutable
 import scala.jdk.CollectionConverters._
 import scala.util.Properties
 
+import com.google.common.base.Preconditions
 import com.google.common.cache.CacheBuilder
 import org.apache.hadoop.conf.Configuration
 
@@ -63,7 +63,6 @@ class SparkEnv (
 val closureSerializer: Serializer,
 val serializerManager: SerializerManager,
 val mapOutputTracker: 

(spark) branch master updated: [SPARK-45964][SQL] Remove private sql accessor in XML and JSON package under catalyst package

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8147620ac49 [SPARK-45964][SQL] Remove private sql accessor in XML and JSON package under catalyst package
8147620ac49 is described below

commit 8147620ac49ec4c82b9ef34681334a34c0ad0e37
Author: Hyukjin Kwon 
AuthorDate: Thu Nov 16 17:57:55 2023 -0800

[SPARK-45964][SQL] Remove private sql accessor in XML and JSON package under catalyst package

### What changes were proposed in this pull request?

This PR removes `private[sql]` in the XML and JSON packages under the `catalyst` package.

### Why are the changes needed?

`catalyst` is already a private package: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala#L21-L22

See also SPARK-16813

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI in this PR should test them out.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43856 from HyukjinKwon/SPARK-45964.

Authored-by: Hyukjin Kwon 
Signed-off-by: Dongjoon Hyun 
---
 .../org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala| 2 +-
 .../main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala | 6 +++---
 .../scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala | 2 +-
 .../scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala  | 2 +-
 .../scala/org/apache/spark/sql/catalyst/xml/CreateXmlParser.scala   | 2 +-
 .../scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala | 4 ++--
 .../org/apache/spark/sql/catalyst/xml/StaxXmlParserUtils.scala  | 2 +-
 .../scala/org/apache/spark/sql/catalyst/xml/ValidatorUtil.scala | 2 +-
 .../scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala| 2 +-
 .../main/scala/org/apache/spark/sql/catalyst/xml/XmlOptions.scala   | 4 ++--
 10 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala
index 156c6b819f2..61ef14a3f10 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala
@@ -29,7 +29,7 @@ import sun.nio.cs.StreamDecoder
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.unsafe.types.UTF8String
 
-private[sql] object CreateJacksonParser extends Serializable {
+object CreateJacksonParser extends Serializable {
   def string(jsonFactory: JsonFactory, record: String): JsonParser = {
 jsonFactory.createParser(record)
   }
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
index 596d9e39b94..e5aa0bb6d2c 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
@@ -34,7 +34,7 @@ import org.apache.spark.sql.internal.{LegacyBehaviorPolicy, SQLConf}
  *
  * Most of these map directly to Jackson's internal options, specified in [[JsonReadFeature]].
  */
-private[sql] class JSONOptions(
+class JSONOptions(
 @transient val parameters: CaseInsensitiveMap[String],
 defaultTimeZoneId: String,
 defaultColumnNameOfCorruptRecord: String)
@@ -212,7 +212,7 @@ private[sql] class JSONOptions(
   }
 }
 
-private[sql] class JSONOptionsInRead(
+class JSONOptionsInRead(
 @transient override val parameters: CaseInsensitiveMap[String],
 defaultTimeZoneId: String,
 defaultColumnNameOfCorruptRecord: String)
@@ -242,7 +242,7 @@ private[sql] class JSONOptionsInRead(
   }
 }
 
-private[sql] object JSONOptionsInRead {
+object JSONOptionsInRead {
   // The following encodings are not supported in per-line mode (multiline is false)
   // because they cause some problems in reading files with BOM which is supposed to
   // present in the files with such encodings. After splitting input files by lines,
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala
index 0a243c63685..e02b2860618 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala
@@ -37,7 +37,7 @@ import org.apache.spark.util.ArrayImplicits._
  * of map. An exception will be thrown if trying to 

(spark) branch master updated: [SPARK-45952][PYTHON][DOCS] Use built-in math constants in math functions

2023-11-16 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new db0da0c0b52 [SPARK-45952][PYTHON][DOCS] Use built-in math constants in math functions
db0da0c0b52 is described below

commit db0da0c0b52bcbc0d9ac2634773a5e21d45dc691
Author: Ruifeng Zheng 
AuthorDate: Fri Nov 17 09:26:41 2023 +0900

[SPARK-45952][PYTHON][DOCS] Use built-in math constants in math functions

### What changes were proposed in this pull request?
Use the newly added built-in math constants (`PI` and `E`) in math functions

### Why are the changes needed?
to improve the docstring

### Does this PR introduce _any_ user-facing change?
yes, doc change

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #43837 from zhengruifeng/py_doc_math.

Authored-by: Ruifeng Zheng 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/functions.py | 107 ++--
 1 file changed, 69 insertions(+), 38 deletions(-)

diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index e3b8e4965e4..655806e8377 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -1838,10 +1838,13 @@ def cos(col: "ColumnOrName") -> Column:
 
 Examples
 
->>> import math
->>> df = spark.range(1)
->>> df.select(cos(lit(math.pi))).first()
-Row(COS(3.14159...)=-1.0)
+>>> from pyspark.sql import functions as sf
+>>> spark.range(1).select(sf.cos(sf.pi())).show()
++-+
+|COS(PI())|
++-+
+| -1.0|
++-+
 """
 return _invoke_function_over_columns("cos", col)
 
@@ -1897,10 +1900,13 @@ def cot(col: "ColumnOrName") -> Column:
 
 Examples
 
->>> import math
->>> df = spark.range(1)
->>> df.select(cot(lit(math.radians(45)))).first()
-Row(COT(0.78539...)=1.0...)
+>>> from pyspark.sql import functions as sf
+>>> spark.range(1).select(sf.cot(sf.pi() / 4)).show()
++--+
+|   COT((PI() / 4))|
++--+
+|1.0...|
++--+
 """
 return _invoke_function_over_columns("cot", col)
 
@@ -1927,10 +1933,13 @@ def csc(col: "ColumnOrName") -> Column:
 
 Examples
 
->>> import math
->>> df = spark.range(1)
->>> df.select(csc(lit(math.radians(90)))).first()
-Row(CSC(1.57079...)=1.0)
+>>> from pyspark.sql import functions as sf
+>>> spark.range(1).select(sf.csc(sf.pi() / 2)).show()
++---+
+|CSC((PI() / 2))|
++---+
+|1.0|
++---+
 """
 return _invoke_function_over_columns("csc", col)
 
@@ -2091,10 +2100,13 @@ def log(col: "ColumnOrName") -> Column:
 
 Examples
 
->>> import math
->>> df = spark.range(1)
->>> df.select(log(lit(math.e))).first()
-Row(ln(2.71828...)=1.0)
+>>> from pyspark.sql import functions as sf
+>>> spark.range(1).select(sf.log(sf.e())).show()
++---+
+|ln(E())|
++---+
+|1.0|
++---+
 """
 return _invoke_function_over_columns("log", col)
 
@@ -2154,15 +2166,22 @@ def log1p(col: "ColumnOrName") -> Column:
 
 Examples
 
->>> import math
->>> df = spark.range(1)
->>> df.select(log1p(lit(math.e))).first()
-Row(LOG1P(2.71828...)=1.31326...)
+>>> from pyspark.sql import functions as sf
+>>> spark.range(1).select(sf.log1p(sf.e())).show()
++--+
+|LOG1P(E())|
++--+
+|1.3132616875182...|
++--+
 
 Same as:
 
->>> df.select(log(lit(math.e+1))).first()
-Row(ln(3.71828...)=1.31326...)
+>>> spark.range(1).select(sf.log(sf.e() + 1)).show()
++--+
+| ln((E() + 1))|
++--+
+|1.3132616875182...|
++--+
 """
 return _invoke_function_over_columns("log1p", col)
 
@@ -2416,10 +2435,13 @@ def sin(col: "ColumnOrName") -> Column:
 
 Examples
 
->>> import math
->>> df = spark.range(1)
->>> df.select(sin(lit(math.radians(90)))).first()
-Row(SIN(1.57079...)=1.0)
+>>> from pyspark.sql import functions as sf
+>>> spark.range(1).select(sf.sin(sf.pi() / 2)).show()
++---+
+|SIN((PI() / 2))|
++---+
+|1.0|
++---+
 """
 return _invoke_function_over_columns("sin", col)
 
@@ -2476,10 +2498,13 @@ def tan(col: "ColumnOrName") -> Column:
 
 Examples
 
->>> import math
->>> df = spark.range(1)
->>> df.select(tan(lit(math.radians(45)))).first()
-

(spark) branch master updated: [SPARK-45912][SQL] Enhancement of XSDToSchema API: Change to HDFS API for cloud storage accessibility

2023-11-16 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new fcf340a1de3 [SPARK-45912][SQL] Enhancement of XSDToSchema API: Change to HDFS API for cloud storage accessibility
fcf340a1de3 is described below

commit fcf340a1de371ce1beb2cf93473ea2f2b793801b
Author: Shujing Yang 
AuthorDate: Fri Nov 17 09:17:57 2023 +0900

[SPARK-45912][SQL] Enhancement of XSDToSchema API: Change to HDFS API for cloud storage accessibility

### What changes were proposed in this pull request?

Previously, it utilized `java.nio.path`, which limited file reading to 
local file systems only. By changing this to an HDFS-compatible API, we now 
enable the XSDToSchema function to access files in cloud storage.
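
As a rough usage sketch (assuming the existing `XSDToSchema.read` entry point simply switches to `org.apache.hadoop.fs.Path`; the bucket paths and `rowTag` are placeholders):

```
// Hedged sketch: deriving a schema from an XSD stored in cloud storage, now that
// path resolution goes through the Hadoop FileSystem API instead of java.nio.
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.execution.datasources.xml.XSDToSchema

val schema = XSDToSchema.read(new Path("s3a://my-bucket/schemas/books.xsd"))
val df = spark.read
  .format("xml")
  .option("rowTag", "book")
  .schema(schema)
  .load("s3a://my-bucket/data/books.xml")
```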

### Why are the changes needed?

We want to enable the XSDToSchema function to access files in cloud storage.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43789 from shujingyang-db/xsd_api.

Authored-by: Shujing Yang 
Signed-off-by: Hyukjin Kwon 
---
 .../spark/sql/catalyst/xml/ValidatorUtil.scala | 36 ---
 .../execution/datasources/xml/XSDToSchema.scala| 35 +++---
 .../datasources/xml/util/XSDToSchemaSuite.scala| 41 +++---
 3 files changed, 55 insertions(+), 57 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/ValidatorUtil.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/ValidatorUtil.scala
index f8b546332c2..0d85a512d7e 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/ValidatorUtil.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/ValidatorUtil.scala
@@ -16,6 +16,7 @@
  */
 package org.apache.spark.sql.catalyst.xml
 
+import java.io.{File, FileInputStream, InputStream}
 import javax.xml.XMLConstants
 import javax.xml.transform.stream.StreamSource
 import javax.xml.validation.{Schema, SchemaFactory}
@@ -25,28 +26,18 @@ import org.apache.hadoop.fs.Path
 
 import org.apache.spark.SparkFiles
 import org.apache.spark.deploy.SparkHadoopUtil
-import org.apache.spark.util.Utils
+import org.apache.spark.internal.Logging
 
 /**
  * Utilities for working with XSD validation.
  */
-private[sql] object ValidatorUtil {
+private[sql] object ValidatorUtil extends Logging{
   // Parsing XSDs may be slow, so cache them by path:
 
   private val cache = CacheBuilder.newBuilder().softValues().build(
 new CacheLoader[String, Schema] {
   override def load(key: String): Schema = {
-val in = try {
-  // Handle case where file exists as specified
-  val fs = Utils.getHadoopFileSystem(key, SparkHadoopUtil.get.conf)
-  fs.open(new Path(key))
-} catch {
-  case _: Throwable =>
-// Handle case where it was added with sc.addFile
-val addFileUrl = SparkFiles.get(key)
-val fs = Utils.getHadoopFileSystem(addFileUrl, 
SparkHadoopUtil.get.conf)
-fs.open(new Path(addFileUrl))
-}
+val in = openSchemaFile(new Path(key))
 try {
   val schemaFactory = 
SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
   schemaFactory.newSchema(new StreamSource(in))
@@ -56,6 +47,25 @@ private[sql] object ValidatorUtil {
   }
 })
 
+  def openSchemaFile(xsdPath: Path): InputStream = {
+try {
+  // Handle case where file exists as specified
+  val fs = xsdPath.getFileSystem(SparkHadoopUtil.get.conf)
+  fs.open(xsdPath)
+} catch {
+  case e: Throwable =>
+// Handle case where it was added with sc.addFile
+// When they are added via sc.addFile, they are always downloaded to local file system
+logInfo(s"$xsdPath was not found, falling back to look up files added by Spark")
+val f = new File(SparkFiles.get(xsdPath.toString))
+if (f.exists()) {
+  new FileInputStream(f)
+} else {
+  throw e
+}
+}
+  }
+
   /**
* Parses the XSD at the given local path and caches it.
*
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala
index b0894ed3484..356ffd57698 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala
@@ -16,53 +16,42 @@
  */
 package org.apache.spark.sql.execution.datasources.xml
 
-import java.io.{File, FileInputStream, 

(spark) branch master updated: [SPARK-45950][INFRA][CORE] Fix `IvyTestUtils#createIvyDescriptor` function and make `common-utils` module can run tests on GitHub Action

2023-11-16 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f6b670a650d [SPARK-45950][INFRA][CORE] Fix `IvyTestUtils#createIvyDescriptor` function and make `common-utils` module can run tests on GitHub Action
f6b670a650d is described below

commit f6b670a650d23ef4b2be5ed5c903091361a26a63
Author: yangjie01 
AuthorDate: Fri Nov 17 08:41:14 2023 +0900

[SPARK-45950][INFRA][CORE] Fix `IvyTestUtils#createIvyDescriptor` function and make `common-utils` module can run tests on GitHub Action

### What changes were proposed in this pull request?
This PR mainly does two things:
1. It reverts a line of code in `IvyTestUtils.scala` that was mistakenly deleted in SPARK-45506 | https://github.com/apache/spark/pull/43354, to ensure that the `ivy.xml` file generated by `IvyTestUtils#createIvyDescriptor` is complete. Before this PR, the generated `ivy.xml` file would be missing a closing tag, which would cause two test cases in `MavenUtilsSuite` to fail. We can reproduce the problem by executing the `build/sbt "common-utils/test"` command:

```
[info] MavenUtilsSuite:
[info] - incorrect maven coordinate throws error (8 milliseconds)
[info] - create repo resolvers (24 milliseconds)
[info] - create additional resolvers (3 milliseconds)
:: loading settings :: url = 
jar:file:/Users/yangjie01/Library/Caches/Coursier/v1/https/repo1.maven.org/maven2/org/apache/ivy/ivy/2.5.1/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
[info] - add dependencies works correctly (35 milliseconds)
[info] - excludes works correctly (2 milliseconds)
[info] - ivy path works correctly (3 seconds, 759 milliseconds)
[info] - search for artifact at local repositories *** FAILED *** (2 
seconds, 833 milliseconds)
[info]   java.lang.RuntimeException: [unresolved dependency: 
my.great.lib#mylib;0.1: java.text.ParseException: [[Fatal Error] 
ivy-0.1.xml.original:22:18: XML document structures must start and end within 
the same entity. in 
file:/SourceCode/git/spark-mine-sbt/target/tmp/ivy-8b860aca-a9c4-4af9-b15a-ac8c6049b773/cache/my.great.lib/mylib/ivy-0.1.xml.original
[info] ]]
[info]   at 
org.apache.spark.util.MavenUtils$.resolveMavenCoordinates(MavenUtils.scala:459)
[info]   at 
org.apache.spark.util.MavenUtilsSuite.$anonfun$new$25(MavenUtilsSuite.scala:173)
[info]   at 
org.apache.spark.util.MavenUtilsSuite.$anonfun$new$25$adapted(MavenUtilsSuite.scala:172)
[info]   at 
org.apache.spark.util.IvyTestUtils$.withRepository(IvyTestUtils.scala:373)
[info]   at 
org.apache.spark.util.MavenUtilsSuite.$anonfun$new$18(MavenUtilsSuite.scala:172)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info]   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
[info]   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
[info]   at 
org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info]   at 
org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info]   at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info]   at scala.collection.immutable.List.foreach(List.scala:333)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info]   at 
org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
[info]   at org.scalatest.Suite.run(Suite.scala:1114)
[info]   at 

(spark) branch master updated: [SPARK-45960][INFRA] Add Python 3.10 to the Daily Python Github Action job

2023-11-16 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 49cc081936c [SPARK-45960][INFRA] Add Python 3.10 to the Daily Python Github Action job
49cc081936c is described below

commit 49cc081936c1bae7abfe36b64941c243aca769a2
Author: Dongjoon Hyun 
AuthorDate: Fri Nov 17 08:40:06 2023 +0900

[SPARK-45960][INFRA] Add Python 3.10 to the Daily Python Github Action job

### What changes were proposed in this pull request?

This PR aims to enable `Python 3.10` testing in the following daily 
`Python-only` Github Action job.

https://github.com/apache/spark/actions/workflows/build_python.yml

### Why are the changes needed?

To provide `Python 3.10` test coverage to Apache Spark 4.0.0.

Since SPARK-45953 installed `Python 3.10` into the infra image, what we 
need is to add it to the daily job.
- #43840

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

We need to validate this in the daily GitHub Action job.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43847 from dongjoon-hyun/SPARK-45960.

Authored-by: Dongjoon Hyun 
Signed-off-by: Hyukjin Kwon 
---
 .github/workflows/build_python.yml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/build_python.yml b/.github/workflows/build_python.yml
index 04b46ffca67..f89f13c2bba 100644
--- a/.github/workflows/build_python.yml
+++ b/.github/workflows/build_python.yml
@@ -17,7 +17,7 @@
 # under the License.
 #
 
-name: "Build / Python-only (master, PyPy 3.8)"
+name: "Build / Python-only (master, PyPy 3.8/Python 3.10)"
 
 on:
   schedule:
@@ -36,7 +36,7 @@ jobs:
   hadoop: hadoop3
   envs: >-
 {
-  "PYTHON_TO_TEST": "pypy3"
+  "PYTHON_TO_TEST": "pypy3,python3.10"
 }
   jobs: >-
 {



(spark) branch branch-3.4 updated: [SPARK-45961][DOCS][3.4] Document `spark.master.*` configurations

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 559c97b2498 [SPARK-45961][DOCS][3.4] Document `spark.master.*` configurations
559c97b2498 is described below

commit 559c97b2498ad4ef77c2b624c4ddf493497335bb
Author: Dongjoon Hyun 
AuthorDate: Thu Nov 16 15:38:08 2023 -0800

[SPARK-45961][DOCS][3.4] Document `spark.master.*` configurations

### What changes were proposed in this pull request?

This PR documents `spark.master.*` configurations.

### Why are the changes needed?

Currently, `spark.master.*` configurations are undocumented.
```
$ git grep 'ConfigBuilder("spark.master'
core/src/main/scala/org/apache/spark/internal/config/UI.scala:  val MASTER_UI_DECOMMISSION_ALLOW_MODE = ConfigBuilder("spark.master.ui.decommission.allow.mode")
core/src/main/scala/org/apache/spark/internal/config/package.scala:  private[spark] val MASTER_REST_SERVER_ENABLED = ConfigBuilder("spark.master.rest.enabled")
core/src/main/scala/org/apache/spark/internal/config/package.scala:  private[spark] val MASTER_REST_SERVER_PORT = ConfigBuilder("spark.master.rest.port")
core/src/main/scala/org/apache/spark/internal/config/package.scala:  private[spark] val MASTER_UI_PORT = ConfigBuilder("spark.master.ui.port")
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

![Screenshot 2023-11-16 at 2 57 21 PM](https://github.com/apache/spark/assets/9700541/6e9646d6-0144-4d10-bba8-500e9ce5e4cb)

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43850 from dongjoon-hyun/SPARK-45961-3.4.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 docs/spark-standalone.md | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md
index b388c2f3de1..5a60e63d415 100644
--- a/docs/spark-standalone.md
+++ b/docs/spark-standalone.md
@@ -190,6 +190,41 @@ SPARK_MASTER_OPTS supports the following system properties:
 
 
 Property NameDefaultMeaningSince 
Version
+
+  spark.master.ui.port
+  8080
+  
+Specifies the port number of the Master Web UI endpoint.
+  
+  1.1.0
+
+
+  spark.master.ui.decommission.allow.mode
+  LOCAL
+  
+Specifies the behavior of the Master Web UI's /workers/kill endpoint. Possible choices
+are: LOCAL means allow this endpoint from IP's that are local to the machine running
+the Master, DENY means to completely disable this endpoint, ALLOW means to allow
+calling this endpoint from any IP.
+  
+  3.1.0
+
+
+  spark.master.rest.enabled
+  false
+  
+Whether to use the Master REST API endpoint or not.
+  
+  1.3.0
+
+
+  spark.master.rest.port
+  6066
+  
+Specifies the port number of the Master REST API endpoint.
+  
+  1.3.0
+
 
   spark.deploy.retainedApplications
   200



(spark) branch branch-3.5 updated (f0054c5a10b -> e3549b25364)

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


from f0054c5a10b [SPARK-45920][SQL][3.5] group by ordinal should be idempotent
 add e3549b25364 [SPARK-45961][DOCS][3.5] Document `spark.master.*` configurations

No new revisions were added by this update.

Summary of changes:
 docs/spark-standalone.md | 35 +++
 1 file changed, 35 insertions(+)



(spark) branch master updated: [SPARK-45961][DOCS] Document `spark.master.*` configurations

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c1dad17c48e [SPARK-45961][DOCS] Document `spark.master.*` configurations
c1dad17c48e is described below

commit c1dad17c48e17d30b284f4d6082766086d1cb7d4
Author: Dongjoon Hyun 
AuthorDate: Thu Nov 16 15:36:05 2023 -0800

[SPARK-45961][DOCS] Document `spark.master.*` configurations

### What changes were proposed in this pull request?

This PR documents `spark.master.*` configurations.

### Why are the changes needed?

Currently, `spark.master.*` configurations are undocumented.
```
$ git grep 'ConfigBuilder("spark.master'
core/src/main/scala/org/apache/spark/internal/config/UI.scala:  val MASTER_UI_DECOMMISSION_ALLOW_MODE = ConfigBuilder("spark.master.ui.decommission.allow.mode")
core/src/main/scala/org/apache/spark/internal/config/package.scala:  private[spark] val MASTER_REST_SERVER_ENABLED = ConfigBuilder("spark.master.rest.enabled")
core/src/main/scala/org/apache/spark/internal/config/package.scala:  private[spark] val MASTER_REST_SERVER_PORT = ConfigBuilder("spark.master.rest.port")
core/src/main/scala/org/apache/spark/internal/config/package.scala:  private[spark] val MASTER_UI_PORT = ConfigBuilder("spark.master.ui.port")
core/src/main/scala/org/apache/spark/internal/config/package.scala:    ConfigBuilder("spark.master.ui.historyServerUrl")
core/src/main/scala/org/apache/spark/internal/config/package.scala:    ConfigBuilder("spark.master.useAppNameAsAppId.enabled")
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

![Screenshot 2023-11-16 at 2 48 37 PM](https://github.com/apache/spark/assets/9700541/1fb90997-22be-4b2a-8db6-08f3db1340d9)

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43848 from dongjoon-hyun/SPARK-45961.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 docs/spark-standalone.md | 52 
 1 file changed, 52 insertions(+)

diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md
index c96839c6e95..ce739cb90b5 100644
--- a/docs/spark-standalone.md
+++ b/docs/spark-standalone.md
@@ -190,6 +190,58 @@ SPARK_MASTER_OPTS supports the following system properties:
 
 
 Property NameDefaultMeaningSince 
Version
+
+  spark.master.ui.port
+  8080
+  
+Specifies the port number of the Master Web UI endpoint.
+  
+  1.1.0
+
+
+  spark.master.ui.decommission.allow.mode
+  LOCAL
+  
+Specifies the behavior of the Master Web UI's /workers/kill endpoint. Possible choices
+are: LOCAL means allow this endpoint from IP's that are local to the machine running
+the Master, DENY means to completely disable this endpoint, ALLOW means to allow
+calling this endpoint from any IP.
+  
+  3.1.0
+
+
+  spark.master.ui.historyServerUrl
+  (None)
+  
+The URL where Spark history server is running. Please note that this assumes
+that all Spark jobs share the same event log location where the history server accesses.
+  
+  4.0.0
+
+
+  spark.master.rest.enabled
+  false
+  
+Whether to use the Master REST API endpoint or not.
+  
+  1.3.0
+
+
+  spark.master.rest.port
+  6066
+  
+Specifies the port number of the Master REST API endpoint.
+  
+  1.3.0
+
+
+  spark.master.useAppNameAsAppId.enabled
+  false
+  
+(Experimental) If true, Spark master uses the user-provided appName for appId.
+  
+  4.0.0
+
 
   spark.deploy.retainedApplications
   200



(spark) branch master updated: [SPARK-45958][BUILD] Upgrade Arrow to 14.0.1

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f7859626472 [SPARK-45958][BUILD] Upgrade Arrow to 14.0.1
f7859626472 is described below

commit f785962647236a126cbba0db030af602b28e47d2
Author: Dongjoon Hyun 
AuthorDate: Thu Nov 16 14:31:23 2023 -0800

[SPARK-45958][BUILD] Upgrade Arrow to 14.0.1

### What changes were proposed in this pull request?

This PR aims to upgrade `Apache Arrow` to 14.0.1.

### Why are the changes needed?

- https://arrow.apache.org/release/14.0.1.html

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43846 from dongjoon-hyun/SPARK-45958.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 8 
 pom.xml   | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 0a952aa6ee8..eeb962cd62c 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -16,10 +16,10 @@ antlr4-runtime/4.13.1//antlr4-runtime-4.13.1.jar
 aopalliance-repackaged/2.6.1//aopalliance-repackaged-2.6.1.jar
 arpack/3.0.3//arpack-3.0.3.jar
 arpack_combined_all/0.1//arpack_combined_all-0.1.jar
-arrow-format/14.0.0//arrow-format-14.0.0.jar
-arrow-memory-core/14.0.0//arrow-memory-core-14.0.0.jar
-arrow-memory-netty/14.0.0//arrow-memory-netty-14.0.0.jar
-arrow-vector/14.0.0//arrow-vector-14.0.0.jar
+arrow-format/14.0.1//arrow-format-14.0.1.jar
+arrow-memory-core/14.0.1//arrow-memory-core-14.0.1.jar
+arrow-memory-netty/14.0.1//arrow-memory-netty-14.0.1.jar
+arrow-vector/14.0.1//arrow-vector-14.0.1.jar
 audience-annotations/0.5.0//audience-annotations-0.5.0.jar
 avro-ipc/1.11.3//avro-ipc-1.11.3.jar
 avro-mapred/1.11.3//avro-mapred-1.11.3.jar
diff --git a/pom.xml b/pom.xml
index f4aeb5d935b..7615904e610 100644
--- a/pom.xml
+++ b/pom.xml
@@ -229,7 +229,7 @@
 If you are changing Arrow version specification, please check
 ./python/pyspark/sql/pandas/utils.py, and ./python/setup.py too.
 -->
-14.0.0
+14.0.1
 2.5.11
 
 





(spark) branch master updated: [SPARK-45511][SS] Fix state reader suite flakiness by clean up resources after each test run

2023-11-16 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new aff9eab9039 [SPARK-45511][SS] Fix state reader suite flakiness by 
clean up resources after each test run
aff9eab9039 is described below

commit aff9eab90392f22c0037abdf50e6894615e4dbf9
Author: Chaoqin Li 
AuthorDate: Fri Nov 17 07:27:28 2023 +0900

[SPARK-45511][SS] Fix state reader suite flakiness by clean up resources 
after each test run

### What changes were proposed in this pull request?
Fix state reader suite flakiness by cleaning up resources after each test.

The reason we have to clean up the StateStore per test is the maintenance task. When we
run the streaming query, the state store is initialized in the executor, and registration
is performed against the coordinator in the driver. The lifecycle of the state store
provider is not strictly tied to the lifecycle of the streaming query - the executor
closes the state store provider when the coordinator indicates to the executor that the
state store provider is no longer valid, which is not [...]

This means the maintenance task against the provider can still run after test A. We clear
the temp directory once test A has completed, which can break an operation being performed
against the state store provider used in test A, e.g. the directory no longer exists while
the maintenance task is running.

This won't be an issue in practice because we do not expect the checkpoint location to be
temporary, but it is indeed an issue for how we set up and clean up the environment for
tests.
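
For illustration only (not part of this commit): a minimal sketch of the cleanup pattern
described above, assuming a ScalaTest suite and the same `StateStore` object used in the
diff below.

```
// Editorial sketch: stop StateStore maintenance after each test so a background
// maintenance task cannot touch a temp checkpoint directory the test already deleted.
import org.scalatest.{BeforeAndAfterEach, Suite}

import org.apache.spark.sql.execution.streaming.state.StateStore

trait StateStoreCleanup extends BeforeAndAfterEach { self: Suite =>
  override def afterEach(): Unit = {
    try {
      StateStore.stop() // unload providers and stop the maintenance task
    } finally {
      super.afterEach()
    }
  }
}
```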

### Why are the changes needed?

To deflake the test.

Closes #43831 from chaoqin-li1123/fix_state_reader_suite.

Authored-by: Chaoqin Li 
Signed-off-by: Jungtaek Lim 
---
 .../datasources/v2/state/StateDataSourceTestBase.scala   | 12 
 1 file changed, 12 insertions(+)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/state/StateDataSourceTestBase.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/state/StateDataSourceTestBase.scala
index 890a716bbef..f5392cc823f 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/state/StateDataSourceTestBase.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/state/StateDataSourceTestBase.scala
@@ -20,6 +20,7 @@ import java.sql.Timestamp
 
 import org.apache.spark.sql.{DataFrame, Dataset}
 import org.apache.spark.sql.execution.streaming.MemoryStream
+import org.apache.spark.sql.execution.streaming.state.StateStore
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.streaming._
@@ -28,6 +29,17 @@ import org.apache.spark.sql.streaming.util.StreamManualClock
 trait StateDataSourceTestBase extends StreamTest with StateStoreMetricsTest {
   import testImplicits._
 
+  override def beforeEach(): Unit = {
+super.beforeEach()
+spark.streams.stateStoreCoordinator // initialize the lazy coordinator
+  }
+
+  override def afterEach(): Unit = {
+// Stop maintenance tasks because they may access already deleted 
checkpoint.
+StateStore.stop()
+super.afterEach()
+  }
+
   protected def runCompositeKeyStreamingAggregationQuery(checkpointRoot: 
String): Unit = {
 val inputData = MemoryStream[Int]
 val aggregated = getCompositeKeyStreamingAggregationQuery(inputData)





(spark) branch master updated: [SPARK-45946][SS] Fix use of deprecated FileUtils write to pass default charset in RocksDBSuite

2023-11-16 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new bbe95cfcd05 [SPARK-45946][SS] Fix use of deprecated FileUtils write to 
pass default charset in RocksDBSuite
bbe95cfcd05 is described below

commit bbe95cfcd05728dca3810bbbf72c663729296587
Author: Anish Shrigondekar 
AuthorDate: Fri Nov 17 07:12:05 2023 +0900

[SPARK-45946][SS] Fix use of deprecated FileUtils write to pass default 
charset in RocksDBSuite

### What changes were proposed in this pull request?
Fix use of deprecated FileUtils write to pass default charset in 
RocksDBSuite

### Why are the changes needed?
Without the change, we were getting this compilation warning:
```
[warn] 
/Users/anish.shrigondekar/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala:854:17:
 method write in class FileUtils is deprecated
[warn] Applicable -Wconf / nowarn filters for this warning: msg=, cat=deprecation, 
site=org.apache.spark.sql.execution.streaming.state.RocksDBSuite, 
origin=org.apache.commons.io.FileUtils.write
[warn]   FileUtils.write(file2, s"v2\n$json2")
[warn] ^
[warn] 
/Users/anish.shrigondekar/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala:1272:17:
 method write in class FileUtils is deprecated
[warn] Applicable -Wconf / nowarn filters for this warning: msg=, cat=deprecation, 
site=org.apache.spark.sql.execution.streaming.state.RocksDBSuite.generateFiles.$anonfun,
 origin=org.apache.commons.io.FileUtils.write
[warn]   FileUtils.write(file, "a" * length)
[warn]
```
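
For reference, a minimal sketch (assuming commons-io on the classpath; the file name is
hypothetical) of the non-deprecated overload that the fix below switches to, with the
charset passed explicitly:

```
// Editorial sketch: FileUtils.write with an explicit charset, as in the fix below.
import java.io.File
import java.nio.charset.Charset

import org.apache.commons.io.FileUtils

object WriteWithCharsetSketch {
  def main(args: Array[String]): Unit = {
    val file = new File("metadata.json") // hypothetical file, for illustration only
    // FileUtils.write(File, CharSequence) is deprecated; pass the charset explicitly.
    FileUtils.write(file, """{"sstFiles":[],"numKeys":0}""", Charset.defaultCharset())
    println(FileUtils.readFileToString(file, Charset.defaultCharset()))
  }
}
```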

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran the test suite:

```
22:47:45.700 WARN 
org.apache.spark.sql.execution.streaming.state.RocksDBSuite:

= POSSIBLE THREAD LEAK IN SUITE 
o.a.s.sql.execution.streaming.state.RocksDBSuite, threads: 
ForkJoinPool.commonPool-worker-6 (daemon=true), 
ForkJoinPool.commonPool-worker-4 (daemon=true), rpc-boss-3-1 (daemon=true), 
ForkJoinPool.commonPool-worker-5 (daemon=true), 
ForkJoinPool.commonPool-worker-3 (daemon=true), 
ForkJoinPool.commonPool-worker-2 (daemon=true), shuffle-boss-6-1 (daemon=true), 
ForkJoinPool.commonPool-worker-1 (daemon=true) =
[info] Run completed in 1 minute, 55 seconds.
[info] Total number of tests run: 77
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 77, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 172 s (02:52), completed Nov 15, 2023, 10:47:46 PM
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43832 from anishshri-db/task/SPARK-45946.

Authored-by: Anish Shrigondekar 
Signed-off-by: Jungtaek Lim 
---
 .../org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala
index ddef26224f2..e290f808f56 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala
@@ -851,7 +851,7 @@ class RocksDBSuite extends 
AlsoTestWithChangelogCheckpointingEnabled with Shared
 withTempDir { dir =>
   val file2 = new File(dir, "json")
   val json2 = """{"sstFiles":[],"numKeys":0}"""
-  FileUtils.write(file2, s"v2\n$json2")
+  FileUtils.write(file2, s"v2\n$json2", Charset.defaultCharset)
   val e = intercept[SparkException] {
 RocksDBCheckpointMetadata.readFromFile(file2)
   }
@@ -1269,7 +1269,7 @@ class RocksDBSuite extends 
AlsoTestWithChangelogCheckpointingEnabled with Shared
   def generateFiles(dir: String, fileToLengths: Seq[(String, Int)]): Unit = {
 fileToLengths.foreach { case (fileName, length) =>
   val file = new File(dir, fileName)
-  FileUtils.write(file, "a" * length)
+  FileUtils.write(file, "a" * length, Charset.defaultCharset)
 }
   }
 





(spark) branch master updated: [SPARK-45953][INFRA] Add `Python 3.10` to Infra docker image

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0b7736c1d12 [SPARK-45953][INFRA] Add `Python 3.10` to Infra docker 
image
0b7736c1d12 is described below

commit 0b7736c1d121947e418a356cf0431d9d7e969c90
Author: Dongjoon Hyun 
AuthorDate: Thu Nov 16 13:37:38 2023 -0800

[SPARK-45953][INFRA] Add `Python 3.10` to Infra docker image

### What changes were proposed in this pull request?

This PR aims to add `Python 3.10` to Infra docker images.

### Why are the changes needed?

This is a preparation to add a daily `Python 3.10` GitHub Action job later 
for Apache Spark 4.0.0.

Note that Python 3.10 is installed at the last step to avoid the following issue, which
happens when we install Python 3.9 and 3.10 in the same stage via the package manager.
```
#21 13.03 ERROR: Cannot uninstall 'blinker'. It is a distutils installed 
project and thus we cannot accurately determine which files belong to it which 
would lead to only a partial uninstall.
#21 ERROR: process "/bin/sh -c python3.9 -m pip install numpy 
'pyarrow>=14.0.0' 'pandas<=2.1.3' scipy unittest-xml-reporting plotly>=4.8 
'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 
'scikit-learn==1.1.*'" did not complete successfully: exit code: 1
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

1. I verified that the Python CI is not affected and still uses Python 3.9.5 only.
```

Running PySpark tests

Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log
Will test against the following Python executables: ['python3.9']
Will test the following Python modules: ['pyspark-errors']
python3.9 python_implementation is CPython
python3.9 version is: Python 3.9.5
Starting test(python3.9): pyspark.errors.tests.test_errors (temp output: 
/__w/spark/spark/python/target/fd967f24-3607-4aa6-8190-3f8d7de522e1/python3.9__pyspark.errors.tests.test_errors___zauwgy1.log)
Finished test(python3.9): pyspark.errors.tests.test_errors (0s)
Tests passed in 0 seconds
```

2. Pass `Base Image Build` step for new Python 3.10.

![Screenshot 2023-11-16 at 10 53 37 
AM](https://github.com/apache/spark/assets/9700541/6bbb3461-c5f0-4d60-94f6-7cd8df0594ed)

3. Since the new Python 3.10 is not used in CI, we need to validate it manually, as
follows.

```
$ docker run -it --rm 
ghcr.io/dongjoon-hyun/apache-spark-ci-image:master-6895105871 python3.10 
--version
Python 3.10.13
```

```
$ docker run -it --rm 
ghcr.io/dongjoon-hyun/apache-spark-ci-image:master-6895105871 python3.10 -m pip 
freeze
alembic==1.12.1
annotated-types==0.6.0
blinker==1.7.0
certifi==2019.11.28
chardet==3.0.4
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==2.2.1
contourpy==1.2.0
coverage==7.3.2
cycler==0.12.1
databricks-cli==0.18.0
dbus-python==1.2.16
deepspeed==0.12.3
distro-info==0.23+ubuntu1.1
docker==6.1.3
entrypoints==0.4
et-xmlfile==1.1.0
filelock==3.9.0
Flask==3.0.0
fonttools==4.44.3
gitdb==4.0.11
GitPython==3.1.40
googleapis-common-protos==1.56.4
greenlet==3.0.1
grpcio==1.56.2
grpcio-status==1.48.2
gunicorn==21.2.0
hjson==3.1.0
idna==2.8
importlib-metadata==6.8.0
itsdangerous==2.1.2
Jinja2==3.1.2
joblib==1.3.2
kiwisolver==1.4.5
lxml==4.9.3
Mako==1.3.0
Markdown==3.5.1
MarkupSafe==2.1.3
matplotlib==3.8.1
memory-profiler==0.60.0
mlflow==2.8.1
mpmath==1.3.0
networkx==3.0
ninja==1.11.1.1
numpy==1.26.2
oauthlib==3.2.2
openpyxl==3.1.2
packaging==23.2
pandas==2.1.3
Pillow==10.1.0
plotly==5.18.0
protobuf==3.20.3
psutil==5.9.6
py-cpuinfo==9.0.0
pyarrow==14.0.1
pydantic==2.5.1
pydantic_core==2.14.3
PyGObject==3.36.0
PyJWT==2.8.0
pynvml==11.5.0
pyparsing==3.1.1
python-apt==2.0.1+ubuntu0.20.4.1
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
querystring-parser==1.2.4
requests==2.31.0
requests-unixsocket==0.2.0
scikit-learn==1.1.3
scipy==1.11.3
six==1.14.0
smmap==5.0.1
SQLAlchemy==2.0.23
sqlparse==0.4.4
sympy==1.12
tabulate==0.9.0
tenacity==8.2.3
threadpoolctl==3.2.0
torch==2.0.1+cpu
torcheval==0.0.7
torchvision==0.15.2+cpu
tqdm==4.66.1
typing_extensions==4.8.0
tzdata==2023.3
unattended-upgrades==0.1
unittest-xml-reporting==3.2.0
urllib3==2.1.0

(spark) branch master updated: [SPARK-45955][UI] Collapse Support for Flamegraph and thread dump details

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new fab408f018e [SPARK-45955][UI] Collapse Support for Flamegraph and 
thread dump details
fab408f018e is described below

commit fab408f018e8bd77574a87ec72dee194d199aebc
Author: Kent Yao 
AuthorDate: Thu Nov 16 10:59:19 2023 -0800

[SPARK-45955][UI] Collapse Support for Flamegraph and thread dump details

### What changes were proposed in this pull request?

This PR adds collapse support for the flamegraph and thread dump details, like other
pages.


https://github.com/apache/spark/assets/8326978/ea5e224b-7edf-4bcd-bd83-c4243cdd7e58

### Why are the changes needed?

UX improvement for UI

### Does this PR introduce _any_ user-facing change?

yes, UI changes

### How was this patch tested?

As shown in the video above.

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #43842 from yaooqinn/SPARK-45955.

Authored-by: Kent Yao 
Signed-off-by: Dongjoon Hyun 
---
 .../org/apache/spark/ui/static/flamegraph.js | 14 +-
 .../resources/org/apache/spark/ui/static/table.js| 15 ++-
 .../spark/ui/exec/ExecutorThreadDumpPage.scala   | 20 +---
 3 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/core/src/main/resources/org/apache/spark/ui/static/flamegraph.js 
b/core/src/main/resources/org/apache/spark/ui/static/flamegraph.js
index c298dbaeed6..aeb80b280a3 100644
--- a/core/src/main/resources/org/apache/spark/ui/static/flamegraph.js
+++ b/core/src/main/resources/org/apache/spark/ui/static/flamegraph.js
@@ -15,7 +15,7 @@
  * limitations under the License.
  */
 
-/* global d3, flamegraph */
+/* global $, d3, flamegraph */
 
 /* eslint-disable no-unused-vars */
 function drawFlamegraph() {
@@ -33,4 +33,16 @@ function drawFlamegraph() {
 .call(chart);
   window.onresize = () => chart.width(width);
 }
+
+function toggleFlamegraph() {
+  const arrow = d3.select("#executor-flamegraph-arrow");
+  arrow.each(function () {
+$(this).toggleClass("arrow-open").toggleClass("arrow-closed")
+  });
+  if (arrow.classed("arrow-open")) {
+d3.select("#executor-flamegraph-chart").style("display", "block");
+  } else {
+d3.select("#executor-flamegraph-chart").style("display", "none");
+  }
+}
 /* eslint-enable no-unused-vars */
diff --git a/core/src/main/resources/org/apache/spark/ui/static/table.js 
b/core/src/main/resources/org/apache/spark/ui/static/table.js
index 0203748cf7d..839746762f4 100644
--- a/core/src/main/resources/org/apache/spark/ui/static/table.js
+++ b/core/src/main/resources/org/apache/spark/ui/static/table.js
@@ -15,7 +15,7 @@
  * limitations under the License.
  */
 
-/* global $ */
+/* global $, d3, collapseTable */
 /* eslint-disable no-unused-vars */
 /* Adds background colors to stripe table rows in the summary table (on the 
stage page). This is
  * necessary (instead of using css or the table striping provided by 
bootstrap) because the summary
@@ -109,3 +109,16 @@ function onSearchStringChange() {
   }
 }
 /* eslint-enable no-unused-vars */
+
+/* eslint-disable no-unused-vars */
+function collapseTableAndButton(thisName, table) {
+  collapseTable(thisName, table);
+
+  const t = d3.select("." + table);
+  if (t.classed("collapsed")) {
+d3.select("." + table + "-button").style("display", "none");
+  } else {
+d3.select("." + table + "-button").style("display", "flex");
+  }
+}
+/* eslint-enable no-unused-vars */
diff --git 
a/core/src/main/scala/org/apache/spark/ui/exec/ExecutorThreadDumpPage.scala 
b/core/src/main/scala/org/apache/spark/ui/exec/ExecutorThreadDumpPage.scala
index 328abdb5c5f..01d29897bef 100644
--- a/core/src/main/scala/org/apache/spark/ui/exec/ExecutorThreadDumpPage.scala
+++ b/core/src/main/scala/org/apache/spark/ui/exec/ExecutorThreadDumpPage.scala
@@ -82,7 +82,13 @@ private[ui] class ExecutorThreadDumpPage(
 {
   // scalastyle:off
   
-  
+  
+
+  
+  Thread Stack Trace
+
+  
+  
 Expand All
 Collapse All
 Download
@@ -98,9 +104,8 @@ private[ui] class ExecutorThreadDumpPage(
 
   
   
-  // scalastyle:on
 }
-
+
   
 Thread ID
 Thread Name
@@ -118,11 +123,20 @@ private[ui] class ExecutorThreadDumpPage(
 
 }.getOrElse(Text("Error fetching thread dump"))
 UIUtils.headerSparkPage(request, s"Thread dump for executor $executorId", 
content, parent)
+// scalastyle:on
   }
 
   // scalastyle:off
   private def drawExecutorFlamegraph(request: HttpServletRequest, thread: 
Array[ThreadStackTrace]): 

(spark) branch branch-3.3 updated: [SPARK-45920][SQL][3.3] group by ordinal should be idempotent

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new c5b874da719 [SPARK-45920][SQL][3.3] group by ordinal should be 
idempotent
c5b874da719 is described below

commit c5b874da719183b2acdece3391be8493272f9d58
Author: Wenchen Fan 
AuthorDate: Thu Nov 16 08:20:22 2023 -0800

[SPARK-45920][SQL][3.3] group by ordinal should be idempotent

backport https://github.com/apache/spark/pull/43797

### What changes were proposed in this pull request?

GROUP BY ordinal is not idempotent today. If the ordinal points to another
integer literal and the plan gets analyzed again, we will re-do the ordinal
resolution, which can lead to a wrong result or an index out-of-bound error. This PR
fixes it with a hack: if the ordinal points to another integer literal,
don't replace the ordinal.
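
For illustration only (not part of this commit, and assuming an existing `SparkSession`
named `spark`): the affected query shape is one where the ordinal points at a select-list
item that is itself an integer literal.

```
// Editorial sketch: GROUP BY 1 resolves to the first output column, the literal 100.
val df = spark.sql("SELECT 100 AS a FROM VALUES (1), (2) AS t(x) GROUP BY 1")
df.show()
// A single analysis pass is fine. The problem appears only when the already-analyzed
// plan is analyzed again (e.g. by a plugin): the resolved literal 100 would be re-read
// as an ordinal and point past the single output column. The fix keeps the original
// ordinal literal in that case, so repeated analysis becomes a no-op.
```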

### Why are the changes needed?

Advanced users or Spark plugins may manipulate the logical plans directly, so we need to
make the framework more reliable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #43839 from cloud-fan/3.3-port.

Authored-by: Wenchen Fan 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/catalyst/analysis/Analyzer.scala | 14 -
 .../SubstituteUnresolvedOrdinalsSuite.scala| 23 --
 2 files changed, 34 insertions(+), 3 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index d7bba23cf68..b4e520dd2e6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -1865,7 +1865,19 @@ class Analyzer(override val catalogManager: 
CatalogManager)
   throw 
QueryCompilationErrors.groupByPositionRefersToAggregateFunctionError(
 index, ordinalExpr)
 } else {
-  ordinalExpr
+  trimAliases(ordinalExpr) match {
+// HACK ALERT: If the ordinal expression is also an integer literal, don't use it
+// but still keep the ordinal literal. The reason is we may repeatedly
+// analyze the plan. Using a different integer literal may lead to
+// a repeat GROUP BY ordinal resolution which is wrong. GROUP BY
+// constant is meaningless so whatever value does not matter here.
+// TODO: (SPARK-45932) GROUP BY ordinal should pull out grouping expressions to
+//   a Project, then the resolved ordinal expression is always
+//   `AttributeReference`.
+case Literal(_: Int, IntegerType) =>
+  Literal(index)
+case _ => ordinalExpr
+  }
 }
   } else {
 throw QueryCompilationErrors.groupByPositionRangeError(index, 
aggs.size)
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinalsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinalsSuite.scala
index c0312282c76..99fa62532f3 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinalsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinalsSuite.scala
@@ -17,10 +17,11 @@
 
 package org.apache.spark.sql.catalyst.analysis
 
-import org.apache.spark.sql.catalyst.analysis.TestRelations.testRelation2
+import org.apache.spark.sql.catalyst.analysis.TestRelations.{testRelation, 
testRelation2}
 import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.dsl.plans._
-import org.apache.spark.sql.catalyst.expressions.Literal
+import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, Literal}
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
 import org.apache.spark.sql.internal.SQLConf
 
 class SubstituteUnresolvedOrdinalsSuite extends AnalysisTest {
@@ -67,4 +68,22 @@ class SubstituteUnresolvedOrdinalsSuite extends AnalysisTest 
{
 testRelation2.groupBy(Literal(1), Literal(2))('a, 'b))
 }
   }
+
+  test("SPARK-45920: group by ordinal repeated analysis") {
+val plan = testRelation.groupBy(Literal(1))(Literal(100).as("a")).analyze
+comparePlans(
+  plan,
+  testRelation.groupBy(Literal(1))(Literal(100).as("a"))
+

(spark) branch branch-3.4 updated: [SPARK-45920][SQL][3.4] group by ordinal should be idempotent

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new f927c0f24c5 [SPARK-45920][SQL][3.4] group by ordinal should be 
idempotent
f927c0f24c5 is described below

commit f927c0f24c51f10e4e56a09a15795fe4df0e007a
Author: Wenchen Fan 
AuthorDate: Thu Nov 16 08:18:40 2023 -0800

[SPARK-45920][SQL][3.4] group by ordinal should be idempotent

backport https://github.com/apache/spark/pull/43797

### What changes were proposed in this pull request?

GROUP BY ordinal is not idempotent today. If the ordinal points to another
integer literal and the plan gets analyzed again, we will re-do the ordinal
resolution, which can lead to a wrong result or an index out-of-bound error. This PR
fixes it with a hack: if the ordinal points to another integer literal,
don't replace the ordinal.

### Why are the changes needed?

Advanced users or Spark plugins may manipulate the logical plans directly, so we need to
make the framework more reliable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #43838 from cloud-fan/3.4-port.

Authored-by: Wenchen Fan 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/catalyst/analysis/Analyzer.scala | 14 -
 .../SubstituteUnresolvedOrdinalsSuite.scala| 23 --
 2 files changed, 34 insertions(+), 3 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index b7d174089bc..c2efac4c84f 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -1993,7 +1993,19 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
   throw 
QueryCompilationErrors.groupByPositionRefersToAggregateFunctionError(
 index, ordinalExpr)
 } else {
-  ordinalExpr
+  trimAliases(ordinalExpr) match {
+// HACK ALERT: If the ordinal expression is also an integer literal, don't use it
+// but still keep the ordinal literal. The reason is we may repeatedly
+// analyze the plan. Using a different integer literal may lead to
+// a repeat GROUP BY ordinal resolution which is wrong. GROUP BY
+// constant is meaningless so whatever value does not matter here.
+// TODO: (SPARK-45932) GROUP BY ordinal should pull out grouping expressions to
+//   a Project, then the resolved ordinal expression is always
+//   `AttributeReference`.
+case Literal(_: Int, IntegerType) =>
+  Literal(index)
+case _ => ordinalExpr
+  }
 }
   } else {
 throw QueryCompilationErrors.groupByPositionRangeError(index, 
aggs.size)
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinalsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinalsSuite.scala
index b0d7ace646e..953b2c8bb10 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinalsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinalsSuite.scala
@@ -17,10 +17,11 @@
 
 package org.apache.spark.sql.catalyst.analysis
 
-import org.apache.spark.sql.catalyst.analysis.TestRelations.testRelation2
+import org.apache.spark.sql.catalyst.analysis.TestRelations.{testRelation, 
testRelation2}
 import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.dsl.plans._
-import org.apache.spark.sql.catalyst.expressions.Literal
+import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, Literal}
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
 import org.apache.spark.sql.internal.SQLConf
 
 class SubstituteUnresolvedOrdinalsSuite extends AnalysisTest {
@@ -67,4 +68,22 @@ class SubstituteUnresolvedOrdinalsSuite extends AnalysisTest 
{
 testRelation2.groupBy(Literal(1), Literal(2))($"a", $"b"))
 }
   }
+
+  test("SPARK-45920: group by ordinal repeated analysis") {
+val plan = testRelation.groupBy(Literal(1))(Literal(100).as("a")).analyze
+comparePlans(
+  plan,
+  

(spark) branch branch-3.5 updated: [SPARK-45920][SQL][3.5] group by ordinal should be idempotent

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new f0054c5a10b [SPARK-45920][SQL][3.5] group by ordinal should be 
idempotent
f0054c5a10b is described below

commit f0054c5a10bf388688e7b2914cb639c96ffdd8f3
Author: Wenchen Fan 
AuthorDate: Thu Nov 16 08:16:20 2023 -0800

[SPARK-45920][SQL][3.5] group by ordinal should be idempotent

backport https://github.com/apache/spark/pull/43797

### What changes were proposed in this pull request?

GROUP BY ordinal is not idempotent today. If the ordinal points to another
integer literal and the plan gets analyzed again, we will re-do the ordinal
resolution, which can lead to a wrong result or an index out-of-bound error. This PR
fixes it with a hack: if the ordinal points to another integer literal,
don't replace the ordinal.

### Why are the changes needed?

Advanced users or Spark plugins may manipulate the logical plans directly, so we need to
make the framework more reliable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #43836 from cloud-fan/3.5-port.

Authored-by: Wenchen Fan 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/catalyst/analysis/Analyzer.scala | 14 -
 .../SubstituteUnresolvedOrdinalsSuite.scala| 23 --
 .../analyzer-results/group-by-ordinal.sql.out  |  2 +-
 3 files changed, 35 insertions(+), 4 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 80cb5d8c608..02b9c244543 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -1970,7 +1970,19 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
   throw 
QueryCompilationErrors.groupByPositionRefersToAggregateFunctionError(
 index, ordinalExpr)
 } else {
-  ordinalExpr
+  trimAliases(ordinalExpr) match {
+// HACK ALERT: If the ordinal expression is also an integer literal, don't use it
+// but still keep the ordinal literal. The reason is we may repeatedly
+// analyze the plan. Using a different integer literal may lead to
+// a repeat GROUP BY ordinal resolution which is wrong. GROUP BY
+// constant is meaningless so whatever value does not matter here.
+// TODO: (SPARK-45932) GROUP BY ordinal should pull out grouping expressions to
+//   a Project, then the resolved ordinal expression is always
+//   `AttributeReference`.
+case Literal(_: Int, IntegerType) =>
+  Literal(index)
+case _ => ordinalExpr
+  }
 }
   } else {
 throw QueryCompilationErrors.groupByPositionRangeError(index, 
aggs.size)
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinalsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinalsSuite.scala
index b0d7ace646e..953b2c8bb10 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinalsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinalsSuite.scala
@@ -17,10 +17,11 @@
 
 package org.apache.spark.sql.catalyst.analysis
 
-import org.apache.spark.sql.catalyst.analysis.TestRelations.testRelation2
+import org.apache.spark.sql.catalyst.analysis.TestRelations.{testRelation, 
testRelation2}
 import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.dsl.plans._
-import org.apache.spark.sql.catalyst.expressions.Literal
+import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, Literal}
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
 import org.apache.spark.sql.internal.SQLConf
 
 class SubstituteUnresolvedOrdinalsSuite extends AnalysisTest {
@@ -67,4 +68,22 @@ class SubstituteUnresolvedOrdinalsSuite extends AnalysisTest 
{
 testRelation2.groupBy(Literal(1), Literal(2))($"a", $"b"))
 }
   }
+
+  test("SPARK-45920: group by ordinal repeated analysis") {
+val plan = testRelation.groupBy(Literal(1))(Literal(100).as("a")).analyze
+

(spark) branch master updated: [SPARK-45951][INFRA] Upgrade `buf` to v1.28.1

2023-11-16 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 334d952f955 [SPARK-45951][INFRA] Upgrade `buf` to v1.28.1
334d952f955 is described below

commit 334d952f9555cbfad8ef84987d6f978eb6b37b9b
Author: Ruifeng Zheng 
AuthorDate: Thu Nov 16 21:41:41 2023 +0800

[SPARK-45951][INFRA] Upgrade `buf` to v1.28.1

### What changes were proposed in this pull request?
Upgrade `buf` to v1.28.1

### Why are the changes needed?
`buf` was upgraded to 1.26.1 two months ago, so I think it is time to 
upgrade it again.

This upgrade causes no changes in the generated code, and it fixes multiple
issues:

https://github.com/bufbuild/buf/releases

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #43835 from zhengruifeng/connect_buf_1_28_1.

Lead-authored-by: Ruifeng Zheng 
Co-authored-by: Ruifeng Zheng 
Signed-off-by: yangjie01 
---
 .github/workflows/build_and_test.yml| 2 +-
 python/docs/source/development/contributing.rst | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 25af93af280..95ce051f32f 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -704,7 +704,7 @@ jobs:
   if: inputs.branch != 'branch-3.3' && inputs.branch != 'branch-3.4'
   run: |
 # See more in "Installation" 
https://docs.buf.build/installation#tarball
-curl -LO 
https://github.com/bufbuild/buf/releases/download/v1.26.1/buf-Linux-x86_64.tar.gz
+curl -LO 
https://github.com/bufbuild/buf/releases/download/v1.28.1/buf-Linux-x86_64.tar.gz
 mkdir -p $HOME/buf
 tar -xvzf buf-Linux-x86_64.tar.gz -C $HOME/buf --strip-components 1
 rm buf-Linux-x86_64.tar.gz
diff --git a/python/docs/source/development/contributing.rst 
b/python/docs/source/development/contributing.rst
index d6d5283c1e3..ad61ba95d69 100644
--- a/python/docs/source/development/contributing.rst
+++ b/python/docs/source/development/contributing.rst
@@ -120,7 +120,7 @@ Prerequisite
 
 PySpark development requires to build Spark that needs a proper JDK installed, 
etc. See `Building Spark 
`_ for more details.
 
-Note that if you intend to contribute to Spark Connect in Python, ``buf`` 
version ``1.26.1`` is required, see `Buf Installation 
`_ for more details.
+Note that if you intend to contribute to Spark Connect in Python, ``buf`` 
version ``1.28.1`` is required, see `Buf Installation 
`_ for more details.
 
 Conda
 ~





(spark) branch master updated: [SPARK-45851][CONNECT][SCALA] Support multiple policies in scala client

2023-11-16 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 182e2d236c5 [SPARK-45851][CONNECT][SCALA] Support multiple policies in 
scala client
182e2d236c5 is described below

commit 182e2d236c5c39f3c4dba248d6df77eb9c363dfd
Author: Alice Sayutina 
AuthorDate: Thu Nov 16 19:41:10 2023 +0900

[SPARK-45851][CONNECT][SCALA] Support multiple policies in scala client

### What changes were proposed in this pull request?

Support multiple retry policies defined at the same time. Each policy
determines which error types it can retry and how exactly the retries should be
spread out.

Scala parity for https://github.com/apache/spark/pull/43591
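
For illustration only (not part of this commit): a sketch based on the names visible in
the diff below (`RetryPolicy`, `SparkConnectClient.builder()` with `connectionString` and
`retryPolicy`); the endpoint and the error predicate are assumptions.

```
// Editorial sketch: one policy that only retries transient gRPC failures. With this
// change, several such policies, each with its own canRetry predicate, can coexist.
import io.grpc.StatusRuntimeException

import org.apache.spark.sql.connect.client.{RetryPolicy, SparkConnectClient}

object RetryPolicySketch {
  def main(args: Array[String]): Unit = {
    val transientGrpcPolicy = RetryPolicy(
      maxRetries = Some(3),
      canRetry = _.isInstanceOf[StatusRuntimeException], // error types this policy handles
      name = "TransientGrpcPolicy")

    val client = SparkConnectClient
      .builder()
      .connectionString("sc://localhost:15002") // assumed local Spark Connect endpoint
      .retryPolicy(transientGrpcPolicy)
      .build()

    println(client)
  }
}
```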

### Why are the changes needed?

Different error types should be treated differently. For instance, network connectivity
errors and remote resources that are still being initialized should be handled
separately.

### Does this PR introduce _any_ user-facing change?
No (as long as the user doesn't poke into client internals).

### How was this patch tested?
Unit tests, some hand testing.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43757 from cdkrot/SPARK-45851-scala-multiple-policies.

Authored-by: Alice Sayutina 
Signed-off-by: Hyukjin Kwon 
---
 .../spark/sql/connect/client/ArtifactSuite.scala   |   4 +-
 .../connect/client/SparkConnectClientSuite.scala   |  71 +++--
 .../apache/spark/sql/test/RemoteSparkSession.scala |   7 +-
 .../client/CustomSparkConnectBlockingStub.scala|   4 +-
 .../ExecutePlanResponseReattachableIterator.scala  |  16 +-
 .../sql/connect/client/GrpcRetryHandler.scala  | 166 -
 .../spark/sql/connect/client/RetriesExceeded.scala |  25 
 .../spark/sql/connect/client/RetryPolicy.scala | 134 +
 .../sql/connect/client/SparkConnectClient.scala|  12 +-
 .../sql/connect/client/SparkConnectStubState.scala |  10 +-
 .../spark/sql/connect/SparkConnectServerTest.scala |   6 +-
 11 files changed, 311 insertions(+), 144 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/ArtifactSuite.scala
 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/ArtifactSuite.scala
index 79aba053ea0..f945313d242 100644
--- 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/ArtifactSuite.scala
+++ 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/ArtifactSuite.scala
@@ -42,7 +42,6 @@ class ArtifactSuite extends ConnectFunSuite with 
BeforeAndAfterEach {
   private var server: Server = _
   private var artifactManager: ArtifactManager = _
   private var channel: ManagedChannel = _
-  private var retryPolicy: GrpcRetryHandler.RetryPolicy = _
   private var bstub: CustomSparkConnectBlockingStub = _
   private var stub: CustomSparkConnectStub = _
   private var state: SparkConnectStubState = _
@@ -58,8 +57,7 @@ class ArtifactSuite extends ConnectFunSuite with 
BeforeAndAfterEach {
 
   private def createArtifactManager(): Unit = {
 channel = 
InProcessChannelBuilder.forName(getClass.getName).directExecutor().build()
-retryPolicy = GrpcRetryHandler.RetryPolicy()
-state = new SparkConnectStubState(channel, retryPolicy)
+state = new SparkConnectStubState(channel, RetryPolicy.defaultPolicies())
 bstub = new CustomSparkConnectBlockingStub(channel, state)
 stub = new CustomSparkConnectStub(channel, state)
 artifactManager = new ArtifactManager(Configuration(), "", bstub, stub)
diff --git 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientSuite.scala
 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientSuite.scala
index b93713383b2..e226484d87a 100644
--- 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientSuite.scala
+++ 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/SparkConnectClientSuite.scala
@@ -119,7 +119,7 @@ class SparkConnectClientSuite extends ConnectFunSuite with 
BeforeAndAfterEach {
 client = SparkConnectClient
   .builder()
   .connectionString(s"sc://localhost:${server.getPort}/;use_ssl=true")
-  .retryPolicy(GrpcRetryHandler.RetryPolicy(maxRetries = 0))
+  .retryPolicy(RetryPolicy(maxRetries = Some(0), canRetry = _ => false, 
name = "TestPolicy"))
   .build()
 
 val request = 
AnalyzePlanRequest.newBuilder().setSessionId("abc123").build()
@@ -311,7 +311,7 @@ class SparkConnectClientSuite extends ConnectFunSuite with 
BeforeAndAfterEach {
 }
   }
 
-  private class DummyFn(val e: Throwable, numFails: Int = 3) {
+  private 

(spark) branch branch-3.3 updated: [SPARK-45764][PYTHON][DOCS][3.3] Make code block copyable

2023-11-16 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new bb0dadd9761 [SPARK-45764][PYTHON][DOCS][3.3] Make code block copyable
bb0dadd9761 is described below

commit bb0dadd97613cd2779781da4f6ddb6869e4007e4
Author: panbingkun 
AuthorDate: Thu Nov 16 18:11:07 2023 +0800

[SPARK-45764][PYTHON][DOCS][3.3] Make code block copyable

### What changes were proposed in this pull request?
The PR aims to make code blocks copyable in the PySpark docs.
This backports the change to `branch-3.3`.
Master branch PR: https://github.com/apache/spark/pull/43799

### Why are the changes needed?
It improves the usability of the PySpark documentation.

### Does this PR introduce _any_ user-facing change?
Yes, users will be able to easily copy code blocks in the PySpark docs.

### How was this patch tested?
- Manually test.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43830 from panbingkun/branch-3.3_SPARK-45764.

Authored-by: panbingkun 
Signed-off-by: Ruifeng Zheng 
---
 .github/workflows/build_and_test.yml |  2 +-
 LICENSE  |  5 ---
 dev/create-release/spark-rm/Dockerfile   |  2 +-
 dev/requirements.txt |  1 +
 licenses/LICENSE-copybutton.txt  | 49 
 python/docs/source/_static/copybutton.js | 66 
 python/docs/source/conf.py   |  7 ++--
 7 files changed, 7 insertions(+), 125 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 0dc23a3788a..1ab5cc9c3b5 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -535,7 +535,7 @@ jobs:
 #   See also https://issues.apache.org/jira/browse/SPARK-35375.
 # Pin the MarkupSafe to 2.0.1 to resolve the CI error.
 #   See also https://issues.apache.org/jira/browse/SPARK-38279.
-python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 
ipython nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 'pyzmq<24.0.0'
+python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 
'sphinx-copybutton==0.5.2' ipython nbsphinx numpydoc 'jinja2<3.0.0' 
'markupsafe==2.0.1' 'pyzmq<24.0.0'
 python3.9 -m pip install ipython_genutils # See SPARK-38517
 python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 
pyarrow pandas 'plotly>=4.8' 
 python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421
diff --git a/LICENSE b/LICENSE
index df6bed16f44..ed5006b5546 100644
--- a/LICENSE
+++ b/LICENSE
@@ -219,11 +219,6 @@ docs/js/vendor/bootstrap.js
 
external/spark-ganglia-lgpl/src/main/java/com/codahale/metrics/ganglia/GangliaReporter.java
 
 
-Python Software Foundation License
---
-
-python/docs/source/_static/copybutton.js
-
 BSD 3-Clause
 
 
diff --git a/dev/create-release/spark-rm/Dockerfile 
b/dev/create-release/spark-rm/Dockerfile
index c6555e0463d..ca3b1f39413 100644
--- a/dev/create-release/spark-rm/Dockerfile
+++ b/dev/create-release/spark-rm/Dockerfile
@@ -42,7 +42,7 @@ ARG APT_INSTALL="apt-get install --no-install-recommends -y"
 #   We should use the latest Sphinx version once this is fixed.
 # TODO(SPARK-35375): Jinja2 3.0.0+ causes error when building with Sphinx.
 #   See also https://issues.apache.org/jira/browse/SPARK-35375.
-ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.1.2 numpy==1.19.4 
pydata_sphinx_theme==0.4.1 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 
jinja2==2.11.3 twine==3.4.1 sphinx-plotly-directive==0.1.3 pandas==1.1.5 
pyarrow==3.0.0 plotly==5.4.0 markupsafe==2.0.1 docutils<0.17"
+ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.1.2 numpy==1.19.4 
pydata_sphinx_theme==0.4.1 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 
jinja2==2.11.3 twine==3.4.1 sphinx-plotly-directive==0.1.3 
sphinx-copybutton==0.5.2 pandas==1.1.5 pyarrow==3.0.0 plotly==5.4.0 
markupsafe==2.0.1 docutils<0.17"
 ARG GEM_PKGS="bundler:2.2.9"
 
 # Install extra needed repos and refresh.
diff --git a/dev/requirements.txt b/dev/requirements.txt
index 79a70624312..d5114ebbca5 100644
--- a/dev/requirements.txt
+++ b/dev/requirements.txt
@@ -35,6 +35,7 @@ numpydoc
 jinja2<3.0.0
 sphinx<3.1.0
 sphinx-plotly-directive
+sphinx-copybutton<0.5.3
 docutils<0.18.0
 
 # Development scripts
diff --git a/licenses/LICENSE-copybutton.txt b/licenses/LICENSE-copybutton.txt
deleted file mode 100644
index 45be6b83a53..000
--- a/licenses/LICENSE-copybutton.txt
+++ /dev/null
@@ -1,49 +0,0 @@
-PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2
-
-
-1. This LICENSE AGREEMENT is between the Python Software Foundation
-("PSF"), and the 

(spark) branch branch-3.4 updated: [SPARK-45764][PYTHON][DOCS][3.4] Make code block copyable

2023-11-16 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new c58df4a7adf [SPARK-45764][PYTHON][DOCS][3.4] Make code block copyable
c58df4a7adf is described below

commit c58df4a7adfb9cbdfde407092416fc0dbf5e2867
Author: panbingkun 
AuthorDate: Thu Nov 16 18:09:10 2023 +0800

[SPARK-45764][PYTHON][DOCS][3.4] Make code block copyable

### What changes were proposed in this pull request?
The PR aims to make code blocks copyable in the PySpark docs.
This backports the change to `branch-3.4`.
Master branch PR: https://github.com/apache/spark/pull/43799

### Why are the changes needed?
It improves the usability of the PySpark documentation.

### Does this PR introduce _any_ user-facing change?
Yes, users will be able to easily copy code blocks in the PySpark docs.

### How was this patch tested?
- Manually test.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43828 from panbingkun/branch-3.4_SPARK-45764.

Authored-by: panbingkun 
Signed-off-by: Ruifeng Zheng 
---
 .github/workflows/build_and_test.yml |  2 +-
 LICENSE  |  5 ---
 dev/create-release/spark-rm/Dockerfile   |  2 +-
 dev/requirements.txt |  1 +
 licenses/LICENSE-copybutton.txt  | 49 ---
 python/docs/source/_static/copybutton.js | 67 
 python/docs/source/conf.py   |  7 ++--
 7 files changed, 7 insertions(+), 126 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index aff173b8e51..2dd78581db2 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -620,7 +620,7 @@ jobs:
 #   See also https://issues.apache.org/jira/browse/SPARK-35375.
 # Pin the MarkupSafe to 2.0.1 to resolve the CI error.
 #   See also https://issues.apache.org/jira/browse/SPARK-38279.
-python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 
nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 'pyzmq<24.0.0'
+python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 
'sphinx-copybutton==0.5.2' nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 
'pyzmq<24.0.0'
 python3.9 -m pip install ipython_genutils # See SPARK-38517
 python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 
pyarrow pandas 'plotly>=4.8'
 python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421
diff --git a/LICENSE b/LICENSE
index 012fdbca4c9..f4564cf6118 100644
--- a/LICENSE
+++ b/LICENSE
@@ -219,11 +219,6 @@ docs/js/vendor/bootstrap.js
 
connector/spark-ganglia-lgpl/src/main/java/com/codahale/metrics/ganglia/GangliaReporter.java
 
 
-Python Software Foundation License
---
-
-python/docs/source/_static/copybutton.js
-
 BSD 3-Clause
 
 
diff --git a/dev/create-release/spark-rm/Dockerfile 
b/dev/create-release/spark-rm/Dockerfile
index 6995928beae..340a57b0c08 100644
--- a/dev/create-release/spark-rm/Dockerfile
+++ b/dev/create-release/spark-rm/Dockerfile
@@ -42,7 +42,7 @@ ARG APT_INSTALL="apt-get install --no-install-recommends -y"
 #   We should use the latest Sphinx version once this is fixed.
 # TODO(SPARK-35375): Jinja2 3.0.0+ causes error when building with Sphinx.
 #   See also https://issues.apache.org/jira/browse/SPARK-35375.
-ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.1.2 numpy==1.20.3 
pydata_sphinx_theme==0.4.1 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 
jinja2==2.11.3 twine==3.4.1 sphinx-plotly-directive==0.1.3 pandas==1.5.3 
pyarrow==3.0.0 plotly==5.4.0 markupsafe==2.0.1 docutils<0.17 grpcio==1.48.1 
protobuf==4.21.6 grpcio-status==1.48.1 googleapis-common-protos==1.56.4"
+ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.1.2 numpy==1.20.3 
pydata_sphinx_theme==0.4.1 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 
jinja2==2.11.3 twine==3.4.1 sphinx-plotly-directive==0.1.3 
sphinx-copybutton==0.5.2 pandas==1.5.3 pyarrow==3.0.0 plotly==5.4.0 
markupsafe==2.0.1 docutils<0.17 grpcio==1.48.1 protobuf==4.21.6 
grpcio-status==1.48.1 googleapis-common-protos==1.56.4"
 ARG GEM_PKGS="bundler:2.3.8"
 
 # Install extra needed repos and refresh.
diff --git a/dev/requirements.txt b/dev/requirements.txt
index c54c5ea770c..8226d88714e 100644
--- a/dev/requirements.txt
+++ b/dev/requirements.txt
@@ -37,6 +37,7 @@ numpydoc
 jinja2<3.0.0
 sphinx<3.1.0
 sphinx-plotly-directive
+sphinx-copybutton<0.5.3
 docutils<0.18.0
 # See SPARK-38279.
 markupsafe==2.0.1
diff --git a/licenses/LICENSE-copybutton.txt b/licenses/LICENSE-copybutton.txt
deleted file mode 100644
index 45be6b83a53..000
--- a/licenses/LICENSE-copybutton.txt
+++ /dev/null
@@ -1,49 +0,0 @@
-PYTHON 

(spark) branch master updated (79ccdfa31e2 -> 1a651753f4e)

2023-11-16 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 79ccdfa31e2 [SPARK-45935][PYTHON][DOCS] Fix RST files link 
substitutions error
 add 1a651753f4e [SPARK-45945][CONNECT] Add a helper function for `parser`

No new revisions were added by this update.

Summary of changes:
 .../sql/connect/planner/SparkConnectPlanner.scala  | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)





(spark) branch branch-3.3 updated: [SPARK-45935][PYTHON][DOCS] Fix RST files link substitutions error

2023-11-16 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 8e4efb287cb [SPARK-45935][PYTHON][DOCS] Fix RST files link 
substitutions error
8e4efb287cb is described below

commit 8e4efb287cb0dcb4317bda7e66143eefe92ec984
Author: panbingkun 
AuthorDate: Thu Nov 16 18:00:56 2023 +0900

[SPARK-45935][PYTHON][DOCS] Fix RST files link substitutions error

### What changes were proposed in this pull request?
The PR aims to fix the link-substitution errors in the RST files.
Target branches: branch-3.3, branch-3.4, branch-3.5, master.

### Why are the changes needed?
When I was reviewing the Python documents, I found that the actual address of
the link was incorrect, e.g.:

https://spark.apache.org/docs/latest/api/python/getting_started/install.html#installing-from-source
https://github.com/apache/spark/assets/15246973/069c1875-1e21-45db-a236-15c27ee7b913

The ref link URL of `Building Spark` should change from
`https://spark.apache.org/docs/3.5.0/#downloading` to
`https://spark.apache.org/docs/3.5.0/building-spark.html`.
We should fix it.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43815 from panbingkun/SPARK-45935.

Authored-by: panbingkun 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 79ccdfa31e282ebe9a82c8f20c703b6ad2ea6bc1)
Signed-off-by: Hyukjin Kwon 
---
 python/docs/source/conf.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/docs/source/conf.py b/python/docs/source/conf.py
index 2e2b5201262..0eb037687cf 100644
--- a/python/docs/source/conf.py
+++ b/python/docs/source/conf.py
@@ -83,9 +83,9 @@ rst_epilog = """
 .. |examples| replace:: Examples
 .. _examples: https://github.com/apache/spark/tree/{0}/examples/src/main/python
 .. |downloading| replace:: Downloading
-.. _downloading: https://spark.apache.org/docs/{1}/building-spark.html
+.. _downloading: https://spark.apache.org/docs/{1}/#downloading
 .. |building_spark| replace:: Building Spark
-.. _building_spark: https://spark.apache.org/docs/{1}/#downloading
+.. _building_spark: https://spark.apache.org/docs/{1}/building-spark.html
 """.format(
 os.environ.get("GIT_HASH", "master"),
 os.environ.get("RELEASE_VERSION", "latest"),





(spark) branch branch-3.4 updated: [SPARK-45935][PYTHON][DOCS] Fix RST files link substitutions error

2023-11-16 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 83439ee526e [SPARK-45935][PYTHON][DOCS] Fix RST files link 
substitutions error
83439ee526e is described below

commit 83439ee526e70c0e053c130588c4c2b39d5e075e
Author: panbingkun 
AuthorDate: Thu Nov 16 18:00:56 2023 +0900

[SPARK-45935][PYTHON][DOCS] Fix RST files link substitutions error

### What changes were proposed in this pull request?
The PR aims to fix the link-substitution errors in the RST files.
Target branches: branch-3.3, branch-3.4, branch-3.5, master.

### Why are the changes needed?
When I was reviewing the Python documents, I found that the actual address of
the link was incorrect, e.g.:

https://spark.apache.org/docs/latest/api/python/getting_started/install.html#installing-from-source
https://github.com/apache/spark/assets/15246973/069c1875-1e21-45db-a236-15c27ee7b913

The ref link URL of `Building Spark` should change from
`https://spark.apache.org/docs/3.5.0/#downloading` to
`https://spark.apache.org/docs/3.5.0/building-spark.html`.
We should fix it.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43815 from panbingkun/SPARK-45935.

Authored-by: panbingkun 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 79ccdfa31e282ebe9a82c8f20c703b6ad2ea6bc1)
Signed-off-by: Hyukjin Kwon 
---
 python/docs/source/conf.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/docs/source/conf.py b/python/docs/source/conf.py
index 38c331048e7..840f6c641cc 100644
--- a/python/docs/source/conf.py
+++ b/python/docs/source/conf.py
@@ -94,9 +94,9 @@ rst_epilog = """
 .. |examples| replace:: Examples
 .. _examples: https://github.com/apache/spark/tree/{0}/examples/src/main/python
 .. |downloading| replace:: Downloading
-.. _downloading: https://spark.apache.org/docs/{1}/building-spark.html
+.. _downloading: https://spark.apache.org/docs/{1}/#downloading
 .. |building_spark| replace:: Building Spark
-.. _building_spark: https://spark.apache.org/docs/{1}/#downloading
+.. _building_spark: https://spark.apache.org/docs/{1}/building-spark.html
 """.format(
 os.environ.get("GIT_HASH", "master"),
 os.environ.get("RELEASE_VERSION", "latest"),





(spark) branch branch-3.5 updated: [SPARK-45935][PYTHON][DOCS] Fix RST files link substitutions error

2023-11-16 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new b962cb26ed2 [SPARK-45935][PYTHON][DOCS] Fix RST files link 
substitutions error
b962cb26ed2 is described below

commit b962cb26ed20d695e408958be452f0a947e7e989
Author: panbingkun 
AuthorDate: Thu Nov 16 18:00:56 2023 +0900

[SPARK-45935][PYTHON][DOCS] Fix RST files link substitutions error

### What changes were proposed in this pull request?
The PR aims to fix the link-substitution errors in the RST files.
Target branches: branch-3.3, branch-3.4, branch-3.5, master.

### Why are the changes needed?
When I was reviewing the Python documents, I found that the actual address of
the link was incorrect, e.g.:

https://spark.apache.org/docs/latest/api/python/getting_started/install.html#installing-from-source
https://github.com/apache/spark/assets/15246973/069c1875-1e21-45db-a236-15c27ee7b913

The ref link URL of `Building Spark` should change from 
`https://spark.apache.org/docs/3.5.0/#downloading` to 
`https://spark.apache.org/docs/3.5.0/building-spark.html`; the two 
substitution targets were swapped, as the diff below shows. We should fix it.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43815 from panbingkun/SPARK-45935.

Authored-by: panbingkun 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 79ccdfa31e282ebe9a82c8f20c703b6ad2ea6bc1)
Signed-off-by: Hyukjin Kwon 
---
 python/docs/source/conf.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/docs/source/conf.py b/python/docs/source/conf.py
index a0d087de176..08a25c5dd07 100644
--- a/python/docs/source/conf.py
+++ b/python/docs/source/conf.py
@@ -98,9 +98,9 @@ rst_epilog = """
 .. |examples| replace:: Examples
 .. _examples: https://github.com/apache/spark/tree/{0}/examples/src/main/python
 .. |downloading| replace:: Downloading
-.. _downloading: https://spark.apache.org/docs/{1}/building-spark.html
+.. _downloading: https://spark.apache.org/docs/{1}/#downloading
 .. |building_spark| replace:: Building Spark
-.. _building_spark: https://spark.apache.org/docs/{1}/#downloading
+.. _building_spark: https://spark.apache.org/docs/{1}/building-spark.html
 """.format(
 os.environ.get("GIT_HASH", "master"),
 os.environ.get("RELEASE_VERSION", "latest"),


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-45935][PYTHON][DOCS] Fix RST files link substitutions error

2023-11-16 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 79ccdfa31e2 [SPARK-45935][PYTHON][DOCS] Fix RST files link 
substitutions error
79ccdfa31e2 is described below

commit 79ccdfa31e282ebe9a82c8f20c703b6ad2ea6bc1
Author: panbingkun 
AuthorDate: Thu Nov 16 18:00:56 2023 +0900

[SPARK-45935][PYTHON][DOCS] Fix RST files link substitutions error

### What changes were proposed in this pull request?
This PR aims to fix a `link substitutions` error in the RST files.
Target branch: branch-3.3, branch-3.4, branch-3.5, master.

### Why are the changes needed?
While reviewing the Python documents, I found that the actual address of 
the link was incorrect, e.g.:

https://spark.apache.org/docs/latest/api/python/getting_started/install.html#installing-from-source
(screenshot: https://github.com/apache/spark/assets/15246973/069c1875-1e21-45db-a236-15c27ee7b913)

The ref link URL of `Building Spark` should change from 
`https://spark.apache.org/docs/3.5.0/#downloading` to 
`https://spark.apache.org/docs/3.5.0/building-spark.html`; the two 
substitution targets were swapped, as the diff below shows. We should fix it.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43815 from panbingkun/SPARK-45935.

Authored-by: panbingkun 
Signed-off-by: Hyukjin Kwon 
---
 python/docs/source/conf.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/docs/source/conf.py b/python/docs/source/conf.py
index 9fd50b6c976..b9884d55b3a 100644
--- a/python/docs/source/conf.py
+++ b/python/docs/source/conf.py
@@ -102,9 +102,9 @@ rst_epilog = """
 .. |examples| replace:: Examples
 .. _examples: https://github.com/apache/spark/tree/{0}/examples/src/main/python
 .. |downloading| replace:: Downloading
-.. _downloading: https://spark.apache.org/docs/{1}/building-spark.html
+.. _downloading: https://spark.apache.org/docs/{1}/#downloading
 .. |building_spark| replace:: Building Spark
-.. _building_spark: https://spark.apache.org/docs/{1}/#downloading
+.. _building_spark: https://spark.apache.org/docs/{1}/building-spark.html
 """.format(
 os.environ.get("GIT_HASH", "master"),
 os.environ.get("RELEASE_VERSION", "latest"),


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-45764][PYTHON][DOCS][3.5] Make code block copyable

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 44bd909ef9e [SPARK-45764][PYTHON][DOCS][3.5] Make code block copyable
44bd909ef9e is described below

commit 44bd909ef9e6f4d5419b5757a265fa9ead001cbb
Author: panbingkun 
AuthorDate: Thu Nov 16 00:52:48 2023 -0800

[SPARK-45764][PYTHON][DOCS][3.5] Make code block copyable

### What changes were proposed in this pull request?
This PR aims to make code blocks `copyable` in the PySpark docs.
Backport of the above to `branch-3.5`.
Master branch PR: https://github.com/apache/spark/pull/43799

### Why are the changes needed?
Improves the usability of the PySpark documentation.

### Does this PR introduce _any_ user-facing change?
Yes, users will be able to easily copy code blocks in the PySpark docs.

### How was this patch tested?
- Manual test.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43827 from panbingkun/branch-3.5_SPARK-45764.

Authored-by: panbingkun 
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml |  2 +-
 LICENSE  |  5 ---
 dev/create-release/spark-rm/Dockerfile   |  2 +-
 dev/requirements.txt |  1 +
 licenses/LICENSE-copybutton.txt  | 49 ---
 python/docs/source/_static/copybutton.js | 67 
 python/docs/source/conf.py   |  7 ++--
 7 files changed, 7 insertions(+), 126 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 674e5950851..f202a7d49c9 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -678,7 +678,7 @@ jobs:
 #   See also https://issues.apache.org/jira/browse/SPARK-35375.
 # Pin the MarkupSafe to 2.0.1 to resolve the CI error.
 #   See also https://issues.apache.org/jira/browse/SPARK-38279.
-python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 
nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 'pyzmq<24.0.0'
+python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 
'sphinx-copybutton==0.5.2' nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 
'pyzmq<24.0.0'
 python3.9 -m pip install ipython_genutils # See SPARK-38517
 python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 
pyarrow pandas 'plotly>=4.8'
 python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421
diff --git a/LICENSE b/LICENSE
index 1735d3208f2..74686d7ffa3 100644
--- a/LICENSE
+++ b/LICENSE
@@ -218,11 +218,6 @@ docs/js/vendor/bootstrap.js
 
connector/spark-ganglia-lgpl/src/main/java/com/codahale/metrics/ganglia/GangliaReporter.java
 
 
-Python Software Foundation License
---
-
-python/docs/source/_static/copybutton.js
-
 BSD 3-Clause
 
 
diff --git a/dev/create-release/spark-rm/Dockerfile 
b/dev/create-release/spark-rm/Dockerfile
index 85155b67bd5..cd57226f5e0 100644
--- a/dev/create-release/spark-rm/Dockerfile
+++ b/dev/create-release/spark-rm/Dockerfile
@@ -42,7 +42,7 @@ ARG APT_INSTALL="apt-get install --no-install-recommends -y"
 #   We should use the latest Sphinx version once this is fixed.
 # TODO(SPARK-35375): Jinja2 3.0.0+ causes error when building with Sphinx.
 #   See also https://issues.apache.org/jira/browse/SPARK-35375.
-ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.1.2 numpy==1.20.3 
pydata_sphinx_theme==0.8.0 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 
jinja2==2.11.3 twine==3.4.1 sphinx-plotly-directive==0.1.3 pandas==1.5.3 
pyarrow==3.0.0 plotly==5.4.0 markupsafe==2.0.1 docutils<0.17 grpcio==1.56.0 
protobuf==4.21.6 grpcio-status==1.56.0 googleapis-common-protos==1.56.4"
+ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.1.2 numpy==1.20.3 
pydata_sphinx_theme==0.8.0 ipython==7.19.0 nbsphinx==0.8.0 numpydoc==1.1.0 
jinja2==2.11.3 twine==3.4.1 sphinx-plotly-directive==0.1.3 
sphinx-copybutton==0.5.2 pandas==1.5.3 pyarrow==3.0.0 plotly==5.4.0 
markupsafe==2.0.1 docutils<0.17 grpcio==1.56.0 protobuf==4.21.6 
grpcio-status==1.56.0 googleapis-common-protos==1.56.4"
 ARG GEM_PKGS="bundler:2.3.8"
 
 # Install extra needed repos and refresh.
diff --git a/dev/requirements.txt b/dev/requirements.txt
index 38a9b244710..597417aba1f 100644
--- a/dev/requirements.txt
+++ b/dev/requirements.txt
@@ -37,6 +37,7 @@ numpydoc
 jinja2<3.0.0
 sphinx<3.1.0
 sphinx-plotly-directive
+sphinx-copybutton<0.5.3
 docutils<0.18.0
 # See SPARK-38279.
 markupsafe==2.0.1
diff --git a/licenses/LICENSE-copybutton.txt b/licenses/LICENSE-copybutton.txt
deleted file mode 100644
index 45be6b83a53..000
--- a/licenses/LICENSE-copybutton.txt
+++ /dev/null
@@ -1,49 +0,0 @@
-PYTHON 
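
The python/docs/source/conf.py hunk for this change is cut off above. For reference, enabling sphinx-copybutton generally amounts to registering the extension in conf.py, roughly as below (a sketch of the typical setup, not the exact lines from this commit):

# Typical sphinx-copybutton setup in a Sphinx conf.py (sketch, not this
# commit's exact diff): register the extension and optionally strip REPL
# prompts so copied snippets paste cleanly.
extensions = [
    "sphinx_copybutton",
    # ... plus the other extensions the project already uses
]
copybutton_prompt_text = ">>> "  # drop the ">>> " prompt from copied text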

(spark) branch master updated: [SPARK-45920][SQL] group by ordinal should be idempotent

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new daebb5996e2 [SPARK-45920][SQL] group by ordinal should be idempotent
daebb5996e2 is described below

commit daebb5996e20e831220b9a9cd69fb4cd23e53c7e
Author: Wenchen Fan 
AuthorDate: Thu Nov 16 00:45:04 2023 -0800

[SPARK-45920][SQL] group by ordinal should be idempotent

### What changes were proposed in this pull request?

GROUP BY ordinal is not idempotent today. If the ordinal points to another 
integer literal and the plan gets analyzed again, we will redo the ordinal 
resolution, which can lead to a wrong result or an index out-of-bounds error. 
This PR fixes it with a hack: if the ordinal points to another integer literal, 
don't replace the ordinal.
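
A minimal PySpark illustration of the failure mode being guarded against (hypothetical query, not taken from the PR):

# With spark.sql.groupByOrdinal=true (the default), "GROUP BY 1" resolves to
# the first item of the SELECT list. Here that item is itself an integer
# literal, so one round of analysis turns the grouping expression into
# Literal(100). If the already-resolved plan were analyzed again, 100 would be
# misread as ordinal 100 and fail with an out-of-bounds error; the fix keeps
# the original ordinal literal instead of replacing it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("SELECT 100 AS a, count(*) AS cnt FROM range(10) GROUP BY 1").show()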

### Why are the changes needed?

Advanced users and Spark plugins may manipulate logical plans directly, so we 
need to make the framework more reliable.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #43797 from cloud-fan/group.

Authored-by: Wenchen Fan 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/catalyst/analysis/Analyzer.scala | 14 -
 .../SubstituteUnresolvedOrdinalsSuite.scala| 23 --
 .../analyzer-results/group-by-ordinal.sql.out  |  2 +-
 3 files changed, 35 insertions(+), 4 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 780edf5d8af..14c8b740f68 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -2002,7 +2002,19 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
   throw 
QueryCompilationErrors.groupByPositionRefersToAggregateFunctionError(
 index, ordinalExpr)
 } else {
-  ordinalExpr
+  trimAliases(ordinalExpr) match {
+// HACK ALERT: If the ordinal expression is also an integer 
literal, don't use it
+// but still keep the ordinal literal. The reason 
is we may repeatedly
+// analyze the plan. Using a different integer 
literal may lead to
+// a repeat GROUP BY ordinal resolution which is 
wrong. GROUP BY
+// constant is meaningless so whatever value does 
not matter here.
+// TODO: (SPARK-45932) GROUP BY ordinal should pull out 
grouping expressions to
+//   a Project, then the resolved ordinal expression is 
always
+//   `AttributeReference`.
+case Literal(_: Int, IntegerType) =>
+  Literal(index)
+case _ => ordinalExpr
+  }
 }
   } else {
 throw QueryCompilationErrors.groupByPositionRangeError(index, 
aggs.size)
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinalsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinalsSuite.scala
index b0d7ace646e..953b2c8bb10 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinalsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinalsSuite.scala
@@ -17,10 +17,11 @@
 
 package org.apache.spark.sql.catalyst.analysis
 
-import org.apache.spark.sql.catalyst.analysis.TestRelations.testRelation2
+import org.apache.spark.sql.catalyst.analysis.TestRelations.{testRelation, 
testRelation2}
 import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.dsl.plans._
-import org.apache.spark.sql.catalyst.expressions.Literal
+import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, Literal}
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
 import org.apache.spark.sql.internal.SQLConf
 
 class SubstituteUnresolvedOrdinalsSuite extends AnalysisTest {
@@ -67,4 +68,22 @@ class SubstituteUnresolvedOrdinalsSuite extends AnalysisTest 
{
 testRelation2.groupBy(Literal(1), Literal(2))($"a", $"b"))
 }
   }
+
+  test("SPARK-45920: group by ordinal repeated analysis") {
+val plan = testRelation.groupBy(Literal(1))(Literal(100).as("a")).analyze
+comparePlans(
+  plan,
+  testRelation.groupBy(Literal(1))(Literal(100).as("a"))
+   

(spark) branch master updated: [SPARK-45948][K8S] Make single-pod spark jobs respect `spark.app.id`

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 17f84358fbf [SPARK-45948][K8S] Make single-pod spark jobs respect 
`spark.app.id`
17f84358fbf is described below

commit 17f84358fbfb39ab048862e92ac0562fe7443ca1
Author: Dongjoon Hyun 
AuthorDate: Thu Nov 16 00:31:39 2023 -0800

[SPARK-45948][K8S] Make single-pod spark jobs respect `spark.app.id`

### What changes were proposed in this pull request?

This PR aims to make single-pod Spark jobs respect `spark.app.id` in the K8s 
environment.

### Why are the changes needed?

Since Apache Spark 3.4.0, SPARK-42190 has allowed users to run single-pod Spark 
jobs in the K8s environment by utilizing `LocalSchedulerBackend` in the driver pod. 
However, `LocalSchedulerBackend` doesn't respect `spark.app.id` while 
`KubernetesClusterSchedulerBackend` does. This PR aims to improve the K8s UX by 
reducing the behavior difference between single-pod and multi-pod 
Spark jobs in the K8s environment.
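
A sketch of the user-visible effect (assumed application id, not from the PR):

# Sketch: pinning the application id. Under the single-pod K8s deployment
# described above, the driver now reports the configured id instead of an
# auto-generated "local-<timestamp>" one; plain local mode is unchanged.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("demo").set("spark.app.id", "spark-pi-0001")
sc = SparkContext(conf=conf)
print(sc.applicationId)  # "spark-pi-0001" in the single-pod K8s setup
sc.stop()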

### Does this PR introduce _any_ user-facing change?

Yes, but it's more consistent with the existing general K8s jobs.

### How was this patch tested?

Pass the CIs with the newly added test case.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43833 from dongjoon-hyun/SPARK-45948.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../cluster/k8s/KubernetesClusterManager.scala  |  5 -
 .../cluster/k8s/KubernetesClusterManagerSuite.scala | 21 +
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala
index ec5cce239ef..3235d922204 100644
--- 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala
@@ -63,7 +63,10 @@ private[spark] class KubernetesClusterManager extends 
ExternalClusterManager wit
   }
   logInfo(s"Running Spark with 
${sc.conf.get(KUBERNETES_DRIVER_MASTER_URL)}")
   val schedulerImpl = scheduler.asInstanceOf[TaskSchedulerImpl]
-  val backend = new LocalSchedulerBackend(sc.conf, schedulerImpl, 
threadCount)
+  // KubernetesClusterSchedulerBackend respects `spark.app.id` while 
LocalSchedulerBackend
+  // does not. Propagate `spark.app.id` via `spark.test.appId` to match 
the behavior.
+  val conf = 
sc.conf.getOption("spark.app.id").map(sc.conf.set("spark.test.appId", _))
+  val backend = new LocalSchedulerBackend(conf.getOrElse(sc.conf), 
schedulerImpl, threadCount)
   schedulerImpl.initialize(backend)
   return backend
 }
diff --git 
a/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManagerSuite.scala
 
b/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManagerSuite.scala
index 8f999a4cfe8..07410b6a7b7 100644
--- 
a/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManagerSuite.scala
+++ 
b/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManagerSuite.scala
@@ -20,10 +20,13 @@ import io.fabric8.kubernetes.client.KubernetesClient
 import org.mockito.{Mock, MockitoAnnotations}
 import org.mockito.Mockito.when
 import org.scalatest.BeforeAndAfter
+import org.scalatestplus.mockito.MockitoSugar.mock
 
 import org.apache.spark._
 import org.apache.spark.deploy.k8s.Config._
 import org.apache.spark.internal.config._
+import org.apache.spark.scheduler.TaskSchedulerImpl
+import org.apache.spark.scheduler.local.LocalSchedulerBackend
 
 class KubernetesClusterManagerSuite extends SparkFunSuite with BeforeAndAfter {
 
@@ -59,4 +62,22 @@ class KubernetesClusterManagerSuite extends SparkFunSuite 
with BeforeAndAfter {
   manager.makeExecutorPodsAllocator(sc, kubernetesClient, null)
 }
   }
+
+  test("SPARK-45948: Single-pod Spark jobs respect spark.app.id") {
+val conf = new SparkConf()
+conf.set(KUBERNETES_DRIVER_MASTER_URL, "local[2]")
+when(sc.conf).thenReturn(conf)
+val scheduler = mock[TaskSchedulerImpl]
+when(scheduler.sc).thenReturn(sc)
+val manager = new KubernetesClusterManager()
+
+val backend1 = manager.createSchedulerBackend(sc, "", scheduler)
+assert(backend1.isInstanceOf[LocalSchedulerBackend])
+assert(backend1.applicationId().startsWith("local-"))

(spark) branch master updated: [SPARK-45919][CORE][SQL] Use Java 16 `record` to simplify Java class definition

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 30c9c8dd9fe [SPARK-45919][CORE][SQL] Use Java 16 `record` to simplify 
Java class definition
30c9c8dd9fe is described below

commit 30c9c8dd9fe03eaa85ecf192c977e7645987c653
Author: yangjie01 
AuthorDate: Wed Nov 15 23:59:17 2023 -0800

[SPARK-45919][CORE][SQL] Use Java 16 `record` to simplify Java class 
definition

### What changes were proposed in this pull request?
This PR uses the `record` keyword introduced by [JEP 
395](https://openjdk.org/jeps/395) to simplify Java class definitions.

### Why are the changes needed?
It uses a new feature introduced in Java 16 to simplify Java class 
definitions.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43796 from LuciferYang/class-2-record.

Lead-authored-by: yangjie01 
Co-authored-by: YangJie 
Signed-off-by: Dongjoon Hyun 
---
 .../network/client/TransportResponseHandler.java |  6 +++---
 .../org/apache/spark/network/crypto/AuthEngine.java  | 14 +++---
 .../org/apache/spark/network/crypto/AuthMessage.java | 12 +---
 .../apache/spark/network/crypto/AuthRpcHandler.java  | 10 +-
 .../apache/spark/network/protocol/StreamChunkId.java |  9 +
 .../network/server/ChunkFetchRequestHandler.java |  8 
 .../apache/spark/network/RpcIntegrationSuite.java| 13 ++---
 .../apache/spark/network/crypto/AuthEngineSuite.java | 12 ++--
 .../spark/network/crypto/AuthMessagesSuite.java  |  6 +++---
 .../shuffle/ExternalShuffleBlockResolver.java|  4 ++--
 .../network/shuffle/RemoteBlockPushResolver.java |  2 +-
 .../spark/network/shuffle/ShuffleIndexRecord.java| 18 +-
 .../network/shuffle/ShuffleTransportContext.java | 10 +-
 .../shuffle/ShuffleIndexInformationSuite.java|  8 
 .../shuffle/ShuffleTransportContextSuite.java|  2 +-
 .../network/yarn/YarnShuffleServiceMetrics.java  | 20 +---
 .../connector/expressions/aggregate/Aggregation.java | 15 +++
 .../datasources/parquet/ParquetReadState.java|  9 +
 18 files changed, 47 insertions(+), 131 deletions(-)

diff --git 
a/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java
 
b/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java
index a19767ae201..cf9af2e00c8 100644
--- 
a/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java
+++ 
b/common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java
@@ -108,7 +108,7 @@ public class TransportResponseHandler extends 
MessageHandler {
   private void failOutstandingRequests(Throwable cause) {
 for (Map.Entry entry : 
outstandingFetches.entrySet()) {
   try {
-entry.getValue().onFailure(entry.getKey().chunkIndex, cause);
+entry.getValue().onFailure(entry.getKey().chunkIndex(), cause);
   } catch (Exception e) {
 logger.warn("ChunkReceivedCallback.onFailure throws exception", e);
   }
@@ -169,7 +169,7 @@ public class TransportResponseHandler extends 
MessageHandler {
 resp.body().release();
   } else {
 outstandingFetches.remove(resp.streamChunkId);
-listener.onSuccess(resp.streamChunkId.chunkIndex, resp.body());
+listener.onSuccess(resp.streamChunkId.chunkIndex(), resp.body());
 resp.body().release();
   }
 } else if (message instanceof ChunkFetchFailure) {
@@ -180,7 +180,7 @@ public class TransportResponseHandler extends 
MessageHandler {
   resp.streamChunkId, getRemoteAddress(channel), resp.errorString);
   } else {
 outstandingFetches.remove(resp.streamChunkId);
-listener.onFailure(resp.streamChunkId.chunkIndex, new 
ChunkFetchFailureException(
+listener.onFailure(resp.streamChunkId.chunkIndex(), new 
ChunkFetchFailureException(
   "Failure while fetching " + resp.streamChunkId + ": " + 
resp.errorString));
   }
 } else if (message instanceof RpcResponse) {
diff --git 
a/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthEngine.java
 
b/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthEngine.java
index 078d9ceb317..7ca4bc40a86 100644
--- 
a/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthEngine.java
+++ 
b/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthEngine.java
@@ -118,20 +118,20 @@ class AuthEngine implements Closeable {
   

(spark) branch master updated: [SPARK-45949][INFRA] Upgrade `pyarrow` to 14

2023-11-16 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 352c55178e5 [SPARK-45949][INFRA] Upgrade `pyarrow` to 14
352c55178e5 is described below

commit 352c55178e51d0008bcb96f089623ecd94743841
Author: Ruifeng Zheng 
AuthorDate: Wed Nov 15 23:58:01 2023 -0800

[SPARK-45949][INFRA] Upgrade `pyarrow` to 14

### What changes were proposed in this pull request?
Upgrade `pyarrow` to 14

### Why are the changes needed?
To test with the latest version of `pyarrow`.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #43829 from zhengruifeng/infra_pyarrow_14.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 dev/infra/Dockerfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile
index b433faa14c8..8d12f00a034 100644
--- a/dev/infra/Dockerfile
+++ b/dev/infra/Dockerfile
@@ -85,7 +85,7 @@ RUN Rscript -e "devtools::install_version('roxygen2', 
version='7.2.0', repos='ht
 ENV R_LIBS_SITE 
"/usr/local/lib/R/site-library:${R_LIBS_SITE}:/usr/lib/R/library"
 
 RUN pypy3 -m pip install numpy 'pandas<=2.1.3' scipy coverage matplotlib
-RUN python3.9 -m pip install numpy pyarrow 'pandas<=2.1.3' scipy 
unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 
'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
+RUN python3.9 -m pip install numpy 'pyarrow>=14.0.0' 'pandas<=2.1.3' scipy 
unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 
'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
 
 # Add Python deps for Spark Connect.
 RUN python3.9 -m pip install 'grpcio>=1.48,<1.57' 'grpcio-status>=1.48,<1.57' 
'protobuf==3.20.3' 'googleapis-common-protos==1.56.4'


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org