[spark-docker] branch master updated: [SPARK-42494] Add official image Dockerfile for Spark v3.3.2

2023-02-20 Thread yikun
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-docker.git


The following commit(s) were added to refs/heads/master by this push:
 new e8f5b0a  [SPARK-42494] Add official image Dockerfile for Spark v3.3.2
e8f5b0a is described below

commit e8f5b0a1151c349d9c7fdb09cf76300b42a6946b
Author: Yikun Jiang 
AuthorDate: Tue Feb 21 14:22:19 2023 +0800

[SPARK-42494] Add official image Dockerfile for Spark v3.3.2

### What changes were proposed in this pull request?
Add Apache Spark 3.3.2 Dockerfiles.
- Add the 3.3.2 GPG key
- Add .github/workflows/build_3.3.2.yaml
- Run ./add-dockerfiles.sh 3.3.2

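For context, the per-version directories are generated from shared templates rather than written by hand (note `entrypoint.sh.template` and `tools/template.py` in the file list below). A hypothetical sketch of such a templating step — not the actual `tools/template.py`, assuming a Jinja2 template named `Dockerfile.template` and the variable names shown — could look like:

```
# Hypothetical illustration only; template name, variables, and output layout
# are assumptions, not the real spark-docker tooling.
from pathlib import Path

from jinja2 import Environment, FileSystemLoader


def render_dockerfile(spark_version: str, scala_version: str, java_version: str) -> None:
    env = Environment(loader=FileSystemLoader("."))
    template = env.get_template("Dockerfile.template")
    content = template.render(
        SPARK_VERSION=spark_version,
        SCALA_VERSION=scala_version,
        JAVA_VERSION=java_version,
    )
    # e.g. 3.3.2/scala2.12-java11-ubuntu/Dockerfile
    out_dir = Path(spark_version) / f"scala{scala_version}-java{java_version}-ubuntu"
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "Dockerfile").write_text(content)


if __name__ == "__main__":
    render_dockerfile("3.3.2", "2.12", "11")
```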
### Why are the changes needed?
Apache Spark 3.3.2 has been released.

https://lists.apache.org/thread/k8skf16wyn6rg9n0vd0t6l3bhw7c9svq

### Does this PR introduce _any_ user-facing change?
Yes: a new image will be published in the future (after the DOI review).

### How was this patch tested?
Added the workflow, and CI passed.

Closes #30 from Yikun/SPARK-42494.

Authored-by: Yikun Jiang 
Signed-off-by: Yikun Jiang 
---
 .github/workflows/build_3.3.2.yaml | 43 +++
 3.3.2/scala2.12-java11-python3-r-ubuntu/Dockerfile | 86 ++
 .../entrypoint.sh  |  0
 3.3.2/scala2.12-java11-python3-ubuntu/Dockerfile   | 83 +
 .../scala2.12-java11-python3-ubuntu/entrypoint.sh  |  0
 3.3.2/scala2.12-java11-r-ubuntu/Dockerfile | 82 +
 .../scala2.12-java11-r-ubuntu/entrypoint.sh|  7 --
 3.3.2/scala2.12-java11-ubuntu/Dockerfile   | 79 
 .../scala2.12-java11-ubuntu/entrypoint.sh  |  7 --
 add-dockerfiles.sh |  2 +-
 entrypoint.sh.template |  2 +
 tools/template.py  |  2 +
 12 files changed, 378 insertions(+), 15 deletions(-)

diff --git a/.github/workflows/build_3.3.2.yaml 
b/.github/workflows/build_3.3.2.yaml
new file mode 100644
index 000..9ae1a13
--- /dev/null
+++ b/.github/workflows/build_3.3.2.yaml
@@ -0,0 +1,43 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+name: "Build and Test (3.3.2)"
+
+on:
+  pull_request:
+    branches:
+      - 'master'
+    paths:
+      - '3.3.2/**'
+      - '.github/workflows/build_3.3.2.yaml'
+      - '.github/workflows/main.yml'
+
+jobs:
+  run-build:
+    strategy:
+      matrix:
+        image-type: ["all", "python", "scala", "r"]
+    name: Run
+    secrets: inherit
+    uses: ./.github/workflows/main.yml
+    with:
+      spark: 3.3.2
+      scala: 2.12
+      java: 11
+      image-type: ${{ matrix.image-type }}
diff --git a/3.3.2/scala2.12-java11-python3-r-ubuntu/Dockerfile 
b/3.3.2/scala2.12-java11-python3-r-ubuntu/Dockerfile
new file mode 100644
index 000..b518021
--- /dev/null
+++ b/3.3.2/scala2.12-java11-python3-r-ubuntu/Dockerfile
@@ -0,0 +1,86 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+FROM eclipse-temurin:11-jre-focal
+
+ARG spark_uid=185
+
+RUN groupadd --system --gid=${spark_uid} spark && \
+    useradd --system --uid=${spark_uid} --gid=spark spark
+
+RUN set -ex && \
+    apt-get update && \
+    ln -s /lib /lib64 && \
+    apt install -y gnupg2 wget bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools gosu && \
+    apt install -y python3 py

[spark] 01/01: Preparing development version 3.4.1-SNAPSHOT

2023-02-20 Thread xinrong
This is an automated email from the ASF dual-hosted git repository.

xinrong pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git

commit 1dfa58d78eba7080a244945c23f7b35b62dde12b
Author: Xinrong Meng 
AuthorDate: Tue Feb 21 02:43:10 2023 +

Preparing development version 3.4.1-SNAPSHOT
---
 R/pkg/DESCRIPTION  | 2 +-
 assembly/pom.xml   | 2 +-
 common/kvstore/pom.xml | 2 +-
 common/network-common/pom.xml  | 2 +-
 common/network-shuffle/pom.xml | 2 +-
 common/network-yarn/pom.xml| 2 +-
 common/sketch/pom.xml  | 2 +-
 common/tags/pom.xml| 2 +-
 common/unsafe/pom.xml  | 2 +-
 connector/avro/pom.xml | 2 +-
 connector/connect/client/jvm/pom.xml   | 2 +-
 connector/connect/common/pom.xml   | 2 +-
 connector/connect/server/pom.xml   | 2 +-
 connector/docker-integration-tests/pom.xml | 2 +-
 connector/kafka-0-10-assembly/pom.xml  | 2 +-
 connector/kafka-0-10-sql/pom.xml   | 2 +-
 connector/kafka-0-10-token-provider/pom.xml| 2 +-
 connector/kafka-0-10/pom.xml   | 2 +-
 connector/kinesis-asl-assembly/pom.xml | 2 +-
 connector/kinesis-asl/pom.xml  | 2 +-
 connector/protobuf/pom.xml | 2 +-
 connector/spark-ganglia-lgpl/pom.xml   | 2 +-
 core/pom.xml   | 2 +-
 docs/_config.yml   | 6 +++---
 examples/pom.xml   | 2 +-
 graphx/pom.xml | 2 +-
 hadoop-cloud/pom.xml   | 2 +-
 launcher/pom.xml   | 2 +-
 mllib-local/pom.xml| 2 +-
 mllib/pom.xml  | 2 +-
 pom.xml| 2 +-
 python/pyspark/version.py  | 2 +-
 repl/pom.xml   | 2 +-
 resource-managers/kubernetes/core/pom.xml  | 2 +-
 resource-managers/kubernetes/integration-tests/pom.xml | 2 +-
 resource-managers/mesos/pom.xml| 2 +-
 resource-managers/yarn/pom.xml | 2 +-
 sql/catalyst/pom.xml   | 2 +-
 sql/core/pom.xml   | 2 +-
 sql/hive-thriftserver/pom.xml  | 2 +-
 sql/hive/pom.xml   | 2 +-
 streaming/pom.xml  | 2 +-
 tools/pom.xml  | 2 +-
 43 files changed, 45 insertions(+), 45 deletions(-)

diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION
index 4a32762b34c..fa7028630a8 100644
--- a/R/pkg/DESCRIPTION
+++ b/R/pkg/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: SparkR
 Type: Package
-Version: 3.4.0
+Version: 3.4.1
 Title: R Front End for 'Apache Spark'
 Description: Provides an R Front end for 'Apache Spark' 
.
 Authors@R:
diff --git a/assembly/pom.xml b/assembly/pom.xml
index 58dd9ef46e0..a4111eb64d9 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.4.0
+3.4.1-SNAPSHOT
 ../pom.xml
   
 
diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml
index 95ea15552da..f9ecfb3d692 100644
--- a/common/kvstore/pom.xml
+++ b/common/kvstore/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.4.0
+3.4.1-SNAPSHOT
 ../../pom.xml
   
 
diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml
index e4d98471bf9..22ee65b7d25 100644
--- a/common/network-common/pom.xml
+++ b/common/network-common/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.4.0
+3.4.1-SNAPSHOT
 ../../pom.xml
   
 
diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml
index 7a6d5aedf65..2c67da81ca4 100644
--- a/common/network-shuffle/pom.xml
+++ b/common/network-shuffle/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.4.0
+3.4.1-SNAPSHOT
 ../../pom.xml
   
 
diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml
index 1c421754083..219682e047d 100644
--- a/common/network-yarn/pom.xml
+++ b/common/network-yarn/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.4.0
+3.4.1-SNAPSHOT
 ../../pom.xml
   
 
diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml
index 2ee25ebfffc..22ce7

[spark] branch branch-3.4 updated (4560d4c4f75 -> 1dfa58d78eb)

2023-02-20 Thread xinrong
This is an automated email from the ASF dual-hosted git repository.

xinrong pushed a change to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


from 4560d4c4f75 [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak 
as a workaround for PARQUET-2160
 add 81d39dcf742 Preparing Spark release v3.4.0-rc1
 new 1dfa58d78eb Preparing development version 3.4.1-SNAPSHOT

The 1 revision listed above as "new" is entirely new to this
repository and will be described in a separate email.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] tag v3.4.0-rc1 created (now 81d39dcf742)

2023-02-20 Thread xinrong
This is an automated email from the ASF dual-hosted git repository.

xinrong pushed a change to tag v3.4.0-rc1
in repository https://gitbox.apache.org/repos/asf/spark.git


  at 81d39dcf742 (commit)
This tag includes the following new commits:

 new 81d39dcf742 Preparing Spark release v3.4.0-rc1

The 1 revision listed above as "new" is entirely new to this
repository and will be described in a separate email.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.



-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] 01/01: Preparing Spark release v3.4.0-rc1

2023-02-20 Thread xinrong
This is an automated email from the ASF dual-hosted git repository.

xinrong pushed a commit to tag v3.4.0-rc1
in repository https://gitbox.apache.org/repos/asf/spark.git

commit 81d39dcf742ed7114d6e01ecc2487825651e30cb
Author: Xinrong Meng 
AuthorDate: Tue Feb 21 02:43:05 2023 +

Preparing Spark release v3.4.0-rc1
---
 R/pkg/DESCRIPTION  | 2 +-
 assembly/pom.xml   | 2 +-
 common/kvstore/pom.xml | 2 +-
 common/network-common/pom.xml  | 2 +-
 common/network-shuffle/pom.xml | 2 +-
 common/network-yarn/pom.xml| 2 +-
 common/sketch/pom.xml  | 2 +-
 common/tags/pom.xml| 2 +-
 common/unsafe/pom.xml  | 2 +-
 connector/avro/pom.xml | 2 +-
 connector/connect/client/jvm/pom.xml   | 2 +-
 connector/connect/common/pom.xml   | 2 +-
 connector/connect/server/pom.xml   | 2 +-
 connector/docker-integration-tests/pom.xml | 2 +-
 connector/kafka-0-10-assembly/pom.xml  | 2 +-
 connector/kafka-0-10-sql/pom.xml   | 2 +-
 connector/kafka-0-10-token-provider/pom.xml| 2 +-
 connector/kafka-0-10/pom.xml   | 2 +-
 connector/kinesis-asl-assembly/pom.xml | 2 +-
 connector/kinesis-asl/pom.xml  | 2 +-
 connector/protobuf/pom.xml | 2 +-
 connector/spark-ganglia-lgpl/pom.xml   | 2 +-
 core/pom.xml   | 2 +-
 docs/_config.yml   | 6 +++---
 examples/pom.xml   | 2 +-
 graphx/pom.xml | 2 +-
 hadoop-cloud/pom.xml   | 2 +-
 launcher/pom.xml   | 2 +-
 mllib-local/pom.xml| 2 +-
 mllib/pom.xml  | 2 +-
 pom.xml| 2 +-
 python/pyspark/version.py  | 2 +-
 repl/pom.xml   | 2 +-
 resource-managers/kubernetes/core/pom.xml  | 2 +-
 resource-managers/kubernetes/integration-tests/pom.xml | 2 +-
 resource-managers/mesos/pom.xml| 2 +-
 resource-managers/yarn/pom.xml | 2 +-
 sql/catalyst/pom.xml   | 2 +-
 sql/core/pom.xml   | 2 +-
 sql/hive-thriftserver/pom.xml  | 2 +-
 sql/hive/pom.xml   | 2 +-
 streaming/pom.xml  | 2 +-
 tools/pom.xml  | 2 +-
 43 files changed, 45 insertions(+), 45 deletions(-)

diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION
index fa7028630a8..4a32762b34c 100644
--- a/R/pkg/DESCRIPTION
+++ b/R/pkg/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: SparkR
 Type: Package
-Version: 3.4.1
+Version: 3.4.0
 Title: R Front End for 'Apache Spark'
 Description: Provides an R Front end for 'Apache Spark' 
.
 Authors@R:
diff --git a/assembly/pom.xml b/assembly/pom.xml
index a4111eb64d9..58dd9ef46e0 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.4.1-SNAPSHOT
+3.4.0
 ../pom.xml
   
 
diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml
index f9ecfb3d692..95ea15552da 100644
--- a/common/kvstore/pom.xml
+++ b/common/kvstore/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.4.1-SNAPSHOT
+3.4.0
 ../../pom.xml
   
 
diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml
index 22ee65b7d25..e4d98471bf9 100644
--- a/common/network-common/pom.xml
+++ b/common/network-common/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.4.1-SNAPSHOT
+3.4.0
 ../../pom.xml
   
 
diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml
index 2c67da81ca4..7a6d5aedf65 100644
--- a/common/network-shuffle/pom.xml
+++ b/common/network-shuffle/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.4.1-SNAPSHOT
+3.4.0
 ../../pom.xml
   
 
diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml
index 219682e047d..1c421754083 100644
--- a/common/network-yarn/pom.xml
+++ b/common/network-yarn/pom.xml
@@ -22,7 +22,7 @@
   
 org.apache.spark
 spark-parent_2.12
-3.4.1-SNAPSHOT
+3.4.0
 ../../pom.xml
   
 
diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml
index 22ce78c6fd2..2ee25ebfffc 100644

svn commit: r60229 - /dev/spark/v3.4.0-rc1-bin/

2023-02-20 Thread xinrong
Author: xinrong
Date: Tue Feb 21 00:44:12 2023
New Revision: 60229

Log:
Removing RC artifacts.

Removed:
dev/spark/v3.4.0-rc1-bin/


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] dongjoon-hyun closed pull request #441: replace docs/latest/">Latest Release (Spark 3.3.2) use jekyll

2023-02-20 Thread via GitHub


dongjoon-hyun closed pull request #441: replace docs/latest/">Latest Release 
(Spark 3.3.2) use jekyll
URL: https://github.com/apache/spark-website/pull/441


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] bjornjorgensen closed pull request #440: replace `docs/latest/">Latest Release (Spark 3.3.2)`

2023-02-20 Thread via GitHub


bjornjorgensen closed pull request #440: replace `docs/latest/">Latest Release 
(Spark 3.3.2)`
URL: https://github.com/apache/spark-website/pull/440


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] bjornjorgensen commented on pull request #440: replace `docs/latest/">Latest Release (Spark 3.3.2)`

2023-02-20 Thread via GitHub


bjornjorgensen commented on PR #440:
URL: https://github.com/apache/spark-website/pull/440#issuecomment-1437517506

   OK, new PR so I close this one. 
   Thanks. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] bjornjorgensen opened a new pull request, #441: use jekyll

2023-02-20 Thread via GitHub


bjornjorgensen opened a new pull request, #441:
URL: https://github.com/apache/spark-website/pull/441

   
   
   
   
   Regenerating: 1 file(s) changed at 2023-02-20 21:24:22
   _layouts/global.html
   ...done in 3.91438375 seconds.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] dongjoon-hyun commented on pull request #440: replace `docs/latest/">Latest Release (Spark 3.3.2)`

2023-02-20 Thread via GitHub


dongjoon-hyun commented on PR #440:
URL: https://github.com/apache/spark-website/pull/440#issuecomment-1437484502

   Got it, but you should not do it like that, @bjornjorgensen.
   > Find and replace in vs code
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] bjornjorgensen commented on pull request #440: replace `docs/latest/">Latest Release (Spark 3.3.2)`

2023-02-20 Thread via GitHub


bjornjorgensen commented on PR #440:
URL: https://github.com/apache/spark-website/pull/440#issuecomment-1437483014

   Find and replace in vs code


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] bjornjorgensen commented on pull request #440: replace `docs/latest/">Latest Release (Spark 3.3.2)`

2023-02-20 Thread via GitHub


bjornjorgensen commented on PR #440:
URL: https://github.com/apache/spark-website/pull/440#issuecomment-1437453786

   The last one was for changing `Latest Release (Spark 3.3.1)` to `Latest Release 
(Spark 3.3.2)`.
   Then, after the vote, I asked if that one change was OK, and it was just 
merged. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] bjornjorgensen commented on pull request #439: Update doc. version on index site.

2023-02-20 Thread via GitHub


bjornjorgensen commented on PR #439:
URL: https://github.com/apache/spark-website/pull/439#issuecomment-1437450626

   @dongjoon-hyun I opened a new PR. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] bjornjorgensen opened a new pull request, #440: replace docs/latest/">Latest Release (Spark 3.3.2)

2023-02-20 Thread via GitHub


bjornjorgensen opened a new pull request, #440:
URL: https://github.com/apache/spark-website/pull/440

   
   
   
   
   
   replace docs/latest/">Latest Release (Spark 3.3.2) -> replace 
docs/latest/">Latest Release


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] dongjoon-hyun commented on pull request #439: Update doc. version on index site.

2023-02-20 Thread via GitHub


dongjoon-hyun commented on PR #439:
URL: https://github.com/apache/spark-website/pull/439#issuecomment-1437445583

   What do you mean, @bjornjorgensen ?
   > yes, but should I do the 443 others too?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] bjornjorgensen commented on pull request #439: Update doc. version on index site.

2023-02-20 Thread via GitHub


bjornjorgensen commented on PR #439:
URL: https://github.com/apache/spark-website/pull/439#issuecomment-1437433700

   yes, but should I do the 443 others too? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] viirya commented on pull request #439: Update doc. version on index site.

2023-02-20 Thread via GitHub


viirya commented on PR #439:
URL: https://github.com/apache/spark-website/pull/439#issuecomment-1437374882

   Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



svn commit: r60226 - /dev/spark/v3.2.3-rc1-docs/

2023-02-20 Thread dongjoon
Author: dongjoon
Date: Mon Feb 20 17:49:59 2023
New Revision: 60226

Log:
Remove after releasing

Removed:
dev/spark/v3.2.3-rc1-docs/


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[GitHub] [spark-website] dongjoon-hyun commented on pull request #439: Update doc. version on index site.

2023-02-20 Thread via GitHub


dongjoon-hyun commented on PR #439:
URL: https://github.com/apache/spark-website/pull/439#issuecomment-1437367917

   Thank you all!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.4 updated: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160

2023-02-20 Thread sunchao
This is an automated email from the ASF dual-hosted git repository.

sunchao pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 4560d4c4f75 [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak 
as a workaround for PARQUET-2160
4560d4c4f75 is described below

commit 4560d4c4f758abb87631195f60d9c293e3d5b2a1
Author: Cheng Pan 
AuthorDate: Mon Feb 20 09:40:44 2023 -0800

[SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround 
for PARQUET-2160

### What changes were proposed in this pull request?

SPARK-41952 was raised a while ago, but unfortunately the Parquet
community has not published a patched version yet. As a workaround, we can fix
the issue on the Spark side first.

We encountered this memory issue when migrating data from parquet/snappy to
parquet/zstd: Spark executors always occupy an unreasonable amount of off-heap
memory and have a high risk of being killed by the NodeManager (NM).

See more discussions at https://github.com/apache/parquet-mr/pull/982 and 
https://github.com/apache/iceberg/pull/5681

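As a rough illustration of the migration scenario above, a minimal PySpark sketch (paths, app name, and the overhead value are illustrative assumptions, not taken from this patch):

```
from pyspark.sql import SparkSession

# Illustrative only: paths and config values are assumptions.
spark = (
    SparkSession.builder
    .appName("parquet-zstd-migration")
    # Raising the overhead only postpones the NM kill; the leak itself is in
    # the zstd read path that PARQUET-2160 / this workaround addresses.
    .config("spark.executor.memoryOverhead", "2g")
    .getOrCreate()
)

# Rewrite an existing snappy-compressed dataset with zstd compression.
spark.read.parquet("/warehouse/events_snappy") \
    .write.option("compression", "zstd").parquet("/warehouse/events_zstd")

# Scanning the zstd copy is the read path where the off-heap growth shows up.
spark.read.parquet("/warehouse/events_zstd").createOrReplaceTempView("parquet_zstd_table")
spark.sql("SELECT sum(hash(*)) FROM parquet_zstd_table").show(truncate=False)
```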
### Why are the changes needed?

The issue is fixed in the Parquet community by
[PARQUET-2160](https://issues.apache.org/jira/browse/PARQUET-2160), but a
patched release is not available yet.

### Does this PR introduce _any_ user-facing change?

Yes, it's a bug fix.

### How was this patch tested?

The existing UTs should cover the correctness check. I also verified this
patch by scanning a large parquet/zstd table.

```
spark-shell --executor-cores 4 --executor-memory 6g --conf 
spark.executor.memoryOverhead=2g
```

```
spark.sql("select sum(hash(*)) from parquet_zstd_table ").show(false)
```

- before this patch

All executors get killed by NM quickly.
```
ERROR YarnScheduler: Lost executor 1 on hadoop-..org: Container 
killed by YARN for exceeding physical memory limits. 8.2 GB of 8 GB physical 
memory used. Consider boosting spark.executor.memoryOverhead.
```
https://user-images.githubusercontent.com/26535726/220031678-e9060244-5586-4f0c-8fe7-55bb4e20a580.png

- after this patch

Query runs well, no executor gets killed.

https://user-images.githubusercontent.com/26535726/220031917-4fe38c07-b38f-49c6-a982-2091a6c2a8ed.png

Closes #40091 from pan3793/SPARK-41952.

Authored-by: Cheng Pan 
Signed-off-by: Chao Sun 
---
 .../datasources/parquet/ParquetCodecFactory.java   | 112 +
 .../parquet/SpecificParquetRecordReaderBase.java   |   2 +
 2 files changed, 114 insertions(+)

diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCodecFactory.java
 
b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCodecFactory.java
new file mode 100644
index 000..2edbdc70da2
--- /dev/null
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCodecFactory.java
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.compress.CodecPool;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.Decompressor;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.hadoop.CodecFactory;
+import org.apache.parquet.hadoop.codec.ZstandardCodec;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+
+/**
+ * This class implements a codec factory that is used when reading from 
Parquet. It adds a
+ * workaround for memory issues encountered when reading from zstd-compressed 
files. For
+ * details, see https://issues.apache.org/jira/browse/PARQUET-2160";>PARQUET-2160
+ *
+ * TODO: Remove this workaround after upgrading Parquet which include 
PARQUET-2160.
+ */
+public class ParquetCodecFactory extends CodecFactory {
+
+  

[spark] branch branch-3.2 updated: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160

2023-02-20 Thread sunchao
This is an automated email from the ASF dual-hosted git repository.

sunchao pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 0b84da9b59e [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak 
as a workaround for PARQUET-2160
0b84da9b59e is described below

commit 0b84da9b59e6619f1837904aa8105109bd7c45e6
Author: Cheng Pan 
AuthorDate: Mon Feb 20 09:40:44 2023 -0800

[SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround 
for PARQUET-2160

### What changes were proposed in this pull request?

SPARK-41952 was raised a while ago, but unfortunately the Parquet
community has not published a patched version yet. As a workaround, we can fix
the issue on the Spark side first.

We encountered this memory issue when migrating data from parquet/snappy to
parquet/zstd: Spark executors always occupy an unreasonable amount of off-heap
memory and have a high risk of being killed by the NodeManager (NM).

See more discussions at https://github.com/apache/parquet-mr/pull/982 and 
https://github.com/apache/iceberg/pull/5681

### Why are the changes needed?

The issue is fixed in the Parquet community by
[PARQUET-2160](https://issues.apache.org/jira/browse/PARQUET-2160), but a
patched release is not available yet.

### Does this PR introduce _any_ user-facing change?

Yes, it's a bug fix.

### How was this patch tested?

The existing UTs should cover the correctness check. I also verified this
patch by scanning a large parquet/zstd table.

```
spark-shell --executor-cores 4 --executor-memory 6g --conf 
spark.executor.memoryOverhead=2g
```

```
spark.sql("select sum(hash(*)) from parquet_zstd_table ").show(false)
```

- before this patch

All executors get killed by NM quickly.
```
ERROR YarnScheduler: Lost executor 1 on hadoop-..org: Container 
killed by YARN for exceeding physical memory limits. 8.2 GB of 8 GB physical 
memory used. Consider boosting spark.executor.memoryOverhead.
```
https://user-images.githubusercontent.com/26535726/220031678-e9060244-5586-4f0c-8fe7-55bb4e20a580.png

- after this patch

Query runs well, no executor gets killed.

https://user-images.githubusercontent.com/26535726/220031917-4fe38c07-b38f-49c6-a982-2091a6c2a8ed.png

Closes #40091 from pan3793/SPARK-41952.

Authored-by: Cheng Pan 
Signed-off-by: Chao Sun 
---
 .../datasources/parquet/ParquetCodecFactory.java   | 112 +
 .../parquet/SpecificParquetRecordReaderBase.java   |   2 +
 2 files changed, 114 insertions(+)

diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCodecFactory.java
 
b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCodecFactory.java
new file mode 100644
index 000..2edbdc70da2
--- /dev/null
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCodecFactory.java
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.compress.CodecPool;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.Decompressor;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.hadoop.CodecFactory;
+import org.apache.parquet.hadoop.codec.ZstandardCodec;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+
+/**
+ * This class implements a codec factory that is used when reading from 
Parquet. It adds a
+ * workaround for memory issues encountered when reading from zstd-compressed 
files. For
+ * details, see https://issues.apache.org/jira/browse/PARQUET-2160";>PARQUET-2160
+ *
+ * TODO: Remove this workaround after upgrading Parquet which include 
PARQUET-2160.
+ */
+public class ParquetCodecFactory extends CodecFactory {
+
+  

[spark] branch branch-3.3 updated: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160

2023-02-20 Thread sunchao
This is an automated email from the ASF dual-hosted git repository.

sunchao pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 6da0f88e513 [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak 
as a workaround for PARQUET-2160
6da0f88e513 is described below

commit 6da0f88e513b72e52acdfae6986e20bf3d4d51a6
Author: Cheng Pan 
AuthorDate: Mon Feb 20 09:40:44 2023 -0800

[SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround 
for PARQUET-2160

### What changes were proposed in this pull request?

SPARK-41952 was raised a while ago, but unfortunately the Parquet
community has not published a patched version yet. As a workaround, we can fix
the issue on the Spark side first.

We encountered this memory issue when migrating data from parquet/snappy to
parquet/zstd: Spark executors always occupy an unreasonable amount of off-heap
memory and have a high risk of being killed by the NodeManager (NM).

See more discussions at https://github.com/apache/parquet-mr/pull/982 and 
https://github.com/apache/iceberg/pull/5681

### Why are the changes needed?

The issue is fixed in the Parquet community by
[PARQUET-2160](https://issues.apache.org/jira/browse/PARQUET-2160), but a
patched release is not available yet.

### Does this PR introduce _any_ user-facing change?

Yes, it's a bug fix.

### How was this patch tested?

The existing UTs should cover the correctness check. I also verified this
patch by scanning a large parquet/zstd table.

```
spark-shell --executor-cores 4 --executor-memory 6g --conf 
spark.executor.memoryOverhead=2g
```

```
spark.sql("select sum(hash(*)) from parquet_zstd_table ").show(false)
```

- before this patch

All executors get killed by NM quickly.
```
ERROR YarnScheduler: Lost executor 1 on hadoop-..org: Container 
killed by YARN for exceeding physical memory limits. 8.2 GB of 8 GB physical 
memory used. Consider boosting spark.executor.memoryOverhead.
```
https://user-images.githubusercontent.com/26535726/220031678-e9060244-5586-4f0c-8fe7-55bb4e20a580.png

- after this patch

Query runs well, no executor gets killed.

https://user-images.githubusercontent.com/26535726/220031917-4fe38c07-b38f-49c6-a982-2091a6c2a8ed.png

Closes #40091 from pan3793/SPARK-41952.

Authored-by: Cheng Pan 
Signed-off-by: Chao Sun 
---
 .../datasources/parquet/ParquetCodecFactory.java   | 112 +
 .../parquet/SpecificParquetRecordReaderBase.java   |   2 +
 2 files changed, 114 insertions(+)

diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCodecFactory.java
 
b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCodecFactory.java
new file mode 100644
index 000..2edbdc70da2
--- /dev/null
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCodecFactory.java
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.compress.CodecPool;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.Decompressor;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.hadoop.CodecFactory;
+import org.apache.parquet.hadoop.codec.ZstandardCodec;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+
+/**
+ * This class implements a codec factory that is used when reading from 
Parquet. It adds a
+ * workaround for memory issues encountered when reading from zstd-compressed 
files. For
+ * details, see https://issues.apache.org/jira/browse/PARQUET-2160";>PARQUET-2160
+ *
+ * TODO: Remove this workaround after upgrading Parquet which include 
PARQUET-2160.
+ */
+public class ParquetCodecFactory extends CodecFactory {
+
+  

[spark] branch master updated: [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround for PARQUET-2160

2023-02-20 Thread sunchao
This is an automated email from the ASF dual-hosted git repository.

sunchao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1688a8768fb [SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak 
as a workaround for PARQUET-2160
1688a8768fb is described below

commit 1688a8768fb34060548f8790e77f645027f65db2
Author: Cheng Pan 
AuthorDate: Mon Feb 20 09:40:44 2023 -0800

[SPARK-41952][SQL] Fix Parquet zstd off-heap memory leak as a workaround 
for PARQUET-2160

### What changes were proposed in this pull request?

SPARK-41952 was raised a while ago, but unfortunately the Parquet
community has not published a patched version yet. As a workaround, we can fix
the issue on the Spark side first.

We encountered this memory issue when migrating data from parquet/snappy to
parquet/zstd: Spark executors always occupy an unreasonable amount of off-heap
memory and have a high risk of being killed by the NodeManager (NM).

See more discussions at https://github.com/apache/parquet-mr/pull/982 and 
https://github.com/apache/iceberg/pull/5681

### Why are the changes needed?

The issue is fixed in the Parquet community by
[PARQUET-2160](https://issues.apache.org/jira/browse/PARQUET-2160), but a
patched release is not available yet.

### Does this PR introduce _any_ user-facing change?

Yes, it's a bug fix.

### How was this patch tested?

The existing UTs should cover the correctness check. I also verified this
patch by scanning a large parquet/zstd table.

```
spark-shell --executor-cores 4 --executor-memory 6g --conf 
spark.executor.memoryOverhead=2g
```

```
spark.sql("select sum(hash(*)) from parquet_zstd_table ").show(false)
```

- before this patch

All executors get killed by NM quickly.
```
ERROR YarnScheduler: Lost executor 1 on hadoop-..org: Container 
killed by YARN for exceeding physical memory limits. 8.2 GB of 8 GB physical 
memory used. Consider boosting spark.executor.memoryOverhead.
```
https://user-images.githubusercontent.com/26535726/220031678-e9060244-5586-4f0c-8fe7-55bb4e20a580.png

- after this patch

Query runs well, no executor gets killed.

https://user-images.githubusercontent.com/26535726/220031917-4fe38c07-b38f-49c6-a982-2091a6c2a8ed.png

Closes #40091 from pan3793/SPARK-41952.

Authored-by: Cheng Pan 
Signed-off-by: Chao Sun 
---
 .../datasources/parquet/ParquetCodecFactory.java   | 112 +
 .../parquet/SpecificParquetRecordReaderBase.java   |   2 +
 2 files changed, 114 insertions(+)

diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCodecFactory.java
 
b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCodecFactory.java
new file mode 100644
index 000..2edbdc70da2
--- /dev/null
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCodecFactory.java
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.compress.CodecPool;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.Decompressor;
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.hadoop.CodecFactory;
+import org.apache.parquet.hadoop.codec.ZstandardCodec;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+
+/**
+ * This class implements a codec factory that is used when reading from 
Parquet. It adds a
+ * workaround for memory issues encountered when reading from zstd-compressed 
files. For
+ * details, see https://issues.apache.org/jira/browse/PARQUET-2160";>PARQUET-2160
+ *
+ * TODO: Remove this workaround after upgrading Parquet which include 
PARQUET-2160.
+ */
+public class ParquetCodecFactory extends CodecFactory {
+
+  public P

[spark] branch branch-3.4 updated: [SPARK-42423][SQL] Add metadata column file block start and length

2023-02-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 18d5d5e81ad [SPARK-42423][SQL] Add metadata column file block start 
and length
18d5d5e81ad is described below

commit 18d5d5e81adf70b94e968da6bd5bb783ff4ceb35
Author: ulysses-you 
AuthorDate: Mon Feb 20 22:33:36 2023 +0800

[SPARK-42423][SQL] Add metadata column file block start and length

### What changes were proposed in this pull request?

Support `_metadata.file_block_start` and `_metadata.file_block_length` for
datasource file metadata columns.

Note that they do not support data filters, since the block start and length
are only known after splitting files.

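For reference, a minimal PySpark sketch of selecting the new metadata fields (the input path is an illustrative assumption):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-block-metadata-demo").getOrCreate()

# _metadata is the hidden struct column exposed by file-based data sources;
# file_block_start / file_block_length describe the split each row came from.
df = spark.read.parquet("/data/large_table")  # illustrative path

df.select(
    "_metadata.file_path",
    "_metadata.file_block_start",
    "_metadata.file_block_length",
    "_metadata.file_size",
).show(truncate=False)

# As noted above, predicates on the block fields are not used for data
# skipping, because splits are only known after file listing and splitting.
```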
### Why are the changes needed?

To improve observability.

Currently, we have a built-in function `InputFileBlockStart` which has some
issues, e.g. it does not work for joins. It's better to encourage people to
switch to the metadata column.

File block length is also important information: people can see how Spark
splits big files.

### Does this PR introduce _any_ user-facing change?

yes

### How was this patch tested?

Improved existing tests and added new tests.

Closes #39996 from ulysses-you/SPARK-42423.

Authored-by: ulysses-you 
Signed-off-by: Wenchen Fan 
(cherry picked from commit ae97131f1afa5deac2bd183872cedd8829024efa)
Signed-off-by: Wenchen Fan 
---
 .../sql/execution/datasources/FileFormat.scala |  17 +++-
 .../sql/execution/datasources/FileScanRDD.scala|  11 ++-
 .../execution/datasources/FileSourceStrategy.scala |   1 +
 .../datasources/PartitioningAwareFileIndex.scala   |   5 +-
 .../FileMetadataStructRowIndexSuite.scala  |   3 +-
 .../datasources/FileMetadataStructSuite.scala  | 108 ++---
 6 files changed, 125 insertions(+), 20 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala
index 8811c1fd5f8..3d7e2c8bf3e 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala
@@ -184,6 +184,10 @@ object FileFormat {
 
   val FILE_NAME = "file_name"
 
+  val FILE_BLOCK_START = "file_block_start"
+
+  val FILE_BLOCK_LENGTH = "file_block_length"
+
   val FILE_SIZE = "file_size"
 
   val FILE_MODIFICATION_TIME = "file_modification_time"
@@ -212,6 +216,8 @@ object FileFormat {
 .add(StructField(FileFormat.FILE_PATH, StringType, nullable = false))
 .add(StructField(FileFormat.FILE_NAME, StringType, nullable = false))
 .add(StructField(FileFormat.FILE_SIZE, LongType, nullable = false))
+.add(StructField(FileFormat.FILE_BLOCK_START, LongType, nullable = false))
+.add(StructField(FileFormat.FILE_BLOCK_LENGTH, LongType, nullable = false))
 .add(StructField(FileFormat.FILE_MODIFICATION_TIME, TimestampType, 
nullable = false))
 
   /**
@@ -231,9 +237,12 @@ object FileFormat {
   fieldNames: Seq[String],
   filePath: Path,
   fileSize: Long,
-  fileModificationTime: Long): InternalRow =
+  fileModificationTime: Long): InternalRow = {
+// We are not aware of `FILE_BLOCK_START` and `FILE_BLOCK_LENGTH` before 
splitting files
+assert(!fieldNames.contains(FILE_BLOCK_START) && 
!fieldNames.contains(FILE_BLOCK_LENGTH))
 updateMetadataInternalRow(new GenericInternalRow(fieldNames.length), 
fieldNames,
-  filePath, fileSize, fileModificationTime)
+  filePath, fileSize, 0L, fileSize, fileModificationTime)
+  }
 
   // update an internal row given required metadata fields and file information
   def updateMetadataInternalRow(
@@ -241,12 +250,16 @@ object FileFormat {
   fieldNames: Seq[String],
   filePath: Path,
   fileSize: Long,
+  fileBlockStart: Long,
+  fileBlockLength: Long,
   fileModificationTime: Long): InternalRow = {
 fieldNames.zipWithIndex.foreach { case (name, i) =>
   name match {
 case FILE_PATH => row.update(i, 
UTF8String.fromString(filePath.toString))
 case FILE_NAME => row.update(i, 
UTF8String.fromString(filePath.getName))
 case FILE_SIZE => row.update(i, fileSize)
+case FILE_BLOCK_START => row.update(i, fileBlockStart)
+case FILE_BLOCK_LENGTH => row.update(i, fileBlockLength)
 case FILE_MODIFICATION_TIME =>
   // the modificationTime from the file is in millisecond,
   // while internally, the TimestampType `file_modification_time` is 
stored in microsecond
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/

[spark] branch master updated: [SPARK-42423][SQL] Add metadata column file block start and length

2023-02-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ae97131f1af [SPARK-42423][SQL] Add metadata column file block start 
and length
ae97131f1af is described below

commit ae97131f1afa5deac2bd183872cedd8829024efa
Author: ulysses-you 
AuthorDate: Mon Feb 20 22:33:36 2023 +0800

[SPARK-42423][SQL] Add metadata column file block start and length

### What changes were proposed in this pull request?

Support `_metadata.file_block_start` and `_metadata.file_block_length` for
datasource file metadata columns.

Note that they do not support data filters, since the block start and length
are only known after splitting files.

### Why are the changes needed?

To improve observability.

Currently, we have a built-in function `InputFileBlockStart` which has some
issues, e.g. it does not work for joins. It's better to encourage people to
switch to the metadata column.

File block length is also important information: people can see how Spark
splits big files.

### Does this PR introduce _any_ user-facing change?

yes

### How was this patch tested?

Improved existing tests and added new tests.

Closes #39996 from ulysses-you/SPARK-42423.

Authored-by: ulysses-you 
Signed-off-by: Wenchen Fan 
---
 .../sql/execution/datasources/FileFormat.scala |  17 +++-
 .../sql/execution/datasources/FileScanRDD.scala|  11 ++-
 .../execution/datasources/FileSourceStrategy.scala |   1 +
 .../datasources/PartitioningAwareFileIndex.scala   |   5 +-
 .../FileMetadataStructRowIndexSuite.scala  |   3 +-
 .../datasources/FileMetadataStructSuite.scala  | 108 ++---
 6 files changed, 125 insertions(+), 20 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala
index 8811c1fd5f8..3d7e2c8bf3e 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala
@@ -184,6 +184,10 @@ object FileFormat {
 
   val FILE_NAME = "file_name"
 
+  val FILE_BLOCK_START = "file_block_start"
+
+  val FILE_BLOCK_LENGTH = "file_block_length"
+
   val FILE_SIZE = "file_size"
 
   val FILE_MODIFICATION_TIME = "file_modification_time"
@@ -212,6 +216,8 @@ object FileFormat {
 .add(StructField(FileFormat.FILE_PATH, StringType, nullable = false))
 .add(StructField(FileFormat.FILE_NAME, StringType, nullable = false))
 .add(StructField(FileFormat.FILE_SIZE, LongType, nullable = false))
+.add(StructField(FileFormat.FILE_BLOCK_START, LongType, nullable = false))
+.add(StructField(FileFormat.FILE_BLOCK_LENGTH, LongType, nullable = false))
 .add(StructField(FileFormat.FILE_MODIFICATION_TIME, TimestampType, 
nullable = false))
 
   /**
@@ -231,9 +237,12 @@ object FileFormat {
   fieldNames: Seq[String],
   filePath: Path,
   fileSize: Long,
-  fileModificationTime: Long): InternalRow =
+  fileModificationTime: Long): InternalRow = {
+// We are not aware of `FILE_BLOCK_START` and `FILE_BLOCK_LENGTH` before 
splitting files
+assert(!fieldNames.contains(FILE_BLOCK_START) && 
!fieldNames.contains(FILE_BLOCK_LENGTH))
 updateMetadataInternalRow(new GenericInternalRow(fieldNames.length), 
fieldNames,
-  filePath, fileSize, fileModificationTime)
+  filePath, fileSize, 0L, fileSize, fileModificationTime)
+  }
 
   // update an internal row given required metadata fields and file information
   def updateMetadataInternalRow(
@@ -241,12 +250,16 @@ object FileFormat {
   fieldNames: Seq[String],
   filePath: Path,
   fileSize: Long,
+  fileBlockStart: Long,
+  fileBlockLength: Long,
   fileModificationTime: Long): InternalRow = {
 fieldNames.zipWithIndex.foreach { case (name, i) =>
   name match {
 case FILE_PATH => row.update(i, 
UTF8String.fromString(filePath.toString))
 case FILE_NAME => row.update(i, 
UTF8String.fromString(filePath.getName))
 case FILE_SIZE => row.update(i, fileSize)
+case FILE_BLOCK_START => row.update(i, fileBlockStart)
+case FILE_BLOCK_LENGTH => row.update(i, fileBlockLength)
 case FILE_MODIFICATION_TIME =>
   // the modificationTime from the file is in millisecond,
   // while internally, the TimestampType `file_modification_time` is 
stored in microsecond
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala
index 0ccf72823f1..7fb2d9c8ac7 100644
--- 
a/sql/core/src/main/scala/org/

[spark] branch branch-3.4 updated: [SPARK-42476][CONNECT][DOCS] Complete Spark Connect API reference

2023-02-20 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new ad0607aa693 [SPARK-42476][CONNECT][DOCS] Complete Spark Connect API 
reference
ad0607aa693 is described below

commit ad0607aa693eca86b02c113a6e8b2ae0e7afade2
Author: itholic 
AuthorDate: Mon Feb 20 23:23:17 2023 +0900

[SPARK-42476][CONNECT][DOCS] Complete Spark Connect API reference

### What changes were proposed in this pull request?

This PR proposes to complete the missing API reference for Spark Connect.

The built API docs should include a "Changed in version" note for Spark Connect
wherever it is implemented, as below:

https://user-images.githubusercontent.com/44108233/219986313-374e0959-b8c5-44f6-942c-bba1c0407909.png

### Why are the changes needed?

Improving usability for Spark Connect.

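To illustrate what "Support Spark Connect" means for these APIs in practice, a minimal PySpark sketch (the `sc://` endpoint is an illustrative assumption):

```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Connect to a Spark Connect server instead of creating a classic session;
# the endpoint below is an illustrative assumption.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(10).withColumn("even", F.col("id") % 2 == 0)

# Column methods documented with ".. versionchanged:: 3.4.0  Support Spark
# Connect" behave the same through the Connect client as with a regular session.
df.orderBy(F.col("id").desc_nulls_last()).show()
```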
### Does this PR introduce _any_ user-facing change?

No, it's documentation.

### How was this patch tested?

Manually built the docs and confirmed each function and class one by one.

Closes #40067 from itholic/SPARK-42476.

Lead-authored-by: itholic 
Co-authored-by: Haejoon Lee <44108233+itho...@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit e6c201df33b123c3bfc632012abeaa0db6c417bc)
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/ml/feature.py |  2 +-
 python/pyspark/sql/column.py | 27 ---
 python/pyspark/sql/connect/client.py | 30 +++---
 python/pyspark/sql/connect/column.py | 12 
 python/pyspark/sql/connect/udf.py|  2 ++
 python/pyspark/sql/dataframe.py  | 27 +++
 python/pyspark/sql/functions.py  | 33 +
 python/pyspark/sql/group.py  |  3 +++
 python/pyspark/sql/session.py|  8 
 9 files changed, 137 insertions(+), 7 deletions(-)

diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index 43e658d7f69..ff7aaf71f9c 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -3476,7 +3476,7 @@ class QuantileDiscretizer(
 non-NaN data will be put into buckets[0-3], but NaNs will be counted in a 
special bucket[4].
 
 Algorithm: The bin ranges are chosen using an approximate algorithm (see 
the documentation for
-:py:meth:`~.DataFrameStatFunctions.approxQuantile` for a detailed 
description).
+:py:meth:`pyspark.sql.DataFrameStatFunctions.approxQuantile` for a 
detailed description).
 The precision of the approximation can be controlled with the
 :py:attr:`relativeError` parameter.
 The lower and upper bin bounds will be `-Infinity` and `+Infinity`, 
covering all real values.
diff --git a/python/pyspark/sql/column.py b/python/pyspark/sql/column.py
index 0b5f94cfaaa..bcf6676d5ca 100644
--- a/python/pyspark/sql/column.py
+++ b/python/pyspark/sql/column.py
@@ -669,14 +669,14 @@ class Column:
 _startswith_doc = """
 String starts with. Returns a boolean :class:`Column` based on a string 
match.
 
+.. versionchanged:: 3.4.0
+Support Spark Connect.
+
 Parameters
 --
 other : :class:`Column` or str
 string at start of line (do not use a regex `^`)
 
-.. versionchanged:: 3.4.0
-Support Spark Connect.
-
 Examples
 
 >>> df = spark.createDataFrame(
@@ -903,6 +903,9 @@ class Column:
 _asc_doc = """
 Returns a sort expression based on the ascending order of the column.
 
+.. versionchanged:: 3.4.0
+Support Spark Connect.
+
 Examples
 
 >>> from pyspark.sql import Row
@@ -916,6 +919,9 @@ class Column:
 
 .. versionadded:: 2.4.0
 
+.. versionchanged:: 3.4.0
+Support Spark Connect.
+
 Examples
 
 >>> from pyspark.sql import Row
@@ -930,6 +936,9 @@ class Column:
 
 .. versionadded:: 2.4.0
 
+.. versionchanged:: 3.4.0
+Support Spark Connect.
+
 Examples
 
 >>> from pyspark.sql import Row
@@ -943,6 +952,9 @@ class Column:
 
 .. versionadded:: 2.4.0
 
+.. versionchanged:: 3.4.0
+Support Spark Connect.
+
 Examples
 
 >>> from pyspark.sql import Row
@@ -956,6 +968,9 @@ class Column:
 
 .. versionadded:: 2.4.0
 
+.. versionchanged:: 3.4.0
+Support Spark Connect.
+
 Examples
 
 >>> from pyspark.sql import Row
@@ -970,6 +985,9 @@ class Column:
 
 .. versionadded:: 2.4.0
 
+.. versionchanged:: 3.4.0
+Support Spark Connect.
+
 Examples
 
 >>> from pyspark.sql import Row
@@ -1128,6 +1146,9 @@ class Column:
 
 .. versionadded:: 1.3.0
 
+.. versionchanged:: 3.4.0
+Support Spark Connect.
+
 Paramete

[spark] branch master updated: [SPARK-42476][CONNECT][DOCS] Complete Spark Connect API reference

2023-02-20 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e6c201df33b [SPARK-42476][CONNECT][DOCS] Complete Spark Connect API 
reference
e6c201df33b is described below

commit e6c201df33b123c3bfc632012abeaa0db6c417bc
Author: itholic 
AuthorDate: Mon Feb 20 23:23:17 2023 +0900

[SPARK-42476][CONNECT][DOCS] Complete Spark Connect API reference

### What changes were proposed in this pull request?

This PR proposes to complete missing API Reference for Spark Connect.

Built API docs should include a "Changed in version" note for Spark Connect 
where it is implemented, as shown below:

https://user-images.githubusercontent.com/44108233/219986313-374e0959-b8c5-44f6-942c-bba1c0407909.png

### Why are the changes needed?

Improving usability for Spark Connect.

### Does this PR introduce _any_ user-facing change?

No, it's documentation.

### How was this patch tested?

Manually built docs, confirmed each function and class one by one.

Closes #40067 from itholic/SPARK-42476.

Lead-authored-by: itholic 
Co-authored-by: Haejoon Lee <44108233+itho...@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/ml/feature.py |  2 +-
 python/pyspark/sql/column.py | 27 ---
 python/pyspark/sql/connect/client.py | 30 +++---
 python/pyspark/sql/connect/column.py | 12 
 python/pyspark/sql/connect/udf.py|  2 ++
 python/pyspark/sql/dataframe.py  | 27 +++
 python/pyspark/sql/functions.py  | 33 +
 python/pyspark/sql/group.py  |  3 +++
 python/pyspark/sql/session.py|  8 
 9 files changed, 137 insertions(+), 7 deletions(-)

diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index 43e658d7f69..ff7aaf71f9c 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -3476,7 +3476,7 @@ class QuantileDiscretizer(
 non-NaN data will be put into buckets[0-3], but NaNs will be counted in a 
special bucket[4].
 
 Algorithm: The bin ranges are chosen using an approximate algorithm (see 
the documentation for
-:py:meth:`~.DataFrameStatFunctions.approxQuantile` for a detailed 
description).
+:py:meth:`pyspark.sql.DataFrameStatFunctions.approxQuantile` for a 
detailed description).
 The precision of the approximation can be controlled with the
 :py:attr:`relativeError` parameter.
 The lower and upper bin bounds will be `-Infinity` and `+Infinity`, 
covering all real values.
diff --git a/python/pyspark/sql/column.py b/python/pyspark/sql/column.py
index 0b5f94cfaaa..bcf6676d5ca 100644
--- a/python/pyspark/sql/column.py
+++ b/python/pyspark/sql/column.py
@@ -669,14 +669,14 @@ class Column:
 _startswith_doc = """
 String starts with. Returns a boolean :class:`Column` based on a string 
match.
 
+.. versionchanged:: 3.4.0
+Support Spark Connect.
+
 Parameters
 --
 other : :class:`Column` or str
 string at start of line (do not use a regex `^`)
 
-.. versionchanged:: 3.4.0
-Support Spark Connect.
-
 Examples
 
 >>> df = spark.createDataFrame(
@@ -903,6 +903,9 @@ class Column:
 _asc_doc = """
 Returns a sort expression based on the ascending order of the column.
 
+.. versionchanged:: 3.4.0
+Support Spark Connect.
+
 Examples
 
 >>> from pyspark.sql import Row
@@ -916,6 +919,9 @@ class Column:
 
 .. versionadded:: 2.4.0
 
+.. versionchanged:: 3.4.0
+Support Spark Connect.
+
 Examples
 
 >>> from pyspark.sql import Row
@@ -930,6 +936,9 @@ class Column:
 
 .. versionadded:: 2.4.0
 
+.. versionchanged:: 3.4.0
+Support Spark Connect.
+
 Examples
 
 >>> from pyspark.sql import Row
@@ -943,6 +952,9 @@ class Column:
 
 .. versionadded:: 2.4.0
 
+.. versionchanged:: 3.4.0
+Support Spark Connect.
+
 Examples
 
 >>> from pyspark.sql import Row
@@ -956,6 +968,9 @@ class Column:
 
 .. versionadded:: 2.4.0
 
+.. versionchanged:: 3.4.0
+Support Spark Connect.
+
 Examples
 
 >>> from pyspark.sql import Row
@@ -970,6 +985,9 @@ class Column:
 
 .. versionadded:: 2.4.0
 
+.. versionchanged:: 3.4.0
+Support Spark Connect.
+
 Examples
 
 >>> from pyspark.sql import Row
@@ -1128,6 +1146,9 @@ class Column:
 
 .. versionadded:: 1.3.0
 
+.. versionchanged:: 3.4.0
+Support Spark Connect.
+
 Parameters
 --
 lowerBound : :class:`Column`, int, float, string, bool, datetime, date 
or Decimal

[spark] branch master updated: [SPARK-42490][BUILD] Upgrade protobuf-java from 3.21.12 to 3.22.0

2023-02-20 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 66ab715a6be [SPARK-42490][BUILD] Upgrade protobuf-java from 3.21.12 to 
3.22.0
66ab715a6be is described below

commit 66ab715a6bef2d88edc33f146a6c7a504cc7c388
Author: yangjie01 
AuthorDate: Mon Feb 20 08:17:13 2023 -0600

[SPARK-42490][BUILD] Upgrade protobuf-java from 3.21.12 to 3.22.0

### What changes were proposed in this pull request?
This PR upgrades `protobuf-java` from 3.21.12 to 3.22.0.

### Why are the changes needed?
The new version brings improvements such as:

- Use bit-field int values in buildPartial to skip work on unset groups of 
fields. 
(https://github.com/protocolbuffers/protobuf/commit/2326aef1a454a4eea363cc6ed8b8def8b88365f5)
- Fix serialization warnings in generated code when compiling with Java 18 
and above (https://github.com/protocolbuffers/protobuf/pull/10561)
- Enable Text format parser to skip unknown short-formed repeated fields. 
(https://github.com/protocolbuffers/protobuf/commit/6dbd4131fa6b2ad29b2b1b827f21fc61b160aeeb)
- Add serialVersionUID to ByteString and subclasses 
(https://github.com/protocolbuffers/protobuf/pull/10718)

and some bug fixes, such as:
- Mark default instance as immutable first to avoid race during static 
initialization of default instances. 
(https://github.com/protocolbuffers/protobuf/pull/10770)

- Fix Timestamps fromDate for negative 'exact second' java.sql.Timestamps 
(https://github.com/protocolbuffers/protobuf/pull/10321)
- Fix Timestamps.fromDate to correctly handle java.sql.Timestamps before 
unix epoch (https://github.com/protocolbuffers/protobuf/pull/10126)
- Fix bug in nested builder caching logic where cleared sub-field builders 
would remain dirty after a clear and build in a parent layer. 
https://github.com/protocolbuffers/protobuf/issues/10624

The release notes are as follows:

- https://github.com/protocolbuffers/protobuf/releases/tag/v22.0

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #40084 from LuciferYang/SPARK-42490.

Authored-by: yangjie01 
Signed-off-by: Sean Owen 
---
 pom.xml  | 2 +-
 project/SparkBuild.scala | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/pom.xml b/pom.xml
index 7a81101d2d4..ae65be7d3e3 100644
--- a/pom.xml
+++ b/pom.xml
@@ -124,7 +124,7 @@
 
2.5.0
 
 
-3.21.12
+3.22.0
 3.11.4
 ${hadoop.version}
 3.6.3
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index 4b077f593fe..2c3907bc734 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -88,7 +88,7 @@ object BuildCommons {
 
   // Google Protobuf version used for generating the protobuf.
   // SPARK-41247: needs to be consistent with `protobuf.version` in `pom.xml`.
-  val protoVersion = "3.21.12"
+  val protoVersion = "3.22.0"
   // GRPC version used for Spark Connect.
   val gprcVersion = "1.47.0"
 }





[spark] branch master updated: [SPARK-42489][BUILD] Upgrade scala-parser-combinators from 2.1.1 to 2.2.0

2023-02-20 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 09d1e947264 [SPARK-42489][BUILD] Upgrade scala-parser-combinators from 
2.1.1 to 2.2.0
09d1e947264 is described below

commit 09d1e9472642a4ca76cd320f86e1c4373842b113
Author: yangjie01 
AuthorDate: Mon Feb 20 08:16:04 2023 -0600

[SPARK-42489][BUILD] Upgrade scala-parser-combinators from 2.1.1 to 2.2.0

### What changes were proposed in this pull request?
This PR upgrades `scala-parser-combinators` from 2.1.1 to 2.2.0.

### Why are the changes needed?
https://github.com/scala/scala-parser-combinators/pull/496 adds 
`NoSuccess.I` to help users avoid exhaustiveness warnings in their pattern 
matches, especially on Scala 2.13 and 3. The full release notes are as follows:
- https://github.com/scala/scala-parser-combinators/releases/tag/v2.2.0
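
For context, a minimal sketch of the kind of match this affects (illustration 
only, not code from this PR; the exact `NoSuccess.I` usage comes from the 
upstream release notes and is referenced only in a comment):

```scala
import scala.util.parsing.combinator.RegexParsers

object IntParser extends RegexParsers {
  def int: Parser[Int] = """-?\d+""".r ^^ (_.toInt)

  // Matching Success plus the NoSuccess *class* (covering Failure and Error)
  // is the usual way to keep this match exhaustive; per the commit message,
  // 2.2.0's NoSuccess.I extractor is meant to express the same thing without
  // exhaustiveness warnings on Scala 2.13 and 3.
  def parseOrReport(s: String): Either[String, Int] =
    parseAll(int, s) match {
      case Success(v, _) => Right(v)
      case ns: NoSuccess => Left(ns.msg)
    }
}
```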

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #40083 from LuciferYang/SPARK-42489.

Authored-by: yangjie01 
Signed-off-by: Sean Owen 
---
 dev/deps/spark-deps-hadoop-2-hive-2.3 | 2 +-
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml   | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2-hive-2.3 
b/dev/deps/spark-deps-hadoop-2-hive-2.3
index 57739a7c0ff..f858c9782cc 100644
--- a/dev/deps/spark-deps-hadoop-2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2-hive-2.3
@@ -242,7 +242,7 @@ rocksdbjni/7.9.2//rocksdbjni-7.9.2.jar
 scala-collection-compat_2.12/2.7.0//scala-collection-compat_2.12-2.7.0.jar
 scala-compiler/2.12.17//scala-compiler-2.12.17.jar
 scala-library/2.12.17//scala-library-2.12.17.jar
-scala-parser-combinators_2.12/2.1.1//scala-parser-combinators_2.12-2.1.1.jar
+scala-parser-combinators_2.12/2.2.0//scala-parser-combinators_2.12-2.2.0.jar
 scala-reflect/2.12.17//scala-reflect-2.12.17.jar
 scala-xml_2.12/2.1.0//scala-xml_2.12-2.1.0.jar
 shims/0.9.39//shims-0.9.39.jar
diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 3b54e0365e0..01345fd13ff 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -229,7 +229,7 @@ rocksdbjni/7.9.2//rocksdbjni-7.9.2.jar
 scala-collection-compat_2.12/2.7.0//scala-collection-compat_2.12-2.7.0.jar
 scala-compiler/2.12.17//scala-compiler-2.12.17.jar
 scala-library/2.12.17//scala-library-2.12.17.jar
-scala-parser-combinators_2.12/2.1.1//scala-parser-combinators_2.12-2.1.1.jar
+scala-parser-combinators_2.12/2.2.0//scala-parser-combinators_2.12-2.2.0.jar
 scala-reflect/2.12.17//scala-reflect-2.12.17.jar
 scala-xml_2.12/2.1.0//scala-xml_2.12-2.1.0.jar
 shims/0.9.39//shims-0.9.39.jar
diff --git a/pom.xml b/pom.xml
index e2fee86682d..7a81101d2d4 100644
--- a/pom.xml
+++ b/pom.xml
@@ -1119,7 +1119,7 @@
   
 org.scala-lang.modules
 
scala-parser-combinators_${scala.binary.version}
-2.1.1
+2.2.0
   
   
 jline





[spark] branch branch-3.4 updated: [SPARK-42477][CONNECT][PYTHON] accept user_agent in spark connect's connection string

2023-02-20 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 014c60fb5bf [SPARK-42477][CONNECT][PYTHON] accept user_agent in spark 
connect's connection string
014c60fb5bf is described below

commit 014c60fb5bf712afafca4eef884665d4245d4aaf
Author: Niranjan Jayakar 
AuthorDate: Mon Feb 20 23:02:11 2023 +0900

[SPARK-42477][CONNECT][PYTHON] accept user_agent in spark connect's 
connection string

### What changes were proposed in this pull request?

Currently, the Spark Connect service's `client_type` attribute (which is 
really [user
agent]) is set to `_SPARK_CONNECT_PYTHON` to signify PySpark.

With this change, the connection for the Spark Connect remote accepts an 
optional
`user_agent` parameter which is then passed down to the service.

[user agent]: https://www.w3.org/WAI/UA/work/wiki/Definition_of_User_Agent

### Why are the changes needed?

This enables partners using Spark Connect to set their application as the 
user agent, which then allows visibility into and measurement of 
integrations and usage of Spark Connect.

### Does this PR introduce _any_ user-facing change?

A new optional `user_agent` parameter is now recognized as part of the 
Spark Connect
connection string.
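
For illustration only (the host and option values below are made-up examples, 
not from this patch), the parameter is appended like any other 
connection-string option:

```scala
// Composing a Spark Connect connection string that carries the new
// user_agent parameter. The resulting URL can then be handed to a Connect
// client, for example via the SPARK_REMOTE environment variable.
val host = "sc-gateway.example.com"  // assumption: example hostname only
val connectUrl =
  s"sc://$host:15002/;use_ssl=true;user_agent=my_data_query_app"
```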

### How was this patch tested?

- unit tests attached
- manually running the `pyspark` binary with the `user_agent` connection 
string set and
   verifying the payload sent to the server. Similar testing for the 
default.

Closes #40054 from nija-at/user-agent.

Authored-by: Niranjan Jayakar 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit b887d3de954ae5b2482087fe08affcc4ac60c669)
Signed-off-by: Hyukjin Kwon 
---
 connector/connect/docs/client-connection-string.md | 12 +++-
 dev/sparktestsupport/modules.py|  1 +
 python/pyspark/sql/connect/client.py   | 29 +-
 python/pyspark/sql/tests/connect/test_client.py| 67 ++
 .../sql/tests/connect/test_connect_basic.py| 28 -
 5 files changed, 132 insertions(+), 5 deletions(-)

diff --git a/connector/connect/docs/client-connection-string.md 
b/connector/connect/docs/client-connection-string.md
index 8f1f0b8c631..6e5b0c80db7 100644
--- a/connector/connect/docs/client-connection-string.md
+++ b/connector/connect/docs/client-connection-string.md
@@ -58,7 +58,8 @@ sc://hostname:port/;param1=value;param2=value
 token
 String
 When this param is set in the URL, it will enable standard
-bearer token authentication using GRPC. By default this value is not 
set.
+bearer token authentication using GRPC. By default this value is not set.
+Setting this value enables SSL.
 token=ABCDEFGH
   
   
@@ -81,6 +82,15 @@ sc://hostname:port/;param1=value;param2=value
 user_id=Martin
 
   
+  
+user_agent
+String
+The user agent acting on behalf of the user, typically applications
+that use Spark Connect to implement its functionality and execute Spark
+requests on behalf of the user.
+Default: _SPARK_CONNECT_PYTHON in the Python client
+user_agent=my_data_query_app
+  
 
 
 ## Examples
diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index 94ae1ffbce6..75a6b4401b8 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -516,6 +516,7 @@ pyspark_connect = Module(
 "pyspark.sql.connect.dataframe",
 "pyspark.sql.connect.functions",
 # unittests
+"pyspark.sql.tests.connect.test_client",
 "pyspark.sql.tests.connect.test_connect_plan",
 "pyspark.sql.tests.connect.test_connect_basic",
 "pyspark.sql.tests.connect.test_connect_function",
diff --git a/python/pyspark/sql/connect/client.py 
b/python/pyspark/sql/connect/client.py
index aade0f6e050..78190b2c488 100644
--- a/python/pyspark/sql/connect/client.py
+++ b/python/pyspark/sql/connect/client.py
@@ -19,6 +19,8 @@ __all__ = [
 "SparkConnectClient",
 ]
 
+import string
+
 from pyspark.sql.connect.utils import check_dependencies
 
 check_dependencies(__name__, __file__)
@@ -120,6 +122,7 @@ class ChannelBuilder:
 PARAM_USE_SSL = "use_ssl"
 PARAM_TOKEN = "token"
 PARAM_USER_ID = "user_id"
+PARAM_USER_AGENT = "user_agent"
 
 @staticmethod
 def default_port() -> int:
@@ -215,6 +218,7 @@ class ChannelBuilder:
 ChannelBuilder.PARAM_TOKEN,
 ChannelBuilder.PARAM_USE_SSL,
 ChannelBuilder.PARAM_USER_ID,
+ChannelBuilder.PARAM_USER_AGENT,
 ]
 ]
 
@@ -244,6 +248,27 @@ class ChannelBuilder:
 """
 return self.params.get(ChannelBuilder.PARAM_USER_ID, None)
 
+@pro

[spark] branch master updated (d5fa41efe2b -> b887d3de954)

2023-02-20 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from d5fa41efe2b [SPARK-41741][SQL] Encode the string using the UTF_8 
charset in ParquetFilters
 add b887d3de954 [SPARK-42477][CONNECT][PYTHON] accept user_agent in spark 
connect's connection string

No new revisions were added by this update.

Summary of changes:
 connector/connect/docs/client-connection-string.md | 12 +++-
 dev/sparktestsupport/modules.py|  1 +
 python/pyspark/sql/connect/client.py   | 29 +-
 python/pyspark/sql/tests/connect/test_client.py| 67 ++
 .../sql/tests/connect/test_connect_basic.py| 28 -
 5 files changed, 132 insertions(+), 5 deletions(-)
 create mode 100644 python/pyspark/sql/tests/connect/test_client.py





[spark] branch branch-3.3 updated: [SPARK-41741][SQL] Encode the string using the UTF_8 charset in ParquetFilters

2023-02-20 Thread yumwang
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 8a490eaf2c4 [SPARK-41741][SQL] Encode the string using the UTF_8 
charset in ParquetFilters
8a490eaf2c4 is described below

commit 8a490eaf2c48de413924560c869ab53a5de6e303
Author: Yuming Wang 
AuthorDate: Mon Feb 20 19:15:30 2023 +0800

[SPARK-41741][SQL] Encode the string using the UTF_8 charset in 
ParquetFilters

This PR encodes the string using the `UTF_8` charset in `ParquetFilters`.

Fixes a data correctness issue when the JVM default charset is not `UTF_8`.

No.

Manual test.

Closes #40090 from wangyum/SPARK-41741.

Authored-by: Yuming Wang 
Signed-off-by: Yuming Wang 
(cherry picked from commit d5fa41efe2b1aa0aa41f558c1bef048b4632cf5c)
Signed-off-by: Yuming Wang 
---
 .../spark/sql/execution/datasources/parquet/ParquetFilters.scala   | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
index e04019fa9a0..210f37d473a 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
@@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.datasources.parquet
 
 import java.lang.{Boolean => JBoolean, Double => JDouble, Float => JFloat, 
Long => JLong}
 import java.math.{BigDecimal => JBigDecimal}
+import java.nio.charset.StandardCharsets.UTF_8
 import java.sql.{Date, Timestamp}
 import java.time.{Duration, Instant, LocalDate, Period}
 import java.util.Locale
@@ -767,7 +768,7 @@ class ParquetFilters(
 Option(prefix).map { v =>
   
FilterApi.userDefined(binaryColumn(nameToParquetField(name).fieldNames),
 new UserDefinedPredicate[Binary] with Serializable {
-  private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
+  private val strToBinary = 
Binary.fromReusedByteArray(v.getBytes(UTF_8))
   private val size = strToBinary.length
 
   override def canDrop(statistics: Statistics[Binary]): Boolean = {





[spark] branch branch-3.4 updated: [SPARK-41741][SQL] Encode the string using the UTF_8 charset in ParquetFilters

2023-02-20 Thread yumwang
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new ee4cee0bda5 [SPARK-41741][SQL] Encode the string using the UTF_8 
charset in ParquetFilters
ee4cee0bda5 is described below

commit ee4cee0bda5cce4aeb371b7d9b35ba0975b615c6
Author: Yuming Wang 
AuthorDate: Mon Feb 20 19:15:30 2023 +0800

[SPARK-41741][SQL] Encode the string using the UTF_8 charset in 
ParquetFilters

This PR encodes the string using the `UTF_8` charset in `ParquetFilters`.

Fixes a data correctness issue when the JVM default charset is not `UTF_8`.

No.

Manual test.

Closes #40090 from wangyum/SPARK-41741.

Authored-by: Yuming Wang 
Signed-off-by: Yuming Wang 
(cherry picked from commit d5fa41efe2b1aa0aa41f558c1bef048b4632cf5c)
Signed-off-by: Yuming Wang 
---
 .../spark/sql/execution/datasources/parquet/ParquetFilters.scala   | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
index c34f2827659..6994e1ba39d 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
@@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.datasources.parquet
 
 import java.lang.{Boolean => JBoolean, Double => JDouble, Float => JFloat, 
Long => JLong}
 import java.math.{BigDecimal => JBigDecimal}
+import java.nio.charset.StandardCharsets.UTF_8
 import java.sql.{Date, Timestamp}
 import java.time.{Duration, Instant, LocalDate, Period}
 import java.util.HashSet
@@ -776,7 +777,7 @@ class ParquetFilters(
 Option(prefix).map { v =>
   
FilterApi.userDefined(binaryColumn(nameToParquetField(name).fieldNames),
 new UserDefinedPredicate[Binary] with Serializable {
-  private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
+  private val strToBinary = 
Binary.fromReusedByteArray(v.getBytes(UTF_8))
   private val size = strToBinary.length
 
   override def canDrop(statistics: Statistics[Binary]): Boolean = {





[spark] branch master updated: [SPARK-41741][SQL] Encode the string using the UTF_8 charset in ParquetFilters

2023-02-20 Thread yumwang
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d5fa41efe2b [SPARK-41741][SQL] Encode the string using the UTF_8 
charset in ParquetFilters
d5fa41efe2b is described below

commit d5fa41efe2b1aa0aa41f558c1bef048b4632cf5c
Author: Yuming Wang 
AuthorDate: Mon Feb 20 19:15:30 2023 +0800

[SPARK-41741][SQL] Encode the string using the UTF_8 charset in 
ParquetFilters

### What changes were proposed in this pull request?

This PR encodes the string using the `UTF_8` charset in `ParquetFilters`.

### Why are the changes needed?

Fixes a data correctness issue when the JVM default charset is not `UTF_8`.
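
To make the failure mode concrete (a standalone sketch, not code from this 
patch): without an explicit charset, `String.getBytes` uses the JVM default, 
which need not match the UTF-8 bytes Parquet stores.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// On a JVM started with e.g. -Dfile.encoding=ISO-8859-1, the two arrays
// differ for any non-ASCII prefix, so a startsWith-style pushdown filter
// built from platform-default bytes can mismatch Parquet's UTF-8 data.
val prefix = "café"
val platformBytes = prefix.getBytes        // JVM default charset
val utf8Bytes     = prefix.getBytes(UTF_8) // always UTF-8, as the fix enforces
println(java.util.Arrays.equals(platformBytes, utf8Bytes))
```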

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #40090 from wangyum/SPARK-41741.

Authored-by: Yuming Wang 
Signed-off-by: Yuming Wang 
---
 .../spark/sql/execution/datasources/parquet/ParquetFilters.scala   | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
index c34f2827659..6994e1ba39d 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
@@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.datasources.parquet
 
 import java.lang.{Boolean => JBoolean, Double => JDouble, Float => JFloat, 
Long => JLong}
 import java.math.{BigDecimal => JBigDecimal}
+import java.nio.charset.StandardCharsets.UTF_8
 import java.sql.{Date, Timestamp}
 import java.time.{Duration, Instant, LocalDate, Period}
 import java.util.HashSet
@@ -776,7 +777,7 @@ class ParquetFilters(
 Option(prefix).map { v =>
   
FilterApi.userDefined(binaryColumn(nameToParquetField(name).fieldNames),
 new UserDefinedPredicate[Binary] with Serializable {
-  private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
+  private val strToBinary = 
Binary.fromReusedByteArray(v.getBytes(UTF_8))
   private val size = strToBinary.length
 
   override def canDrop(statistics: Statistics[Binary]): Boolean = {





[spark-website] branch asf-site updated: Update doc. version on index site.

2023-02-20 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 22f2298935 Update doc. version on index site.
22f2298935 is described below

commit 22f2298935afea3aa578aab43b06d5333cada46f
Author: Bjørn 
AuthorDate: Mon Feb 20 20:08:02 2023 +0900

Update doc. version on index site.



This PR fixes the documentation version shown on the front page (3.3.1 -> 3.3.2):

![Screenshot from 2023-02-19 
22-45-34](https://user-images.githubusercontent.com/47577197/219977098-3533b812-7798-4404-ad11-20b9f714a599.png)

Author: Bjørn 
Author: bjornjorgensen 

Closes #439 from bjornjorgensen/doc.version-homepage.
---
 _layouts/home.html | 2 +-
 site/index.html| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/_layouts/home.html b/_layouts/home.html
index c693101258..2c253901c5 100644
--- a/_layouts/home.html
+++ b/_layouts/home.html
@@ -71,7 +71,7 @@
   Documentation
 
 
-  Latest Release (Spark 3.3.1)
+  Latest Release
   Older Versions and Other 
Resources
   Frequently Asked Questions
 
diff --git a/site/index.html b/site/index.html
index 4722e6c5e0..e1b0b7e416 100644
--- a/site/index.html
+++ b/site/index.html
@@ -67,7 +67,7 @@
   Documentation
 
 
-  Latest Release 
(Spark 3.3.1)
+  Latest 
Release
   Older 
Versions and Other Resources
   Frequently Asked 
Questions
 





[GitHub] [spark-website] HyukjinKwon closed pull request #439: Update doc. version on index site.

2023-02-20 Thread via GitHub


HyukjinKwon closed pull request #439: Update doc. version on index site. 
URL: https://github.com/apache/spark-website/pull/439





[GitHub] [spark-website] HyukjinKwon commented on pull request #439: Update doc. version on index site.

2023-02-20 Thread via GitHub


HyukjinKwon commented on PR #439:
URL: https://github.com/apache/spark-website/pull/439#issuecomment-1436762613

   Merged
   





[GitHub] [spark-website] bjornjorgensen commented on pull request #439: Update doc. version on index site.

2023-02-20 Thread via GitHub


bjornjorgensen commented on PR #439:
URL: https://github.com/apache/spark-website/pull/439#issuecomment-1436709551

   Ok, so from this  `Latest Release (Spark 3.3.2)` to `Latest Release` ?  





[spark] branch branch-3.4 updated: [SPARK-41959][SQL] Improve v1 writes with empty2null

2023-02-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 5d617e347c3 [SPARK-41959][SQL] Improve v1 writes with empty2null
5d617e347c3 is described below

commit 5d617e347c358114b1cba9426dd854e68dcadeef
Author: ulysses-you 
AuthorDate: Mon Feb 20 16:41:09 2023 +0800

[SPARK-41959][SQL] Improve v1 writes with empty2null

### What changes were proposed in this pull request?

Clean up some unnecessary `Empty2Null`-related code.

### Why are the changes needed?

V1Writes already checks idempotency using WriteFiles, so it is unnecessary 
to check whether empty2null already exists.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

pass CI

Closes #39475 from ulysses-you/SPARK-41959.

Authored-by: ulysses-you 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 547737b82dfee7e800930fd91bf2761263f68881)
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/execution/datasources/FileFormatWriter.scala |  9 ++---
 .../org/apache/spark/sql/execution/datasources/V1Writes.scala  | 10 ++
 2 files changed, 4 insertions(+), 15 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
index 2491c9d7754..8321b1fac71 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
@@ -206,13 +206,8 @@ object FileFormatWriter extends Logging {
   partitionColumns: Seq[Attribute],
   sortColumns: Seq[Attribute],
   orderingMatched: Boolean): Set[String] = {
-val hasEmpty2Null = plan.exists(p => 
V1WritesUtils.hasEmptyToNull(p.expressions))
-val empty2NullPlan = if (hasEmpty2Null) {
-  plan
-} else {
-  val projectList = V1WritesUtils.convertEmptyToNull(plan.output, 
partitionColumns)
-  if (projectList.nonEmpty) ProjectExec(projectList, plan) else plan
-}
+val projectList = V1WritesUtils.convertEmptyToNull(plan.output, 
partitionColumns)
+val empty2NullPlan = if (projectList.nonEmpty) ProjectExec(projectList, 
plan) else plan
 
 writeAndCommit(job, description, committer) {
   val (planToExecute, concurrentOutputWriterSpec) = if (orderingMatched) {
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/V1Writes.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/V1Writes.scala
index b17d72b0f72..b1d2588ede6 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/V1Writes.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/V1Writes.scala
@@ -93,13 +93,8 @@ object V1Writes extends Rule[LogicalPlan] with SQLConfHelper 
{
   }
 
   private def prepareQuery(write: V1WriteCommand, query: LogicalPlan): 
LogicalPlan = {
-val hasEmpty2Null = query.exists(p => hasEmptyToNull(p.expressions))
-val empty2NullPlan = if (hasEmpty2Null) {
-  query
-} else {
-  val projectList = convertEmptyToNull(query.output, 
write.partitionColumns)
-  if (projectList.isEmpty) query else Project(projectList, query)
-}
+val projectList = convertEmptyToNull(query.output, write.partitionColumns)
+val empty2NullPlan = if (projectList.isEmpty) query else 
Project(projectList, query)
 assert(empty2NullPlan.output.length == query.output.length)
 val attrMap = AttributeMap(query.output.zip(empty2NullPlan.output))
 
@@ -108,7 +103,6 @@ object V1Writes extends Rule[LogicalPlan] with 
SQLConfHelper {
   case a: Attribute => attrMap.getOrElse(a, a)
 }.asInstanceOf[SortOrder])
 val outputOrdering = query.outputOrdering
-// Check if the ordering is already matched to ensure the idempotency of 
the rule.
 val orderingMatched = isOrderingMatched(requiredOrdering, outputOrdering)
 if (orderingMatched) {
   empty2NullPlan





[spark] branch master updated: [SPARK-41959][SQL] Improve v1 writes with empty2null

2023-02-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 547737b82df [SPARK-41959][SQL] Improve v1 writes with empty2null
547737b82df is described below

commit 547737b82dfee7e800930fd91bf2761263f68881
Author: ulysses-you 
AuthorDate: Mon Feb 20 16:41:09 2023 +0800

[SPARK-41959][SQL] Improve v1 writes with empty2null

### What changes were proposed in this pull request?

Clean up some unnecessary `Empty2Null`-related code.

### Why are the changes needed?

V1Writes already checks idempotency using WriteFiles, so it is unnecessary 
to check whether empty2null already exists.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

pass CI

Closes #39475 from ulysses-you/SPARK-41959.

Authored-by: ulysses-you 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/execution/datasources/FileFormatWriter.scala |  9 ++---
 .../org/apache/spark/sql/execution/datasources/V1Writes.scala  | 10 ++
 2 files changed, 4 insertions(+), 15 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
index 2491c9d7754..8321b1fac71 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
@@ -206,13 +206,8 @@ object FileFormatWriter extends Logging {
   partitionColumns: Seq[Attribute],
   sortColumns: Seq[Attribute],
   orderingMatched: Boolean): Set[String] = {
-val hasEmpty2Null = plan.exists(p => 
V1WritesUtils.hasEmptyToNull(p.expressions))
-val empty2NullPlan = if (hasEmpty2Null) {
-  plan
-} else {
-  val projectList = V1WritesUtils.convertEmptyToNull(plan.output, 
partitionColumns)
-  if (projectList.nonEmpty) ProjectExec(projectList, plan) else plan
-}
+val projectList = V1WritesUtils.convertEmptyToNull(plan.output, 
partitionColumns)
+val empty2NullPlan = if (projectList.nonEmpty) ProjectExec(projectList, 
plan) else plan
 
 writeAndCommit(job, description, committer) {
   val (planToExecute, concurrentOutputWriterSpec) = if (orderingMatched) {
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/V1Writes.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/V1Writes.scala
index b17d72b0f72..b1d2588ede6 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/V1Writes.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/V1Writes.scala
@@ -93,13 +93,8 @@ object V1Writes extends Rule[LogicalPlan] with SQLConfHelper 
{
   }
 
   private def prepareQuery(write: V1WriteCommand, query: LogicalPlan): 
LogicalPlan = {
-val hasEmpty2Null = query.exists(p => hasEmptyToNull(p.expressions))
-val empty2NullPlan = if (hasEmpty2Null) {
-  query
-} else {
-  val projectList = convertEmptyToNull(query.output, 
write.partitionColumns)
-  if (projectList.isEmpty) query else Project(projectList, query)
-}
+val projectList = convertEmptyToNull(query.output, write.partitionColumns)
+val empty2NullPlan = if (projectList.isEmpty) query else 
Project(projectList, query)
 assert(empty2NullPlan.output.length == query.output.length)
 val attrMap = AttributeMap(query.output.zip(empty2NullPlan.output))
 
@@ -108,7 +103,6 @@ object V1Writes extends Rule[LogicalPlan] with 
SQLConfHelper {
   case a: Attribute => attrMap.getOrElse(a, a)
 }.asInstanceOf[SortOrder])
 val outputOrdering = query.outputOrdering
-// Check if the ordering is already matched to ensure the idempotency of 
the rule.
 val orderingMatched = isOrderingMatched(requiredOrdering, outputOrdering)
 if (orderingMatched) {
   empty2NullPlan





[spark] branch branch-3.4 updated: [SPARK-42398][SQL] Refine default column value DS v2 interface

2023-02-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 0c24adfda59 [SPARK-42398][SQL] Refine default column value DS v2 
interface
0c24adfda59 is described below

commit 0c24adfda5945f78aa19539c62d21a7efc265719
Author: Wenchen Fan 
AuthorDate: Mon Feb 20 16:30:50 2023 +0800

[SPARK-42398][SQL] Refine default column value DS v2 interface

### What changes were proposed in this pull request?

The current default value DS V2 API is a bit inconsistent. The 
`createTable` API only takes `StructType`, so implementations must know the 
special metadata key of the default value to access it. The `TableChange` API 
has the default value as an individual field.

This PR adds a new `Column` interface, which holds both the current default 
(as a SQL string) and the exist default (as a v2 literal). The `createTable` 
API now takes `Column`. This avoids the need for a special metadata key and 
is also more extensible when adding more special columns such as generated 
columns. It is also type-safe and makes sure the exist default is a literal. 
The implementation is free to decide how to encode and store default values. 
Note: backward compatibility is taken care of.
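
As a rough mental model of the shape being described (a minimal sketch with 
made-up names, not the actual `Column`/`ColumnDefaultValue` interfaces added 
by this PR):

```scala
// Illustrative model only: hypothetical types sketching a column definition
// that carries both defaults, instead of hiding them in StructType metadata.
final case class DefaultValueSketch(
  currentDefaultSql: String,  // what newly inserted rows get, kept as SQL text
  existDefaultLiteral: Any    // what pre-existing rows are assumed to hold,
                              // restricted to a literal value
)

final case class ColumnSketch(
  name: String,
  dataType: String,           // stand-in for a real DataType
  nullable: Boolean = true,
  defaultValue: Option[DefaultValueSketch] = None
)

// A createTable-style API can then accept columns directly rather than a
// StructType plus a "magic" metadata key for the default value.
val columns = Seq(
  ColumnSketch("id", "bigint", nullable = false),
  ColumnSketch("status", "string",
    defaultValue = Some(DefaultValueSketch("'active'", "active")))
)
```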

### Why are the changes needed?

better DS v2 API for default value

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #40049 from cloud-fan/table2.

Lead-authored-by: Wenchen Fan 
Co-authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 70a098c83da4cff2bdc8d15a5a8b513a32564dbc)
Signed-off-by: Wenchen Fan 
---
 .../sql/connect/ProtoToParsedPlanTestSuite.scala   |   8 +-
 .../spark/sql/jdbc/v2/V2JDBCNamespaceTest.scala|   3 +-
 .../apache/spark/sql/connector/catalog/Column.java |  90 ++
 .../sql/connector/catalog/ColumnDefaultValue.java  |  84 +
 .../sql/connector/catalog/StagingTableCatalog.java |  67 --
 .../apache/spark/sql/connector/catalog/Table.java  |  11 +++
 .../spark/sql/connector/catalog/TableCatalog.java  |  23 -
 .../spark/sql/connector/catalog/TableChange.java   |  20 ++--
 .../spark/sql/catalyst/analysis/Analyzer.scala |   4 +-
 .../sql/catalyst/analysis/v2ResolutionPlans.scala  |   2 +-
 .../sql/catalyst/plans/logical/statements.scala|  13 +++
 .../plans/logical/v2AlterTableCommands.scala   |   4 +-
 .../catalyst/util/ResolveDefaultColumnsUtil.scala  |  30 --
 .../sql/connector/catalog/CatalogV2Implicits.scala |   7 ++
 .../sql/connector/catalog/CatalogV2Util.scala  |  89 +-
 .../connector/write/RowLevelOperationTable.scala   |   3 +-
 .../datasources/v2/DataSourceV2Relation.scala  |   3 +-
 .../spark/sql/internal/connector/ColumnImpl.scala  |  30 ++
 .../internal/connector/SimpleTableProvider.scala   |   3 +-
 .../spark/sql/connector/catalog/CatalogSuite.scala | 103 +++--
 .../sql/connector/catalog/CatalogV2UtilSuite.scala |   4 +-
 .../connector/catalog/InMemoryTableCatalog.scala   |  10 ++
 .../SupportsAtomicPartitionManagementSuite.scala   |   4 +-
 .../catalog/SupportsPartitionManagementSuite.scala |   7 +-
 .../spark/sql/execution/datasources/rules.scala|   1 -
 .../execution/datasources/v2/CreateTableExec.scala |   7 +-
 .../datasources/v2/DataSourceV2Strategy.scala  |   8 +-
 .../datasources/v2/FileDataSourceV2.scala  |   3 +-
 .../datasources/v2/ReplaceTableExec.scala  |  13 ++-
 .../datasources/v2/ShowCreateTableExec.scala   |   3 +-
 .../datasources/v2/V2SessionCatalog.scala  |  11 ++-
 .../datasources/v2/WriteToDataSourceV2Exec.scala   |  22 +++--
 .../sources/TextSocketSourceProvider.scala |   5 +-
 .../spark/sql/streaming/DataStreamReader.scala |   3 +-
 .../sql/connector/DataSourceV2DataFrameSuite.scala |   3 +-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala |  20 ++--
 .../sql/connector/DeleteFromTableSuiteBase.scala   |   4 +-
 .../sql/connector/TestV2SessionCatalogBase.scala   |  11 ++-
 .../WriteDistributionAndOrderingSuite.scala|   2 +-
 .../execution/command/PlanResolutionSuite.scala|  56 ++-
 .../datasources/InMemoryTableMetricSuite.scala |   3 +-
 .../datasources/v2/V2SessionCatalogSuite.scala |  96 +--
 .../org/apache/spark/sql/hive/InsertSuite.scala|   6 +-
 43 files changed, 670 insertions(+), 229 deletions(-)

diff --git 
a/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/ProtoToParsedPlanTestSuite.scala
 
b/connector/connect/server/src/test/scala/org/apache/spark/sql/connect/ProtoToParsedPlanTestSuite.scala
index 18f656748ac..841017ae6c0 100644
--- 
a/connector/connect/server/src/test/scala/org/apache/spark/sql/co

[spark] branch master updated (5fc44dabe50 -> 70a098c83da)

2023-02-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 5fc44dabe50 [SPARK-42488][BUILD] Upgrade commons-crypto to 1.2.0
 add 70a098c83da [SPARK-42398][SQL] Refine default column value DS v2 
interface

No new revisions were added by this update.

Summary of changes:
 .../sql/connect/ProtoToParsedPlanTestSuite.scala   |   8 +-
 .../spark/sql/jdbc/v2/V2JDBCNamespaceTest.scala|   3 +-
 .../apache/spark/sql/connector/catalog/Column.java |  90 ++
 .../sql/connector/catalog/ColumnDefaultValue.java  |  84 +
 .../sql/connector/catalog/StagingTableCatalog.java |  67 --
 .../apache/spark/sql/connector/catalog/Table.java  |  11 +++
 .../spark/sql/connector/catalog/TableCatalog.java  |  23 -
 .../spark/sql/connector/catalog/TableChange.java   |  20 ++--
 .../spark/sql/catalyst/analysis/Analyzer.scala |   4 +-
 .../sql/catalyst/analysis/v2ResolutionPlans.scala  |   2 +-
 .../sql/catalyst/plans/logical/statements.scala|  13 +++
 .../plans/logical/v2AlterTableCommands.scala   |   4 +-
 .../catalyst/util/ResolveDefaultColumnsUtil.scala  |  30 --
 .../sql/connector/catalog/CatalogV2Implicits.scala |   7 ++
 .../sql/connector/catalog/CatalogV2Util.scala  |  89 +-
 .../connector/write/RowLevelOperationTable.scala   |   3 +-
 .../datasources/v2/DataSourceV2Relation.scala  |   3 +-
 ...pressionWithToString.scala => ColumnImpl.scala} |  15 ++-
 .../internal/connector/SimpleTableProvider.scala   |   3 +-
 .../spark/sql/connector/catalog/CatalogSuite.scala | 103 +++--
 .../sql/connector/catalog/CatalogV2UtilSuite.scala |   4 +-
 .../connector/catalog/InMemoryTableCatalog.scala   |  10 ++
 .../SupportsAtomicPartitionManagementSuite.scala   |   4 +-
 .../catalog/SupportsPartitionManagementSuite.scala |   7 +-
 .../spark/sql/execution/datasources/rules.scala|   1 -
 .../execution/datasources/v2/CreateTableExec.scala |   7 +-
 .../datasources/v2/DataSourceV2Strategy.scala  |   8 +-
 .../datasources/v2/FileDataSourceV2.scala  |   3 +-
 .../datasources/v2/ReplaceTableExec.scala  |  13 ++-
 .../datasources/v2/ShowCreateTableExec.scala   |   3 +-
 .../datasources/v2/V2SessionCatalog.scala  |  11 ++-
 .../datasources/v2/WriteToDataSourceV2Exec.scala   |  22 +++--
 .../sources/TextSocketSourceProvider.scala |   5 +-
 .../spark/sql/streaming/DataStreamReader.scala |   3 +-
 .../sql/connector/DataSourceV2DataFrameSuite.scala |   3 +-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala |  20 ++--
 .../sql/connector/DeleteFromTableSuiteBase.scala   |   4 +-
 .../sql/connector/TestV2SessionCatalogBase.scala   |  11 ++-
 .../WriteDistributionAndOrderingSuite.scala|   2 +-
 .../execution/command/PlanResolutionSuite.scala|  56 ++-
 .../datasources/InMemoryTableMetricSuite.scala |   3 +-
 .../datasources/v2/V2SessionCatalogSuite.scala |  96 +--
 .../org/apache/spark/sql/hive/InsertSuite.scala|   6 +-
 43 files changed, 650 insertions(+), 234 deletions(-)
 create mode 100644 
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Column.java
 create mode 100644 
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ColumnDefaultValue.java
 copy 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/connector/{ExpressionWithToString.scala
 => ColumnImpl.scala} (69%)

