[spark] branch master updated: [SPARK-39616][BUILD][ML][FOLLOWUP] Fix flaky doctests

2022-07-05 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6f37986cc15 [SPARK-39616][BUILD][ML][FOLLOWUP] Fix flaky doctests
6f37986cc15 is described below

commit 6f37986cc155cae71957c314a7e2c7c848d94ff2
Author: Ruifeng Zheng 
AuthorDate: Tue Jul 5 22:42:35 2022 -0700

[SPARK-39616][BUILD][ML][FOLLOWUP] Fix flaky doctests

### What changes were proposed in this pull request?
Skip flaky doctests

### Why are the changes needed?
```
File "/__w/spark/spark/python/pyspark/mllib/linalg/distributed.py", line 
859, in __main__.IndexedRowMatrix.computeSVD
Failed example:
svd_model.V # doctest: +ELLIPSIS
Expected:
DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, 0.0], 0)
Got:
DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, -0.0], 0)
**
File "/__w/spark/spark/python/pyspark/mllib/linalg/distributed.py", line 
426, in __main__.RowMatrix.computeSVD
Failed example:
svd_model.V # doctest: +ELLIPSIS
Expected:
DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, 0.0], 0)
Got:
DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, -0.0], 0)
**
   1 of   6 in __main__.IndexedRowMatrix.computeSVD
   1 of   6 in __main__.RowMatrix.computeSVD
***Test Failed*** 2 failures.
Had test failures in pyspark.mllib.linalg.distributed with python3.9; see logs.
```

https://github.com/apache/spark/pull/37002 occasionally causes the above tests to output `-0.0` instead of `0.0`; both values are acceptable, so the expected output now tolerates either sign (a standalone illustration follows).
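
For illustration only (not part of this patch), a minimal doctest sketch of how a mid-line `...` with ELLIPSIS accepts both signs of zero; the function name and output format here are made up:

```python
import doctest

def sign_flipped_zero():
    """Mimics the flaky SVD entry: numerically zero, but the sign can flip.

    With ELLIPSIS enabled, the mid-line `...` matches either an empty string
    or a minus sign, so both `0.0` and `-0.0` pass:

    >>> print("V[2,1] = %.1f" % sign_flipped_zero())  # doctest: +ELLIPSIS
    V[2,1] = ...0.0
    """
    return -0.0  # IEEE-754 negative zero: compares equal to 0.0, prints as -0.0

if __name__ == "__main__":
    # ELLIPSIS is enabled for the whole run here; the inline directive above
    # also enables it per-example.
    failures, _ = doctest.testmod(optionflags=doctest.ELLIPSIS)
    assert failures == 0
```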

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
updated doctests

Closes #37097 from zhengruifeng/build_breeze_followup.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 python/pyspark/mllib/linalg/distributed.py | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/python/pyspark/mllib/linalg/distributed.py b/python/pyspark/mllib/linalg/distributed.py
index 40a247da1e6..1a2e38f81e7 100644
--- a/python/pyspark/mllib/linalg/distributed.py
+++ b/python/pyspark/mllib/linalg/distributed.py
@@ -423,8 +423,8 @@ class RowMatrix(DistributedMatrix):
 [DenseVector([-0.7071, 0.7071]), DenseVector([-0.7071, -0.7071])]
 >>> svd_model.s
 DenseVector([3.4641, 3.1623])
->>> svd_model.V # doctest: +ELLIPSIS
-DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, 0.0], 0)
+>>> svd_model.V
+DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, ...0.0], 0)
 """
 j_model = self._java_matrix_wrapper.call("computeSVD", int(k), 
bool(computeU), float(rCond))
 return SingularValueDecomposition(j_model)
@@ -857,8 +857,8 @@ class IndexedRowMatrix(DistributedMatrix):
 IndexedRow(1, [-0.707106781187,-0.707106781187])]
 >>> svd_model.s
 DenseVector([3.4641, 3.1623])
->>> svd_model.V # doctest: +ELLIPSIS
-DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, 0.0], 0)
+>>> svd_model.V
+DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, ...0.0], 0)
 """
 j_model = self._java_matrix_wrapper.call("computeSVD", int(k), 
bool(computeU), float(rCond))
 return SingularValueDecomposition(j_model)





[spark] branch master updated (8adc8dd84b4 -> c1d1ec5f3bd)

2022-07-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 8adc8dd84b4 [SPARK-39687][PYTHON][DOCS] Make sure new catalog methods 
listed in API reference
 add c1d1ec5f3bd [SPARK-39522][INFRA] Add Apache Spark infra GA image cache

No new revisions were added by this update.

Summary of changes:
 .github/workflows/build_infra_images_cache.yml | 63 ++
 dev/infra/Dockerfile   | 54 ++
 2 files changed, 117 insertions(+)
 create mode 100644 .github/workflows/build_infra_images_cache.yml
 create mode 100644 dev/infra/Dockerfile





[spark] branch master updated: [SPARK-39687][PYTHON][DOCS] Make sure new catalog methods listed in API reference

2022-07-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8adc8dd84b4 [SPARK-39687][PYTHON][DOCS] Make sure new catalog methods 
listed in API reference
8adc8dd84b4 is described below

commit 8adc8dd84b4d3567efa71ad0b924bab580af8999
Author: Ruifeng Zheng 
AuthorDate: Wed Jul 6 08:25:38 2022 +0900

[SPARK-39687][PYTHON][DOCS] Make sure new catalog methods listed in API 
reference

### What changes were proposed in this pull request?
1, add new methods to `catalog.rst`;
2, follow sphinx syntax: `AnalysisException` -> ``:class:`AnalysisException`` (see the sketch below)
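
For illustration, a hypothetical docstring (not from this patch) contrasting the two spellings; `AnalysisException` here is a local stand-in for the real pyspark class:

```python
class AnalysisException(Exception):
    """Local stand-in for pyspark.sql.utils.AnalysisException (illustration only)."""

def get_database(db_name: str):
    """Get the database with the specified name.

    Plain text, which Sphinx renders literally:
        This throws an AnalysisException when the database cannot be found.

    Cross-referenced, which Sphinx renders as a link to the class's API page:
        This throws an :class:`AnalysisException` when the database cannot be found.
    """
    raise AnalysisException("Database '%s' not found" % db_name)
```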

### Why are the changes needed?
Make sure new catalog methods listed in API reference

### Does this PR introduce _any_ user-facing change?
No, docs only

### How was this patch tested?
Manual doc build and check, such as:


![image](https://user-images.githubusercontent.com/7322292/177324325-63280b21-9e68-4780-a095-84bf9fd404ac.png)


![image](https://user-images.githubusercontent.com/7322292/177323978-59c527f5-108f-4559-821a-fa9480d76c1f.png)


![image](https://user-images.githubusercontent.com/7322292/177324089-669fdb2f-8696-4e50-b725-0af9f8516365.png)

Closes #37092 from zhengruifeng/py_fix_catalog_doc.

Authored-by: Ruifeng Zheng 
Signed-off-by: Hyukjin Kwon 
---
 python/docs/source/reference/pyspark.sql/catalog.rst | 6 ++++++
 python/pyspark/sql/catalog.py                        | 6 +++---
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/python/docs/source/reference/pyspark.sql/catalog.rst b/python/docs/source/reference/pyspark.sql/catalog.rst
index 8267e06410e..742af104dfb 100644
--- a/python/docs/source/reference/pyspark.sql/catalog.rst
+++ b/python/docs/source/reference/pyspark.sql/catalog.rst
@@ -29,12 +29,17 @@ Catalog
 Catalog.clearCache
 Catalog.createExternalTable
 Catalog.createTable
+Catalog.currentCatalog
 Catalog.currentDatabase
 Catalog.databaseExists
 Catalog.dropGlobalTempView
 Catalog.dropTempView
 Catalog.functionExists
+Catalog.getDatabase
+Catalog.getFunction
+Catalog.getTable
 Catalog.isCached
+Catalog.listCatalogs
 Catalog.listColumns
 Catalog.listDatabases
 Catalog.listFunctions
@@ -43,6 +48,7 @@ Catalog
 Catalog.refreshByPath
 Catalog.refreshTable
 Catalog.registerFunction
+Catalog.setCurrentCatalog
 Catalog.setCurrentDatabase
 Catalog.tableExists
 Catalog.uncacheTable
diff --git a/python/pyspark/sql/catalog.py b/python/pyspark/sql/catalog.py
index 7efaf14eb82..548750d7120 100644
--- a/python/pyspark/sql/catalog.py
+++ b/python/pyspark/sql/catalog.py
@@ -152,7 +152,7 @@ class Catalog:
 
 def getDatabase(self, dbName: str) -> Database:
 """Get the database with the specified name.
-This throws an AnalysisException when the database cannot be found.
+This throws an :class:`AnalysisException` when the database cannot be found.
 
 .. versionadded:: 3.4.0
 
@@ -244,7 +244,7 @@ class Catalog:
 
 def getTable(self, tableName: str) -> Table:
 """Get the table or view with the specified name. This table can be a 
temporary view or a
-table/view. This throws an AnalysisException when no Table can be 
found.
+table/view. This throws an :class:`AnalysisException` when no Table 
can be found.
 
 .. versionadded:: 3.4.0
 
@@ -363,7 +363,7 @@ class Catalog:
 
 def getFunction(self, functionName: str) -> Function:
 """Get the function with the specified name. This function can be a 
temporary function or a
-function. This throws an AnalysisException when the function cannot be 
found.
+function. This throws an :class:`AnalysisException` when the function 
cannot be found.
 
 .. versionadded:: 3.4.0
 





[spark] branch master updated (79f133b7bbc -> 5cccefaf3a9)

2022-07-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 79f133b7bbc [SPARK-39688][K8S] `getReusablePVCs` should handle 
accounts with no PVC permission
 add 5cccefaf3a9 [SPARK-39686][INFRA] Disable scheduled builds that did not 
pass even once

No new revisions were added by this update.

Summary of changes:
 .github/workflows/build_branch32.yml | 8 
 .github/workflows/build_hadoop2.yml  | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)





[spark] branch master updated: [SPARK-39688][K8S] `getReusablePVCs` should handle accounts with no PVC permission

2022-07-05 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 79f133b7bbc [SPARK-39688][K8S] `getReusablePVCs` should handle 
accounts with no PVC permission
79f133b7bbc is described below

commit 79f133b7bbc1d9aa6a20dd8a34ec120902f96155
Author: Dongjoon Hyun 
AuthorDate: Tue Jul 5 13:26:43 2022 -0700

[SPARK-39688][K8S] `getReusablePVCs` should handle accounts with no PVC 
permission

### What changes were proposed in this pull request?

This PR aims to handle `KubernetesClientException` in the `getReusablePVCs` method so that accounts with no PVC permissions, including `list`, are handled gracefully.
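
The actual fix is in the Scala diff below; as a rough analogue only, here is the same defensive pattern sketched with the official `kubernetes` Python client (the function and its names are hypothetical, not part of this patch):

```python
from kubernetes import client, config
from kubernetes.client.rest import ApiException

def get_reusable_pvcs(app_id, pvcs_in_use, namespace="default"):
    """Return PVCs created for this app that are not currently in use.

    Mirrors the Scala fix: if the account cannot list PVCs, return an
    empty list instead of failing the whole allocator.
    """
    config.load_incluster_config()  # assumes this runs inside the cluster
    api = client.CoreV1Api()
    try:
        created = api.list_namespaced_persistent_volume_claim(
            namespace, label_selector="spark-app-selector=%s" % app_id
        ).items
    except ApiException:  # e.g. 403 Forbidden when listing is not allowed
        print("Cannot list PVC resources. Please check account permissions.")
        return []
    reusable = [p for p in created if p.metadata.name not in pvcs_in_use]
    print("Found %d reusable PVCs from %d PVCs" % (len(reusable), len(created)))
    return reusable
```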

### Why are the changes needed?

To prevent a regression in Apache Spark 3.4.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #37095 from dongjoon-hyun/SPARK-39688.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../cluster/k8s/ExecutorPodsAllocator.scala| 28 +-
 .../cluster/k8s/ExecutorPodsAllocatorSuite.scala   | 10 +++-
 2 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala
index 3519efd3fcb..9bdc30e4466 100644
--- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala
+++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala
@@ -25,7 +25,7 @@ import scala.collection.mutable
 import scala.util.control.NonFatal
 
 import io.fabric8.kubernetes.api.model.{HasMetadata, PersistentVolumeClaim, Pod, PodBuilder}
-import io.fabric8.kubernetes.client.KubernetesClient
+import io.fabric8.kubernetes.client.{KubernetesClient, KubernetesClientException}
 
 import org.apache.spark.{SecurityManager, SparkConf, SparkException}
 import org.apache.spark.deploy.k8s.Config._
@@ -360,16 +360,22 @@ class ExecutorPodsAllocator(
  private def getReusablePVCs(applicationId: String, pvcsInUse: Seq[String]) = {
 if (conf.get(KUBERNETES_DRIVER_OWN_PVC) && conf.get(KUBERNETES_DRIVER_REUSE_PVC) &&
 driverPod.nonEmpty) {
-  val createdPVCs = kubernetesClient
-.persistentVolumeClaims
-.withLabel("spark-app-selector", applicationId)
-.list()
-.getItems
-.asScala
-
-  val reusablePVCs = createdPVCs.filterNot(pvc => pvcsInUse.contains(pvc.getMetadata.getName))
-  logInfo(s"Found ${reusablePVCs.size} reusable PVCs from ${createdPVCs.size} PVCs")
-  reusablePVCs
+  try {
+val createdPVCs = kubernetesClient
+  .persistentVolumeClaims
+  .withLabel("spark-app-selector", applicationId)
+  .list()
+  .getItems
+  .asScala
+
+val reusablePVCs = createdPVCs.filterNot(pvc => pvcsInUse.contains(pvc.getMetadata.getName))
+logInfo(s"Found ${reusablePVCs.size} reusable PVCs from ${createdPVCs.size} PVCs")
+reusablePVCs
+  } catch {
+case _: KubernetesClientException =>
+  logInfo("Cannot list PVC resources. Please check account 
permissions.")
+  mutable.Buffer.empty[PersistentVolumeClaim]
+  }
 } else {
   mutable.Buffer.empty[PersistentVolumeClaim]
 }
diff --git a/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocatorSuite.scala b/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocatorSuite.scala
index 87bd8ef3d9d..7ce0b57d1e9 100644
--- a/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocatorSuite.scala
+++ b/resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocatorSuite.scala
@@ -20,9 +20,10 @@ import java.time.Instant
 import java.util.concurrent.atomic.AtomicInteger
 
 import scala.collection.JavaConverters._
+import scala.collection.mutable
 
 import io.fabric8.kubernetes.api.model._
-import io.fabric8.kubernetes.client.KubernetesClient
+import io.fabric8.kubernetes.client.{KubernetesClient, KubernetesClientException}
 import io.fabric8.kubernetes.client.dsl.PodResource
 import org.mockito.{Mock, MockitoAnnotations}
 import org.mockito.ArgumentMatchers.{any, eq => meq}
@@ -762,6 +763,13 @@ class ExecutorPodsAllocatorSuite extends SparkFunSuite with BeforeAndAfter {
   " namespace default"))
   }
 
+  test("SPARK-39688: getReusablePVCs should 

[spark] branch branch-3.3 updated (2edd344392a -> f9e3668dbb1)

2022-07-05 Thread huaxingao
This is an automated email from the ASF dual-hosted git repository.

huaxingao pushed a change to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


from 2edd344392a [SPARK-39611][PYTHON][PS] Fix wrong aliases in 
__array_ufunc__
 add f9e3668dbb1 [SPARK-39656][SQL][3.3] Fix wrong namespace in 
DescribeNamespaceExec

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/execution/datasources/v2/DescribeNamespaceExec.scala | 3 ++-
 .../apache/spark/sql/execution/command/v2/DescribeNamespaceSuite.scala | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)





[spark] branch branch-3.2 updated (3d084fe3217 -> 1c0bd4c15a2)

2022-07-05 Thread huaxingao
This is an automated email from the ASF dual-hosted git repository.

huaxingao pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


from 3d084fe3217 [SPARK-39677][SQL][DOCS][3.2] Fix args formatting of the 
regexp and like functions
 add 1c0bd4c15a2 [SPARK-39656][SQL][3.2] Fix wrong namespace in 
DescribeNamespaceExec

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/execution/datasources/v2/DescribeNamespaceExec.scala  | 3 ++-
 .../scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala | 6 +++---
 2 files changed, 5 insertions(+), 4 deletions(-)





[spark] branch master updated (ecbfff0efe2 -> 40e00b883f8)

2022-07-05 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from ecbfff0efe2 [SPARK-39610][INFRA] Add GITHUB_WORKSPACE to git trust 
safe.directory for container based job
 add 40e00b883f8 [SPARK-39616][BUILD][ML] Upgrade Breeze to 2.0

No new revisions were added by this update.

Summary of changes:
 R/run-tests.sh |  4 ++--
 dev/deps/spark-deps-hadoop-2-hive-2.3  |  8 +++-
 dev/deps/spark-deps-hadoop-3-hive-2.3  |  8 +++-
 .../regression/GeneralizedLinearRegression.scala   |  1 +
 .../spark/ml/regression/LinearRegression.scala |  1 +
 .../mllib/classification/NaiveBayesSuite.scala |  1 +
 pom.xml| 22 +-
 python/pyspark/mllib/linalg/distributed.py |  4 ++--
 python/run-tests.py|  3 ++-
 9 files changed, 16 insertions(+), 36 deletions(-)





[spark] branch branch-3.2 updated (6ae97e26bda -> 3d084fe3217)

2022-07-05 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


from 6ae97e26bda [SPARK-39611][PYTHON][PS] Fix wrong aliases in 
__array_ufunc__
 add 3d084fe3217 [SPARK-39677][SQL][DOCS][3.2] Fix args formatting of the 
regexp and like functions

No new revisions were added by this update.

Summary of changes:
 .../catalyst/expressions/regexpExpressions.scala   | 36 --
 1 file changed, 12 insertions(+), 24 deletions(-)





[spark] branch branch-3.1 updated (cc8ab362798 -> 07f5926a6c3)

2022-07-05 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


from cc8ab362798 [SPARK-39656][SQL][3.1] Fix wrong namespace in 
DescribeNamespaceExec
 add 07f5926a6c3 [SPARK-39677][SQL][DOCS][3.1] Fix args formatting of the 
regexp and like functions

No new revisions were added by this update.

Summary of changes:
 .../catalyst/expressions/regexpExpressions.scala   | 36 --
 1 file changed, 12 insertions(+), 24 deletions(-)





[spark] branch master updated: [SPARK-39610][INFRA] Add GITHUB_WORKSPACE to git trust safe.directory for container based job

2022-07-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new ecbfff0efe2 [SPARK-39610][INFRA] Add GITHUB_WORKSPACE to git trust 
safe.directory for container based job
ecbfff0efe2 is described below

commit ecbfff0efe2ba92068554e1434d90ec6dbec8248
Author: Yikun Jiang 
AuthorDate: Tue Jul 5 20:54:07 2022 +0900

[SPARK-39610][INFRA] Add GITHUB_WORKSPACE to git trust safe.directory for 
container based job

### What changes were proposed in this pull request?
This patch adds GITHUB_WORKSPACE to git's trusted safe.directory list for container-based jobs.

There are 3 container-based jobs in Spark infra:
- sparkr
- lint
- pyspark

### Why are the changes needed?

Git >= 2.35.2 (such as in the latest dev docker image in my case) fixes [CVE-2022-24765](https://github.blog/2022-04-12-git-security-vulnerability-announced/#cve-2022-24765), and the fix has been backported to the [latest ubuntu git](https://github.com/actions/checkout/issues/760#issuecomment-1099355820).

The `GITHUB_WORKSPACE` directory belongs to the action user (on the host), but the container user is `root` (in the container; root is required in our spark jobs), which causes errors like:
```
fatal: unsafe repository ('/__w/spark/spark' is owned by someone else)
To add an exception for this directory, call:
git config --global --add safe.directory /__w/spark/spark
fatal: unsafe repository ('/__w/spark/spark' is owned by someone else)
To add an exception for this directory, call:
git config --global --add safe.directory /__w/spark/spark
```

The solution is from `actions/checkout` (https://github.com/actions/checkout/issues/760): add `GITHUB_WORKSPACE` to git's trusted safe.directory so that the container user can check out successfully.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed, including with the latest ubuntu hosted runner; here is a simple e2e test for the pyspark job: https://github.com/apache/spark/pull/37005

Closes #37079 from Yikun/SPARK-39610.

Authored-by: Yikun Jiang 
Signed-off-by: Hyukjin Kwon 
---
 .github/workflows/build_and_test.yml | 9 +
 1 file changed, 9 insertions(+)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index ef2235c3749..ca375be2777 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -289,6 +289,9 @@ jobs:
 fetch-depth: 0
 repository: apache/spark
 ref: ${{ inputs.branch }}
+- name: Add GITHUB_WORKSPACE to git trust safe.directory
+  run: |
+git config --global --add safe.directory ${GITHUB_WORKSPACE}
 - name: Sync the current branch with the latest in Apache Spark
   if: github.repository != 'apache/spark'
   run: |
@@ -374,6 +377,9 @@ jobs:
 fetch-depth: 0
 repository: apache/spark
 ref: ${{ inputs.branch }}
+- name: Add GITHUB_WORKSPACE to git trust safe.directory
+  run: |
+git config --global --add safe.directory ${GITHUB_WORKSPACE}
 - name: Sync the current branch with the latest in Apache Spark
   if: github.repository != 'apache/spark'
   run: |
@@ -439,6 +445,9 @@ jobs:
 fetch-depth: 0
 repository: apache/spark
 ref: ${{ inputs.branch }}
+- name: Add GITHUB_WORKSPACE to git trust safe.directory
+  run: |
+git config --global --add safe.directory ${GITHUB_WORKSPACE}
 - name: Sync the current branch with the latest in Apache Spark
   if: github.repository != 'apache/spark'
   run: |





[spark] branch master updated: [SPARK-39579][PYTHON][FOLLOWUP] fix functionExists(functionName, dbName) when dbName is not None

2022-07-05 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1100d75f53c [SPARK-39579][PYTHON][FOLLOWUP] fix 
functionExists(functionName, dbName) when dbName is not None
1100d75f53c is described below

commit 1100d75f53c16f44dd414b8a0be477760420507d
Author: Ruifeng Zheng 
AuthorDate: Tue Jul 5 19:53:13 2022 +0800

[SPARK-39579][PYTHON][FOLLOWUP] fix functionExists(functionName, dbName) 
when dbName is not None

### What changes were proposed in this pull request?
fix functionExists(functionName, dbName)

### Why are the changes needed?
https://github.com/apache/spark/pull/36977 introduced a bug in `functionExists(functionName, dbName)`: when dbName is not None, it should call `self._jcatalog.functionExists(dbName, functionName)` (see the sketch below).
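
A minimal sketch of the corrected dispatch (simplified; `_jcatalog` stands in for the JVM-side catalog proxy, and only the relevant branch is shown):

```python
import warnings
from typing import Optional

class Catalog:
    def __init__(self, jcatalog):
        self._jcatalog = jcatalog  # JVM-side catalog proxy in real pyspark

    def functionExists(self, functionName: str, dbName: Optional[str] = None) -> bool:
        if dbName is None:
            # The 3-layer-namespace path: functionName may itself be qualified.
            return self._jcatalog.functionExists(functionName)
        warnings.warn(
            "`dbName` is deprecated. Use functionExists(`dbName.tableName`) instead.",
            FutureWarning,
        )
        # The bug: this branch used to pass self.currentDatabase(), ignoring dbName.
        return self._jcatalog.functionExists(dbName, functionName)
```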

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing testsuite

Closes #37088 from zhengruifeng/py_3l_fix_functionExists.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/catalog.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/sql/catalog.py b/python/pyspark/sql/catalog.py
index 42c040c284b..7efaf14eb82 100644
--- a/python/pyspark/sql/catalog.py
+++ b/python/pyspark/sql/catalog.py
@@ -359,7 +359,7 @@ class Catalog:
 "a future version. Use functionExists(`dbName.tableName`) 
instead.",
 FutureWarning,
 )
-return self._jcatalog.functionExists(self.currentDatabase(), 
functionName)
+return self._jcatalog.functionExists(dbName, functionName)
 
 def getFunction(self, functionName: str) -> Function:
 """Get the function with the specified name. This function can be a 
temporary function or a





[spark] branch branch-3.2 updated: [SPARK-39611][PYTHON][PS] Fix wrong aliases in __array_ufunc__

2022-07-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 6ae97e26bda [SPARK-39611][PYTHON][PS] Fix wrong aliases in 
__array_ufunc__
6ae97e26bda is described below

commit 6ae97e26bdaaed1c243441170d70e04cc9aa2e89
Author: Yikun Jiang 
AuthorDate: Tue Jul 5 20:52:36 2022 +0900

[SPARK-39611][PYTHON][PS] Fix wrong aliases in __array_ufunc__

### What changes were proposed in this pull request?
This PR fixes the wrong aliases in `__array_ufunc__`

### Why are the changes needed?
When running tests with numpy 1.23.0 (the current latest), we hit a bug: `NotImplementedError: pandas-on-Spark objects currently do not support .`

In `__array_ufunc__` we first call `maybe_dispatch_ufunc_to_dunder_op` to try the dunder methods, and then we try the pyspark API. `maybe_dispatch_ufunc_to_dunder_op` comes from pandas code.

pandas fixed this bug in https://github.com/pandas-dev/pandas/pull/44822#issuecomment-991166419 (commit https://github.com/pandas-dev/pandas/pull/44822/commits/206b2496bc6f6aa025cb26cb42f52abeec227741); since we are upgrading to numpy 1.23.0, we need to sync this as well. A standalone sketch of the dispatch follows.
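
A standalone sketch of the dispatch idea (heavily simplified from the pandas helper; the class and table here are toys, not pyspark code):

```python
import numpy as np

# Toy version of the ufunc-name -> dunder-name table the helper consults.
# "divide" must alias "truediv": Python 3 objects implement __truediv__,
# and the old Python 2 name "__div__" does not exist, hence the
# NotImplementedError seen in the tests.
UFUNC_ALIASES = {"true_divide": "truediv", "divide": "truediv"}

class Wrapped:
    """Toy stand-in for a pandas-on-Spark Series."""
    def __init__(self, values):
        self.values = values

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Depending on the numpy version, np.divide reports its name as
        # "divide" or "true_divide" -- the table must map both to "truediv".
        name = UFUNC_ALIASES.get(ufunc.__name__)
        if method == "__call__" and name is not None and inputs[0] is self:
            return getattr(self, "__%s__" % name)(inputs[1])
        return NotImplemented  # real code falls through to the pyspark API

    def __truediv__(self, other):
        return Wrapped([v / other for v in self.values])

print(np.divide(Wrapped([2.0, 4.0]), 2).values)  # [1.0, 2.0]
```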

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Current CI passed
- The existing UT `test_series_datetime` already covers this; I also tested it in my local env with 1.23.0
```shell
pip install "numpy==1.23.0"
python/run-tests --testnames 'pyspark.pandas.tests.test_series_datetime 
SeriesDateTimeTest.test_arithmetic_op_exceptions'
```

Closes #37078 from Yikun/SPARK-39611.

Authored-by: Yikun Jiang 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit fb48a14a67940b9270390b8ce74c19ae58e2880e)
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/pandas/numpy_compat.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/numpy_compat.py b/python/pyspark/pandas/numpy_compat.py
index d487a1bebf6..e6e7a681aa9 100644
--- a/python/pyspark/pandas/numpy_compat.py
+++ b/python/pyspark/pandas/numpy_compat.py
@@ -157,7 +157,7 @@ def maybe_dispatch_ufunc_to_dunder_op(
 "true_divide": "truediv",
 "power": "pow",
 "remainder": "mod",
-"divide": "div",
+"divide": "truediv",
 "equal": "eq",
 "not_equal": "ne",
 "less": "lt",





[spark] branch branch-3.3 updated: [SPARK-39611][PYTHON][PS] Fix wrong aliases in __array_ufunc__

2022-07-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 2edd344392a [SPARK-39611][PYTHON][PS] Fix wrong aliases in 
__array_ufunc__
2edd344392a is described below

commit 2edd344392a5ddb44f97449b8ad3c6292eb334e3
Author: Yikun Jiang 
AuthorDate: Tue Jul 5 20:52:36 2022 +0900

[SPARK-39611][PYTHON][PS] Fix wrong aliases in __array_ufunc__

### What changes were proposed in this pull request?
This PR fixes the wrong aliases in `__array_ufunc__`

### Why are the changes needed?
When running tests with numpy 1.23.0 (the current latest), we hit a bug: `NotImplementedError: pandas-on-Spark objects currently do not support .`

In `__array_ufunc__` we first call `maybe_dispatch_ufunc_to_dunder_op` to try the dunder methods, and then we try the pyspark API. `maybe_dispatch_ufunc_to_dunder_op` comes from pandas code.

pandas fixed this bug in https://github.com/pandas-dev/pandas/pull/44822#issuecomment-991166419 (commit https://github.com/pandas-dev/pandas/pull/44822/commits/206b2496bc6f6aa025cb26cb42f52abeec227741); since we are upgrading to numpy 1.23.0, we need to sync this as well.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Current CI passed
- The existing UT `test_series_datetime` already covers this; I also tested it in my local env with 1.23.0
```shell
pip install "numpy==1.23.0"
python/run-tests --testnames 'pyspark.pandas.tests.test_series_datetime 
SeriesDateTimeTest.test_arithmetic_op_exceptions'
```

Closes #37078 from Yikun/SPARK-39611.

Authored-by: Yikun Jiang 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit fb48a14a67940b9270390b8ce74c19ae58e2880e)
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/pandas/numpy_compat.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/numpy_compat.py b/python/pyspark/pandas/numpy_compat.py
index ea72fa658e4..f9b7bd67a9b 100644
--- a/python/pyspark/pandas/numpy_compat.py
+++ b/python/pyspark/pandas/numpy_compat.py
@@ -166,7 +166,7 @@ def maybe_dispatch_ufunc_to_dunder_op(
 "true_divide": "truediv",
 "power": "pow",
 "remainder": "mod",
-"divide": "div",
+"divide": "truediv",
 "equal": "eq",
 "not_equal": "ne",
 "less": "lt",





[spark] branch master updated: [SPARK-39611][PYTHON][PS] Fix wrong aliases in __array_ufunc__

2022-07-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new fb48a14a679 [SPARK-39611][PYTHON][PS] Fix wrong aliases in 
__array_ufunc__
fb48a14a679 is described below

commit fb48a14a67940b9270390b8ce74c19ae58e2880e
Author: Yikun Jiang 
AuthorDate: Tue Jul 5 20:52:36 2022 +0900

[SPARK-39611][PYTHON][PS] Fix wrong aliases in __array_ufunc__

### What changes were proposed in this pull request?
This PR fixes the wrong aliases in `__array_ufunc__`

### Why are the changes needed?
When running tests with numpy 1.23.0 (the current latest), we hit a bug: `NotImplementedError: pandas-on-Spark objects currently do not support .`

In `__array_ufunc__` we first call `maybe_dispatch_ufunc_to_dunder_op` to try the dunder methods, and then we try the pyspark API. `maybe_dispatch_ufunc_to_dunder_op` comes from pandas code.

pandas fixed this bug in https://github.com/pandas-dev/pandas/pull/44822#issuecomment-991166419 (commit https://github.com/pandas-dev/pandas/pull/44822/commits/206b2496bc6f6aa025cb26cb42f52abeec227741); since we are upgrading to numpy 1.23.0, we need to sync this as well.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Current CI passed
- The existing UT `test_series_datetime` already covers this; I also tested it in my local env with 1.23.0
```shell
pip install "numpy==1.23.0"
python/run-tests --testnames 'pyspark.pandas.tests.test_series_datetime 
SeriesDateTimeTest.test_arithmetic_op_exceptions'
```

Closes #37078 from Yikun/SPARK-39611.

Authored-by: Yikun Jiang 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/pandas/numpy_compat.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/numpy_compat.py b/python/pyspark/pandas/numpy_compat.py
index ea72fa658e4..f9b7bd67a9b 100644
--- a/python/pyspark/pandas/numpy_compat.py
+++ b/python/pyspark/pandas/numpy_compat.py
@@ -166,7 +166,7 @@ def maybe_dispatch_ufunc_to_dunder_op(
 "true_divide": "truediv",
 "power": "pow",
 "remainder": "mod",
-"divide": "div",
+"divide": "truediv",
 "equal": "eq",
 "not_equal": "ne",
 "less": "lt",





[spark] branch branch-3.3 updated: [SPARK-39612][SQL][TESTS] DataFrame.exceptAll followed by count should work

2022-07-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 364a4f52610 [SPARK-39612][SQL][TESTS] DataFrame.exceptAll followed by 
count should work
364a4f52610 is described below

commit 364a4f52610fdacdefc2d16af984900c55f8e31b
Author: Hyukjin Kwon 
AuthorDate: Tue Jul 5 20:44:43 2022 +0900

[SPARK-39612][SQL][TESTS] DataFrame.exceptAll followed by count should work

### What changes were proposed in this pull request?

This PR adds a test case for a scenario broken by https://github.com/apache/spark/commit/4b9343593eca780ca30ffda45244a71413577884, which was reverted in https://github.com/apache/spark/commit/161c596cafea9c235b5c918d8999c085401d73a9.

### Why are the changes needed?

To prevent a regression in the future. This was a regression in Apache 
Spark 3.3 that used to work in Apache Spark 3.2.

### Does this PR introduce _any_ user-facing change?

Yes, it makes `DataFrame.exceptAll` followed by `count` work (a repro sketch follows).
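
A hypothetical PySpark repro of the guarded scenario (the committed test below is in Scala; this is just the same assertion from Python):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("SPARK-39612-repro").getOrCreate()

df = spark.createDataFrame([("a",)], ["value"])
# On the affected builds this failed during analysis/planning;
# after the revert it returns 0 as expected.
assert df.exceptAll(df).count() == 0

spark.stop()
```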

### How was this patch tested?

The unit test was added.

Closes #37084 from HyukjinKwon/SPARK-39612.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 947e271402f749f6f58b79fecd59279eaf86db57)
Signed-off-by: Hyukjin Kwon 
---
 sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala | 5 +
 1 file changed, 5 insertions(+)

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
index 728ba3d6456..a4651c913c6 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
@@ -3215,6 +3215,11 @@ class DataFrameSuite extends QueryTest
   }
 }
   }
+
+  test("SPARK-39612: exceptAll with following count should work") {
+val d1 = Seq("a").toDF
+assert(d1.exceptAll(d1).count() === 0)
+  }
 }
 
 case class GroupByKey(a: Int, b: Int)





[spark] branch master updated: [SPARK-39612][SQL][TESTS] DataFrame.exceptAll followed by count should work

2022-07-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 947e271402f [SPARK-39612][SQL][TESTS] DataFrame.exceptAll followed by 
count should work
947e271402f is described below

commit 947e271402f749f6f58b79fecd59279eaf86db57
Author: Hyukjin Kwon 
AuthorDate: Tue Jul 5 20:44:43 2022 +0900

[SPARK-39612][SQL][TESTS] DataFrame.exceptAll followed by count should work

### What changes were proposed in this pull request?

This PR adds a test case for a scenario broken by https://github.com/apache/spark/commit/4b9343593eca780ca30ffda45244a71413577884, which was reverted in https://github.com/apache/spark/commit/161c596cafea9c235b5c918d8999c085401d73a9.

### Why are the changes needed?

To prevent a regression in the future. This was a regression in Apache 
Spark 3.3 that used to work in Apache Spark 3.2.

### Does this PR introduce _any_ user-facing change?

Yes, it makes `DataFrame.exceptAll` followed by `count` work.

### How was this patch tested?

The unit test was added.

Closes #37084 from HyukjinKwon/SPARK-39612.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala | 5 +
 1 file changed, 5 insertions(+)

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
index 4daa0a1b3b6..41593c701a7 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
@@ -3239,6 +3239,11 @@ class DataFrameSuite extends QueryTest
   }
 }
   }
+
+  test("SPARK-39612: exceptAll with following count should work") {
+val d1 = Seq("a").toDF
+assert(d1.exceptAll(d1).count() === 0)
+  }
 }
 
 case class GroupByKey(a: Int, b: Int)





[spark] branch master updated (4e42f8b12e8 -> 12698625b7e)

2022-07-05 Thread yumwang
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 4e42f8b12e8 [SPARK-39677][SQL][DOCS] Fix args formatting of the regexp 
and like functions
 add 12698625b7e [SPARK-39606][SQL] Use child stats to estimate order 
operator

No new revisions were added by this update.

Summary of changes:
 .../plans/logical/statsEstimation/BasicStatsPlanVisitor.scala | 4 +---
 .../logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala | 2 +-
 .../sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala  | 2 +-
 3 files changed, 3 insertions(+), 5 deletions(-)





[spark] branch branch-3.3 updated: [SPARK-39677][SQL][DOCS] Fix args formatting of the regexp and like functions

2022-07-05 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 2069fd03fd3 [SPARK-39677][SQL][DOCS] Fix args formatting of the regexp 
and like functions
2069fd03fd3 is described below

commit 2069fd03fd30faaabd1d73ca0416a76ab5908937
Author: Max Gekk 
AuthorDate: Tue Jul 5 13:37:41 2022 +0300

[SPARK-39677][SQL][DOCS] Fix args formatting of the regexp and like 
functions

### What changes were proposed in this pull request?
In the PR, I propose to fix args formatting of some regexp functions by 
adding explicit new lines. That fixes the following items in arg lists.

Before:

https://user-images.githubusercontent.com/1580697/177274234-04209d43-a542-4c71-b5ca-6f3239208015.png

After:

https://user-images.githubusercontent.com/1580697/177280718-cb05184c-8559-4461-b94d-dfaaafda7dd2.png

### Why are the changes needed?
To improve readability of Spark SQL docs.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By building docs and checking manually:
```
$ SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 bundle exec jekyll build
```

Closes #37082 from MaxGekk/fix-regexp-docs.

Authored-by: Max Gekk 
Signed-off-by: Max Gekk 
(cherry picked from commit 4e42f8b12e8dc57a15998f22d508a19cf3c856aa)
Signed-off-by: Max Gekk 
---
 .../catalyst/expressions/regexpExpressions.scala   | 46 --
 1 file changed, 16 insertions(+), 30 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
index 01763f082d6..e3eea6f46e2 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
@@ -84,16 +84,12 @@ abstract class StringRegexExpression extends BinaryExpression
 Arguments:
   * str - a string expression
   * pattern - a string expression. The pattern is a string which is matched literally, with
-  exception to the following special symbols:
-
-  _ matches any one character in the input (similar to . in posix regular expressions)
-
+  exception to the following special symbols:
+  _ matches any one character in the input (similar to . in posix regular expressions)\
   % matches zero or more characters in the input (similar to .* in posix regular
-  expressions)
-
+  expressions)
   Since Spark 2.0, string literals are unescaped in our SQL parser. For example, in order
-  to match "\abc", the pattern should be "\\abc".
-
+  to match "\abc", the pattern should be "\\abc".
   When SQL config 'spark.sql.parser.escapedStringLiterals' is enabled, it falls back
   to Spark 1.6 behavior regarding string literal parsing. For example, if the config is
   enabled, the pattern to match "\abc" should be "\abc".
@@ -189,7 +185,7 @@ case class Like(left: Expression, right: Expression, escapeChar: Char)
 copy(left = newLeft, right = newRight)
 }
 
-// scalastyle:off line.contains.tab
+// scalastyle:off line.contains.tab line.size.limit
 /**
  * Simple RegEx case-insensitive pattern matching function
  */
@@ -200,16 +196,12 @@ case class Like(left: Expression, right: Expression, escapeChar: Char)
 Arguments:
   * str - a string expression
   * pattern - a string expression. The pattern is a string which is matched literally and
-  case-insensitively, with exception to the following special symbols:
-
-  _ matches any one character in the input (similar to . in posix regular expressions)
-
+  case-insensitively, with exception to the following special symbols:
+  _ matches any one character in the input (similar to . in posix regular expressions)
   % matches zero or more characters in the input (similar to .* in posix regular
-  expressions)
-
+  expressions)
   Since Spark 2.0, string literals are unescaped in our SQL parser. For example, in order
-  to match "\abc", the pattern should be "\\abc".
-
+  to match "\abc", the pattern should be "\\abc".
   When SQL config 'spark.sql.parser.escapedStringLiterals' is enabled, it falls back
   to Spark 1.6 behavior regarding string literal parsing. For example, if the config is
   enabled, the pattern to match "\abc" should be "\abc".
@@ -237,7 +229,7 @@ case class Like(left: Expression, right: Expression, escapeChar: Char)
   """,
   since = "3.3.0",
   group = 

[spark] branch master updated (161c596cafe -> 4e42f8b12e8)

2022-07-05 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 161c596cafe Revert "[SPARK-38531][SQL] Fix the condition of "Prune 
unrequired child index" branch of ColumnPruning"
 add 4e42f8b12e8 [SPARK-39677][SQL][DOCS] Fix args formatting of the regexp 
and like functions

No new revisions were added by this update.

Summary of changes:
 .../catalyst/expressions/regexpExpressions.scala   | 46 --
 1 file changed, 16 insertions(+), 30 deletions(-)





[spark] branch branch-3.3 updated: Revert "[SPARK-38531][SQL] Fix the condition of "Prune unrequired child index" branch of ColumnPruning"

2022-07-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 4512e094303 Revert "[SPARK-38531][SQL] Fix the condition of "Prune 
unrequired child index" branch of ColumnPruning"
4512e094303 is described below

commit 4512e0943036d30587ab19a95efb0e66b47dd746
Author: Hyukjin Kwon 
AuthorDate: Tue Jul 5 18:02:37 2022 +0900

Revert "[SPARK-38531][SQL] Fix the condition of "Prune unrequired child 
index" branch of ColumnPruning"

This reverts commit 17c56fc03b8e7269b293d6957c542eab9d723d52.
---
 .../catalyst/optimizer/NestedColumnAliasing.scala  | 19 -
 .../spark/sql/catalyst/optimizer/Optimizer.scala   | 15 +-
 .../catalyst/optimizer/ColumnPruningSuite.scala| 32 --
 3 files changed, 8 insertions(+), 58 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
index 6ba7907fdab..977e9b1ab13 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
@@ -314,25 +314,6 @@ object NestedColumnAliasing {
   }
 }
 
-object GeneratorUnrequiredChildrenPruning {
-  def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
-case p @ Project(_, g: Generate) =>
-  val requiredAttrs = p.references ++ g.generator.references
-  val newChild = ColumnPruning.prunedChild(g.child, requiredAttrs)
-  val unrequired = g.generator.references -- p.references
-  val unrequiredIndices = newChild.output.zipWithIndex.filter(t => unrequired.contains(t._1))
-.map(_._2)
-  if (!newChild.fastEquals(g.child) ||
-unrequiredIndices.toSet != g.unrequiredChildIndex.toSet) {
-Some(p.copy(child = g.copy(child = newChild, unrequiredChildIndex = unrequiredIndices)))
-  } else {
-None
-  }
-case _ => None
-  }
-}
-
-
 /**
  * This prunes unnecessary nested columns from [[Generate]], or [[Project]] -> 
[[Generate]]
  */
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
index 02f9a9eb01c..21903976656 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -842,12 +842,13 @@ object ColumnPruning extends Rule[LogicalPlan] {
   e.copy(child = prunedChild(child, e.references))
 
 // prune unrequired references
-// There are 2 types of pruning here:
-// 1. For attributes in g.child.outputSet that is not used by the generator nor the project,
-//we directly remove it from the output list of g.child.
-// 2. For attributes that is not used by the project but it is used by the generator, we put
-//it in g.unrequiredChildIndex to save memory usage.
-case GeneratorUnrequiredChildrenPruning(rewrittenPlan) => rewrittenPlan
+case p @ Project(_, g: Generate) if p.references != g.outputSet =>
+  val requiredAttrs = p.references -- g.producedAttributes ++ g.generator.references
+  val newChild = prunedChild(g.child, requiredAttrs)
+  val unrequired = g.generator.references -- p.references
+  val unrequiredIndices = newChild.output.zipWithIndex.filter(t => unrequired.contains(t._1))
+.map(_._2)
+  p.copy(child = g.copy(child = newChild, unrequiredChildIndex = unrequiredIndices))
 
 // prune unrequired nested fields from `Generate`.
 case GeneratorNestedColumnAliasing(rewrittenPlan) => rewrittenPlan
@@ -907,7 +908,7 @@ object ColumnPruning extends Rule[LogicalPlan] {
   })
 
   /** Applies a projection only when the child is producing unnecessary attributes */
-  def prunedChild(c: LogicalPlan, allReferences: AttributeSet): LogicalPlan =
+  private def prunedChild(c: LogicalPlan, allReferences: AttributeSet) =
 if (!c.outputSet.subsetOf(allReferences)) {
   Project(c.output.filter(allReferences.contains), c)
 } else {
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala
index 0101c855152..0655acbcb1b 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala
@@ -24,7 +24,6 @@ import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.dsl.plans._

[spark] branch master updated (b37defef418 -> 161c596cafe)

2022-07-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from b37defef418 [SPARK-39564][SQL][FOLLOWUP] Consider the case of serde 
available in CatalogTable on explain string for LogicalRelation
 add 161c596cafe Revert "[SPARK-38531][SQL] Fix the condition of "Prune 
unrequired child index" branch of ColumnPruning"

No new revisions were added by this update.

Summary of changes:
 .../catalyst/optimizer/NestedColumnAliasing.scala  | 19 -
 .../spark/sql/catalyst/optimizer/Optimizer.scala   | 15 +-
 .../catalyst/optimizer/ColumnPruningSuite.scala| 32 --
 3 files changed, 8 insertions(+), 58 deletions(-)





[spark] branch master updated: [SPARK-39564][SQL][FOLLOWUP] Consider the case of serde available in CatalogTable on explain string for LogicalRelation

2022-07-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b37defef418 [SPARK-39564][SQL][FOLLOWUP] Consider the case of serde 
available in CatalogTable on explain string for LogicalRelation
b37defef418 is described below

commit b37defef4183c51e64fd9629d26ca6aecd320a1b
Author: Jungtaek Lim 
AuthorDate: Tue Jul 5 16:47:34 2022 +0800

[SPARK-39564][SQL][FOLLOWUP] Consider the case of serde available in 
CatalogTable on explain string for LogicalRelation

### What changes were proposed in this pull request?

This PR is a follow-up of #36963.

With the change of #36963, LogicalRelation prints out the catalog table 
differently in explain on logical plan. It only prints out the qualified table 
name, and optionally also prints the class of serde if the information is 
available.

#36963 does not account for the part that can be optionally written (the serde class). While this does not break current tests, it would be ideal to address the issue beforehand.

This PR proposes to introduce a new internal config to exclude serde on 
output of CatalogTable in SQL explain, which is intended to use only for test 
e.g. SQLQueryTestSuite.

### Why are the changes needed?

It could become a bug in the future, though only on the test side.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing UTs. The internal test failed because of this (it exposed the 
serde class), and we fixed the test with this patch.

Closes #37042 from HeartSaVioR/SPARK-39564-followup.

Authored-by: Jungtaek Lim 
Signed-off-by: Wenchen Fan 
---
 .../main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala   | 6 +-
 .../src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala | 3 +++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
index 8081a3edc81..71d8a0740bc 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
@@ -886,7 +886,11 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product with Tre
 
   private def stringArgsForCatalogTable(table: CatalogTable): Seq[Any] = {
 table.storage.serde match {
-  case Some(serde) => table.identifier :: serde :: Nil
+  case Some(serde)
+// SPARK-39564: don't print out serde to avoid introducing complicated and error-prone
+// regex magic.
+if !SQLConf.get.getConfString("spark.test.noSerdeInExplain", "false").toBoolean =>
+table.identifier :: serde :: Nil
   case _ => table.identifier :: Nil
 }
   }
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala
index 0601dce1d4b..bd48d173039 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala
@@ -147,6 +147,9 @@ class SQLQueryTestSuite extends QueryTest with SharedSparkSession with SQLHelper
 .set(SQLConf.SHUFFLE_PARTITIONS, 4)
 // use Java 8 time API to handle negative years properly
 .set(SQLConf.DATETIME_JAVA8API_ENABLED, true)
+// SPARK-39564: don't print out serde to avoid introducing complicated and error-prone
+// regex magic.
+.set("spark.test.noSerdeInExplain", "true")
 
   // SPARK-32106 Since we add SQL test 'transform.sql' will use `cat` command,
   // here we need to ignore it.





[spark] branch branch-3.3 updated: [SPARK-39676][CORE][TESTS] Add task partition id for TaskInfo assertEquals method in JsonProtocolSuite

2022-07-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 3f969ada5fe [SPARK-39676][CORE][TESTS] Add task partition id for 
TaskInfo assertEquals method in JsonProtocolSuite
3f969ada5fe is described below

commit 3f969ada5fecddab272f2abbc849d2591f30f44c
Author: Qian.Sun 
AuthorDate: Tue Jul 5 15:40:44 2022 +0900

[SPARK-39676][CORE][TESTS] Add task partition id for TaskInfo assertEquals 
method in JsonProtocolSuite

### What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/35185, a task partition id was added to TaskInfo, but JsonProtocolSuite#assertEquals for TaskInfo doesn't check partitionId.

### Why are the changes needed?

The suite should assert whether partitionId is equal.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No need to add unit test.

Closes #37081 from dcoliversun/SPARK-39676.

Authored-by: Qian.Sun 
Signed-off-by: Hyukjin Kwon 
---
 core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala b/core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala
index 36b61f67e3b..3b7929b278e 100644
--- a/core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala
+++ b/core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala
@@ -790,6 +790,8 @@ private[spark] object JsonProtocolSuite extends Assertions {
 assert(info1.taskId === info2.taskId)
 assert(info1.index === info2.index)
 assert(info1.attemptNumber === info2.attemptNumber)
+// The "Partition ID" field was added in Spark 3.3.0
+assert(info1.partitionId === info2.partitionId)
 assert(info1.launchTime === info2.launchTime)
 assert(info1.executorId === info2.executorId)
 assert(info1.host === info2.host)





[spark] branch master updated (3c9b296928a -> 863c945b707)

2022-07-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 3c9b296928a [SPARK-39453][SQL][TESTS][FOLLOWUP] Let `RAND` in filter 
is more meaningful
 add 863c945b707 [SPARK-39676][CORE][TESTS] Add task partition id for 
TaskInfo assertEquals method in JsonProtocolSuite

No new revisions were added by this update.

Summary of changes:
 core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala | 2 ++
 1 file changed, 2 insertions(+)

