[spark] branch master updated (356665799cb -> 0be27b6735d)

2022-09-07 Thread mridulm80
This is an automated email from the ASF dual-hosted git repository.

mridulm80 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 356665799cb [SPARK-38909][BUILD][CORE][YARN][FOLLOWUP] Make some code 
cleanup related to shuffle state db
 add 0be27b6735d [SPARK-40186][CORE][YARN] Ensure `mergedShuffleCleaner` 
have been shutdown before `db` close

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/network/util/TransportConf.java   |  9 
 .../network/shuffle/RemoteBlockPushResolver.java   | 57 --
 .../network/shuffle/ShuffleTestAccessor.scala  |  4 ++
 .../network/yarn/YarnShuffleServiceSuite.scala | 26 ++
 4 files changed, 93 insertions(+), 3 deletions(-)
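
Below is a hedged sketch of the ordering the commit title describes, not the actual `RemoteBlockPushResolver` code; the executor setup, method name, and timeout are assumptions for illustration only.

```scala
import java.util.concurrent.{Executors, ExecutorService, TimeUnit}

object ShutdownBeforeClose {
  // Hypothetical stand-in for the merged shuffle cleaner thread pool.
  val mergedShuffleCleaner: ExecutorService = Executors.newSingleThreadExecutor()

  // Shut the cleaner down and wait for it to terminate before closing the
  // backing DB, so no cleanup task can run against an already-closed DB.
  def stopCleanerThenCloseDb(db: AutoCloseable): Unit = {
    mergedShuffleCleaner.shutdown()
    if (!mergedShuffleCleaner.awaitTermination(10, TimeUnit.SECONDS)) {
      mergedShuffleCleaner.shutdownNow()
    }
    db.close()
  }
}
```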


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (32ec7662416 -> 356665799cb)

2022-09-07 Thread mridulm80
This is an automated email from the ASF dual-hosted git repository.

mridulm80 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 32ec7662416 [SPARK-40365][BUILD] Bump ANTLR runtime version from 4.8 
to 4.9.3
 add 356665799cb [SPARK-38909][BUILD][CORE][YARN][FOLLOWUP] Make some code 
cleanup related to shuffle state db

No new revisions were added by this update.

Summary of changes:
 common/network-common/pom.xml   |  1 -
 .../main/java/org/apache/spark/network/shuffledb/DB.java|  3 +++
 .../java/org/apache/spark/network/shuffledb/DBIterator.java |  3 +++
 .../org/apache/spark/network/shuffledb/LevelDBIterator.java |  5 -
 .../spark/network/shuffle/ExternalShuffleBlockResolver.java | 13 +
 .../spark/network/shuffle/RemoteBlockPushResolver.java  | 13 +
 .../org/apache/spark/network/yarn/YarnShuffleService.java   |  4 ++--
 .../org/apache/spark/deploy/ExternalShuffleService.scala|  8 
 .../spark/deploy/yarn/YarnShuffleIntegrationSuite.scala |  4 ++--
 .../apache/spark/network/yarn/YarnShuffleServiceSuite.scala |  4 +++-
 10 files changed, 27 insertions(+), 31 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (333140fe908 -> 32ec7662416)

2022-09-07 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 333140fe908 [SPARK-40291][SQL] Improve the message for column not in 
group by clause error
 add 32ec7662416 [SPARK-40365][BUILD] Bump ANTLR runtime version from 4.8 
to 4.9.3

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-2-hive-2.3 | 2 +-
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml   | 3 ++-
 3 files changed, 4 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (3ff2def958b -> 333140fe908)

2022-09-07 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 3ff2def958b [SPARK-40293][SQL] Make the V2 table error message more 
meaningful
 add 333140fe908 [SPARK-40291][SQL] Improve the message for column not in 
group by clause error

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/error-classes.json   |  6 ++
 .../sql/catalyst/analysis/CheckAnalysis.scala  |  6 +-
 .../spark/sql/errors/QueryCompilationErrors.scala  |  7 +++
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala |  2 +-
 .../sql-tests/results/group-by-filter.sql.out  | 16 +--
 .../resources/sql-tests/results/group-by.sql.out   | 24 +++---
 .../sql-tests/results/grouping_set.sql.out |  8 +++-
 .../results/postgreSQL/create_view.sql.out |  8 +++-
 .../sql-tests/results/udf/udf-group-by.sql.out | 24 +++---
 .../apache/spark/sql/execution/SQLViewSuite.scala  | 10 -
 10 files changed, 90 insertions(+), 21 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (127ccc208aa -> 3ff2def958b)

2022-09-07 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 127ccc208aa [SPARK-40295][SQL] Allow v2 functions with literal args in 
write distribution/ordering
 add 3ff2def958b [SPARK-40293][SQL] Make the V2 table error message more 
meaningful

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/error-classes.json   |  5 ++
 .../spark/sql/errors/QueryCompilationErrors.scala  | 13 ++---
 .../catalyst/analysis/ResolveSessionCatalog.scala  | 55 +-
 .../datasources/v2/DataSourceV2Strategy.scala  | 11 +++--
 .../sql-tests/results/change-column.sql.out| 30 ++--
 .../spark/sql/connector/DataSourceV2SQLSuite.scala |  8 +++-
 .../spark/sql/connector/DeleteFromTests.scala  |  8 +++-
 .../spark/sql/execution/command/DDLSuite.scala |  8 +++-
 .../execution/command/PlanResolutionSuite.scala|  9 +++-
 9 files changed, 107 insertions(+), 40 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-40295][SQL] Allow v2 functions with literal args in write distribution/ordering

2022-09-07 Thread sunchao
This is an automated email from the ASF dual-hosted git repository.

sunchao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 127ccc208aa [SPARK-40295][SQL] Allow v2 functions with literal args in 
write distribution/ordering
127ccc208aa is described below

commit 127ccc208aa8fd03f53dcb926087f1e72531bdbf
Author: aokolnychyi 
AuthorDate: Wed Sep 7 09:15:56 2022 -0700

[SPARK-40295][SQL] Allow v2 functions with literal args in write 
distribution/ordering

### What changes were proposed in this pull request?

This PR adapts `V2ExpressionUtils` to support arbitrary transforms with 
multiple args that are either references or literals.

### Why are the changes needed?

After PR #36995, data sources can request a distribution and ordering that 
reference v2 functions. However, if a data source needs a transform with 
multiple input args, or a transform where not all args are references, Spark 
throws an exception.
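
For illustration only (this is not code from the patch; the `truncate` function name and column are assumptions), a v2 write could request clustering by a named transform whose arguments mix a field reference and a literal, which is the shape of expression this change lets `V2ExpressionUtils` translate:

```scala
import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
import org.apache.spark.sql.connector.expressions.{Expression, Expressions, LiteralValue}
import org.apache.spark.sql.types.IntegerType

// A named transform with two args: a column reference and a literal.
val truncate: Expression =
  Expressions.apply("truncate", Expressions.column("value"), LiteralValue(4, IntegerType))

// What a data source might return from
// RequiresDistributionAndOrdering#requiredDistribution().
val requiredDistribution: Distribution =
  Distributions.clustered(Array[Expression](truncate))
```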

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR adapts the test added recently in PR #36995.

Closes #37749 from aokolnychyi/spark-40295.

Lead-authored-by: aokolnychyi 
Co-authored-by: Anton Okolnychyi 
Signed-off-by: Chao Sun 
---
 .../catalyst/expressions/V2ExpressionUtils.scala   | 17 +---
 .../sql/catalyst/plans/physical/partitioning.scala | 20 ++
 .../sql/connector/catalog/InMemoryBaseTable.scala  |  8 ++
 .../datasources/v2/DataSourceV2ScanExecBase.scala  | 17 
 .../connector/KeyGroupedPartitioningSuite.scala| 29 ++--
 .../WriteDistributionAndOrderingSuite.scala| 32 --
 .../catalog/functions/transformFunctions.scala | 19 +
 7 files changed, 117 insertions(+), 25 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala
index 64eb307bb9f..06ecf79c58c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala
@@ -28,7 +28,7 @@ import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.connector.catalog.{FunctionCatalog, Identifier}
 import org.apache.spark.sql.connector.catalog.functions._
 import 
org.apache.spark.sql.connector.catalog.functions.ScalarFunction.MAGIC_METHOD_NAME
-import org.apache.spark.sql.connector.expressions.{BucketTransform, Expression 
=> V2Expression, FieldReference, IdentityTransform, NamedReference, 
NamedTransform, NullOrdering => V2NullOrdering, SortDirection => 
V2SortDirection, SortOrder => V2SortOrder, SortValue, Transform}
+import org.apache.spark.sql.connector.expressions.{BucketTransform, Expression 
=> V2Expression, FieldReference, IdentityTransform, Literal => V2Literal, 
NamedReference, NamedTransform, NullOrdering => V2NullOrdering, SortDirection 
=> V2SortDirection, SortOrder => V2SortOrder, SortValue, Transform}
 import org.apache.spark.sql.errors.QueryCompilationErrors
 import org.apache.spark.sql.types._
 
@@ -75,6 +75,8 @@ object V2ExpressionUtils extends SQLConfHelper with Logging {
   query: LogicalPlan,
   funCatalogOpt: Option[FunctionCatalog] = None): Option[Expression] = {
 expr match {
+  case l: V2Literal[_] =>
+Some(Literal.create(l.value, l.dataType))
   case t: Transform =>
 toCatalystTransformOpt(t, query, funCatalogOpt)
   case SortValue(child, direction, nullOrdering) =>
@@ -105,18 +107,13 @@ object V2ExpressionUtils extends SQLConfHelper with 
Logging {
   TransformExpression(bound, resolvedRefs, Some(numBuckets))
 }
   }
-case NamedTransform(name, refs)
-if refs.length == 1 && refs.forall(_.isInstanceOf[NamedReference]) =>
-  val resolvedRefs = refs.map(_.asInstanceOf[NamedReference]).map { r =>
-resolveRef[NamedExpression](r, query)
-  }
+case NamedTransform(name, args) =>
+  val catalystArgs = args.map(toCatalyst(_, query, funCatalogOpt))
   funCatalogOpt.flatMap { catalog =>
-loadV2FunctionOpt(catalog, name, resolvedRefs).map { bound =>
-  TransformExpression(bound, resolvedRefs)
+loadV2FunctionOpt(catalog, name, catalystArgs).map { bound =>
+  TransformExpression(bound, catalystArgs)
 }
   }
-case _ =>
-  throw new AnalysisException(s"Transform $trans is not currently 
supported")
   }
 
   private def loadV2FunctionOpt(
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
 

[spark] branch branch-3.2 updated: [SPARK-40149][SQL][3.2] Propagate metadata columns through Project

2022-09-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new d566017de44 [SPARK-40149][SQL][3.2] Propagate metadata columns through 
Project
d566017de44 is described below

commit d566017de441beebfb62d9d9271defd4041ffdc4
Author: Wenchen Fan 
AuthorDate: Wed Sep 7 23:44:54 2022 +0800

[SPARK-40149][SQL][3.2] Propagate metadata columns through Project

Backport of https://github.com/apache/spark/pull/37758 to branch-3.2.

### What changes were proposed in this pull request?

This PR fixes a regression caused by 
https://github.com/apache/spark/pull/32017 .

In https://github.com/apache/spark/pull/32017, we tried to be more 
conservative and decided not to propagate metadata columns in certain 
operators, including `Project`. However, the decision was made considering 
only the SQL API, not the DataFrame API. In fact, it is very common to chain 
`Project` operators in the DataFrame API, e.g. 
`df.withColumn(...).withColumn(...)...`, and it is very inconvenient if 
metadata columns are not propagated through `Project`.

This PR makes 2 changes:
1. Project should propagate metadata columns
2. SubqueryAlias should only propagate metadata columns if the child is a 
leaf node or also a SubqueryAlias

The second change is needed to still forbid weird queries like `SELECT m 
from (SELECT a from t)`, which is the main motivation of 
https://github.com/apache/spark/pull/32017 .

After propagating metadata columns, a problem from 
https://github.com/apache/spark/pull/31666 is exposed: the natural join 
metadata columns may confuse the analyzer and lead to a wrong analyzed plan. 
For example, in `SELECT t1.value FROM t1 LEFT JOIN t2 USING (key) ORDER BY key`, 
how should we resolve `ORDER BY key`? It should be resolved to `t1.key` via the 
rule `ResolveMissingReferences`, since `t1.key` is in the output of the left 
join. However, if `Project` can propagate metadata columns, `ORDER [...]

To solve this problem, this PR only allows qualified access to the metadata 
columns of a natural join. This is not a breaking change, as previously people 
could only use qualified access for natural join metadata columns anyway, in 
the `Project` right after the `Join`. It actually enables more use cases, as 
people can now access natural join metadata columns in ORDER BY. I've added a 
test for it.
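
A minimal sketch of the newly allowed access pattern (illustration only, not the test added by this PR; table and column names are made up): the right-side key of a USING join is a hidden metadata column, and it can now be referenced with a qualified name in ORDER BY.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.range(5).selectExpr("id AS key", "id * 10 AS value").createOrReplaceTempView("t1")
spark.range(5).selectExpr("id AS key", "id * 100 AS other").createOrReplaceTempView("t2")

// t2.key is only exposed as a metadata column of the USING join; qualified
// access in ORDER BY resolves after this change instead of failing analysis.
spark.sql(
  """SELECT t1.value
    |FROM t1 LEFT JOIN t2 USING (key)
    |ORDER BY t2.key""".stripMargin).show()
```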

### Why are the changes needed?

fix a regression

### Does this PR introduce _any_ user-facing change?

For the SQL API, there is no change, as a `SubqueryAlias` always comes with a 
`Project` or `Aggregate`, so we still don't propagate metadata columns through 
a SELECT group.

For the DataFrame API, the behavior becomes more lenient. The only breaking 
case is when an operator that can propagate metadata columns is followed by a 
`SubqueryAlias`, e.g. `df.filter(...).as("t").select("t.metadata_col")`. But 
this is a weird use case and I don't think we should have supported it in the 
first place.

### How was this patch tested?

new tests

Closes #37818 from cloud-fan/backport.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/analysis/Analyzer.scala |   8 +-
 .../spark/sql/catalyst/analysis/unresolved.scala   |   2 +-
 .../spark/sql/catalyst/expressions/package.scala   |  13 +-
 .../plans/logical/basicLogicalOperators.scala  |  13 +-
 .../apache/spark/sql/catalyst/util/package.scala   |  15 +-
 .../test/resources/sql-tests/inputs/using-join.sql |   2 +
 .../resources/sql-tests/results/using-join.sql.out |  11 ++
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 218 
 .../spark/sql/connector/MetadataColumnSuite.scala  | 219 +
 9 files changed, 263 insertions(+), 238 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 305184d..8d6261a7847 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -1043,9 +1043,11 @@ class Analyzer(override val catalogManager: 
CatalogManager)
 private def addMetadataCol(plan: LogicalPlan): LogicalPlan = plan match {
   case r: DataSourceV2Relation => r.withMetadataColumns()
   case p: Project =>
-p.copy(
+val newProj = p.copy(
   projectList = p.metadataOutput ++ p.projectList,
   child = addMetadataCol(p.child))
+newProj.copyTagsFrom(p)
+newProj
   case _ => plan.withNewChildren(plan.children.map(addMetadataCol))
 }
   }
@@ -3480,8 +3482,8 @@ class Analyzer(override val catalogManager: 
CatalogManager)
 val 

[GitHub] [spark-website] srowen closed pull request #411: Fix dead link in documentation.html and third-party-projects.html

2022-09-07 Thread GitBox


srowen closed pull request #411: Fix dead link in documentation.html and 
third-party-projects.html
URL: https://github.com/apache/spark-website/pull/411


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark-website] branch asf-site updated: Fix dead link in documentation.html and third-party-projects.html

2022-09-07 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new b8b47eeff Fix dead link in documentation.html and 
third-party-projects.html
b8b47eeff is described below

commit b8b47eeffb0c1167b1eb9ef4b331dcd7e223d167
Author: yangjie01 
AuthorDate: Wed Sep 7 08:30:23 2022 -0500

Fix dead link in documentation.html and third-party-projects.html

This PR is the first part of SPARK-40322; there are a total of 5 pages with 
dead links:

- documentation.html
- third-party-projects.html
- release-process.html
- news
- powered-by.html

This PR fixes `documentation.html` and `third-party-projects.html`.

Author: yangjie01 

Closes #411 from LuciferYang/deadlink-p1.
---
 documentation.md   | 3 +--
 site/documentation.html| 3 +--
 site/third-party-projects.html | 4 +---
 third-party-projects.md| 4 +---
 4 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/documentation.md b/documentation.md
index 4b71656c4..f75f9dcf2 100644
--- a/documentation.md
+++ b/documentation.md
@@ -77,7 +77,7 @@ navigation:
 
 
 The documentation linked to above covers getting started with Spark, as 
well the built-in components MLlib,
-Spark 
Streaming, and GraphX.
+Spark 
Streaming, and GraphX.
 
 In addition, this page lists other resources for learning Spark.
 
@@ -176,7 +176,6 @@ Slides, videos and EC2-based exercises from each of these 
are available online:
 External Tutorials, Blog Posts, and Talks
 
 
-  http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark;>Using 
Parquet and Scrooge with Spark  Scala-friendly Parquet and Avro 
usage tutorial from Ooyala's Evan Chan
   http://codeforhire.com/2014/02/18/using-spark-with-mongodb/;>Using Spark 
with MongoDB  by Sampo Niskanen from Wellmo
   https://spark-summit.org/2013;>Spark Summit 2013  
contained 30 talks about Spark use cases, available as slides and videos
   http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/;>A 
Powerful Big Data Trio: Spark, Parquet and Avro  Using Parquet in 
Spark by Matt Massie
diff --git a/site/documentation.html b/site/documentation.html
index ef0095ce6..cd0fc8cdf 100644
--- a/site/documentation.html
+++ b/site/documentation.html
@@ -192,7 +192,7 @@
 
 
 The documentation linked to above covers getting started with Spark, as 
well the built-in components MLlib,
-Spark Streaming, 
and GraphX.
+Spark Streaming, 
and GraphX.
 
 In addition, this page lists other resources for learning Spark.
 
@@ -290,7 +290,6 @@ Slides, videos and EC2-based exercises from each of these 
are available online:
 External Tutorials, Blog Posts, and Talks
 
 
-  http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark;>Using 
Parquet and Scrooge with Spark  Scala-friendly Parquet and Avro 
usage tutorial from Ooyala's Evan Chan
   http://codeforhire.com/2014/02/18/using-spark-with-mongodb/;>Using Spark 
with MongoDB  by Sampo Niskanen from Wellmo
   https://spark-summit.org/2013;>Spark Summit 2013  
contained 30 talks about Spark use cases, available as slides and videos
   http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/;>A 
Powerful Big Data Trio: Spark, Parquet and Avro  Using Parquet in 
Spark by Matt Massie
diff --git a/site/third-party-projects.html b/site/third-party-projects.html
index bb07e2e8e..fedefe081 100644
--- a/site/third-party-projects.html
+++ b/site/third-party-projects.html
@@ -170,14 +170,12 @@ against Spark, and data scientists to use Javascript in 
Jupyter notebooks.
 Mahout has switched to using Spark as the backend
   https://wiki.apache.org/mrql/;>Apache MRQL - A query 
processing and optimization 
 system for large-scale, distributed data analysis, built on top of Apache 
Hadoop, Hama, and Spark
-  http://blinkdb.org/;>BlinkDB - a massively parallel, 
approximate query engine built 
+  https://github.com/sameeragarwal/blinkdb;>BlinkDB - a 
massively parallel, approximate query engine built 
 on top of Shark and Spark
   https://github.com/adobe-research/spindle;>Spindle - 
Spark/Parquet-based web 
 analytics query engine
   https://github.com/thunderain-project/thunderain;>Thunderain - a 
framework 
 for combining stream processing with historical data, think Lambda 
architecture
-  https://github.com/AyasdiOpenSource/df;>DF from Ayasdi - a 
Pandas-like data frame 
-implementation for Spark
   https://github.com/OryxProject/oryx;>Oryx -  Lambda 
architecture on Apache Spark, 
 Apache Kafka for real-time large scale machine learning
   https://github.com/bigdatagenomics/adam;>ADAM - A framework 
and CLI for loading, 
diff --git a/third-party-projects.md b/third-party-projects.md
index b08c18e13..1db600c35 100644
--- a/third-party-projects.md
+++ b/third-party-projects.md
@@ -52,14 +52,12 @@ against Spark, and data 

[spark] branch master updated: [SPARK-40185][SQL] Remove column suggestion when the candidate list is empty

2022-09-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 32567e94b8a [SPARK-40185][SQL] Remove column suggestion when the 
candidate list is empty
32567e94b8a is described below

commit 32567e94b8ad2550d8b0b4d73e2dfd441d426ecc
Author: Vitalii Li 
AuthorDate: Wed Sep 7 19:50:33 2022 +0800

[SPARK-40185][SQL] Remove column suggestion when the candidate list is empty

### What changes were proposed in this pull request?

1. Remove the column, attribute, or map key suggestion from `UNRESOLVED_*` 
errors if the candidate list is empty.
2. Sort suggested columns by closeness to the unresolved column.
3. Limit the number of candidates to 5. Previously the entire list of existing 
columns was shown as suggestions.

### Why are the changes needed?

When the list of candidates is empty, the error message looks incomplete:

`[UNRESOLVED_COLUMN] A column or function parameter with name 'YrMo' cannot 
be resolved. Did you mean one of the following? []`

This PR introduces a `GENERIC` error subclass without a suggestion and a 
`WITH_SUGGESTION` subclass whose error message includes the suggested 
fields/columns:

`[UNRESOLVED_COLUMN.GENERIC] A column or function parameter with name 
'YrMo' cannot be resolved.`

OR

`[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with 
name 'YrMo' cannot be resolved. Did you mean one of the following? 
['YearAndMonth', 'Year', 'Month']`

In addition, suggested column names are sorted by Levenshtein distance and 
capped at 5.
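
A rough sketch of the ordering and capping described above (not the Spark implementation; the helper and the use of commons-text's `LevenshteinDistance` are assumptions for illustration):

```scala
import org.apache.commons.text.similarity.LevenshteinDistance

// Order candidates by edit distance to the unresolved name and keep at most
// five; an empty result corresponds to the GENERIC subclass, a non-empty one
// to WITH_SUGGESTION.
def suggest(unresolved: String, candidates: Seq[String], max: Int = 5): Seq[String] = {
  val distance = LevenshteinDistance.getDefaultInstance
  candidates.sortBy(c => distance.apply(unresolved, c).intValue()).take(max)
}

// suggest("YrMo", Seq("Month", "YearAndMonth", "Year"))
//   => Seq("Year", "Month", "YearAndMonth")
```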

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #37621 from vitaliili-db/SC-108622.

Authored-by: Vitalii Li 
Signed-off-by: Wenchen Fan 
---
 core/src/main/resources/error/error-classes.json   | 42 +-
 .../org/apache/spark/SparkThrowableSuite.scala | 16 ++--
 .../spark/sql/catalyst/analysis/Analyzer.scala | 11 ++-
 .../plans/logical/basicLogicalOperators.scala  |  5 +-
 .../spark/sql/errors/QueryCompilationErrors.scala  | 33 +---
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 21 -
 .../sql/catalyst/analysis/AnalysisSuite.scala  | 12 ++-
 .../spark/sql/catalyst/analysis/AnalysisTest.scala | 23 +-
 .../catalyst/analysis/ResolveSubquerySuite.scala   | 24 +++---
 .../catalyst/analysis/V2WriteAnalysisSuite.scala   | 14 +++-
 .../results/columnresolution-negative.sql.out  | 12 ++-
 .../resources/sql-tests/results/group-by.sql.out   |  6 +-
 .../sql-tests/results/join-lateral.sql.out | 15 ++--
 .../sql-tests/results/natural-join.sql.out |  3 +-
 .../test/resources/sql-tests/results/pivot.sql.out |  6 +-
 .../results/postgreSQL/aggregates_part1.sql.out|  3 +-
 .../results/postgreSQL/create_view.sql.out |  4 +-
 .../sql-tests/results/postgreSQL/join.sql.out  | 28 ---
 .../results/postgreSQL/select_having.sql.out   |  3 +-
 .../results/postgreSQL/select_implicit.sql.out |  6 +-
 .../sql-tests/results/postgreSQL/union.sql.out |  3 +-
 .../sql-tests/results/query_regex_column.sql.out   | 24 --
 .../negative-cases/invalid-correlation.sql.out |  3 +-
 .../sql-tests/results/table-aliases.sql.out|  3 +-
 .../udf/postgreSQL/udf-aggregates_part1.sql.out|  3 +-
 .../results/udf/postgreSQL/udf-join.sql.out| 28 ---
 .../udf/postgreSQL/udf-select_having.sql.out   |  3 +-
 .../udf/postgreSQL/udf-select_implicit.sql.out |  6 +-
 .../sql-tests/results/udf/udf-group-by.sql.out |  3 +-
 .../sql-tests/results/udf/udf-pivot.sql.out|  6 +-
 .../apache/spark/sql/DataFrameFunctionsSuite.scala | 89 --
 .../apache/spark/sql/DataFrameSelfJoinSuite.scala  |  3 +-
 .../apache/spark/sql/DataFrameToSchemaSuite.scala  |  4 +-
 .../spark/sql/DataFrameWindowFunctionsSuite.scala  |  9 ++-
 .../scala/org/apache/spark/sql/DatasetSuite.scala  | 44 ++-
 .../org/apache/spark/sql/DatasetUnpivotSuite.scala |  9 ++-
 .../org/apache/spark/sql/SQLInsertTestSuite.scala  | 10 ++-
 .../scala/org/apache/spark/sql/SubquerySuite.scala | 11 ++-
 .../test/scala/org/apache/spark/sql/UDFSuite.scala | 11 ++-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 19 +++--
 .../sql/errors/QueryCompilationErrorsSuite.scala   | 11 ++-
 .../apache/spark/sql/execution/SQLViewSuite.scala  |  6 +-
 .../execution/command/v2/DescribeTableSuite.scala  |  6 +-
 .../org/apache/spark/sql/sources/InsertSuite.scala | 11 +--
 .../apache/spark/sql/hive/HiveParquetSuite.scala   |  9 ++-
 45 files changed, 423 insertions(+), 198 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json 
b/core/src/main/resources/error/error-classes.json
index b923d5a39e0..f39ee465768 100644
--- 

[spark] branch branch-3.3 updated: add back a mistakenly removed test case

2022-09-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 81cb08b7b3a add back a mistakenly removed test case
81cb08b7b3a is described below

commit 81cb08b7b3ae6a4ccfa9787ec39a6041fae8143f
Author: Wenchen Fan 
AuthorDate: Wed Sep 7 19:29:39 2022 +0800

add back a mistakenly removed test case
---
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 24 ++
 1 file changed, 24 insertions(+)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala
index 304c77fd003..44f97f55713 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala
@@ -2288,6 +2288,30 @@ class DataSourceV2SQLSuite
 }
   }
 
+  test("SPARK-34561: drop/add columns to a dataset of `DESCRIBE TABLE`") {
+val tbl = s"${catalogAndNamespace}tbl"
+withTable(tbl) {
+  sql(s"CREATE TABLE $tbl (c0 INT) USING $v2Format")
+  val description = sql(s"DESCRIBE TABLE $tbl")
+  val noCommentDataset = description.drop("comment")
+  val expectedSchema = new StructType()
+.add(
+  name = "col_name",
+  dataType = StringType,
+  nullable = false,
+  metadata = new MetadataBuilder().putString("comment", "name of the 
column").build())
+.add(
+  name = "data_type",
+  dataType = StringType,
+  nullable = false,
+  metadata = new MetadataBuilder().putString("comment", "data type of 
the column").build())
+  assert(noCommentDataset.schema === expectedSchema)
+  val isNullDataset = noCommentDataset
+.withColumn("is_null", noCommentDataset("col_name").isNull)
+  assert(isNullDataset.schema === expectedSchema.add("is_null", 
BooleanType, false))
+}
+  }
+
   test("SPARK-34576: drop/add columns to a dataset of `DESCRIBE COLUMN`") {
 val tbl = s"${catalogAndNamespace}tbl"
 withTable(tbl) {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.3 updated: [SPARK-40149][SQL] Propagate metadata columns through Project

2022-09-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 433469f284e [SPARK-40149][SQL] Propagate metadata columns through 
Project
433469f284e is described below

commit 433469f284ee24150f6cff4005d39a70e91cc4d9
Author: Wenchen Fan 
AuthorDate: Wed Sep 7 18:45:20 2022 +0800

[SPARK-40149][SQL] Propagate metadata columns through Project

This PR fixes a regression caused by 
https://github.com/apache/spark/pull/32017 .

In https://github.com/apache/spark/pull/32017, we tried to be more 
conservative and decided not to propagate metadata columns in certain 
operators, including `Project`. However, the decision was made considering 
only the SQL API, not the DataFrame API. In fact, it is very common to chain 
`Project` operators in the DataFrame API, e.g. 
`df.withColumn(...).withColumn(...)...`, and it is very inconvenient if 
metadata columns are not propagated through `Project`.
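
As a small sketch of that DataFrame pattern (illustration only, not a test from this PR; the parquet path, the `value` column, and the use of the file source's hidden `_metadata` column are assumptions): each `withColumn` adds a `Project`, and with this change the metadata column remains resolvable afterwards.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Each withColumn below introduces a Project node on top of the file scan.
val df = spark.read.parquet("/tmp/events")
  .withColumn("doubled", col("value") * 2)
  .withColumn("tripled", col("value") * 3)

// With metadata columns propagated through Project, the hidden `_metadata`
// column of the file source is still accessible here.
df.select(col("doubled"), col("_metadata.file_path")).show()
```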

This PR makes 2 changes:
1. Project should propagate metadata columns
2. SubqueryAlias should only propagate metadata columns if the child is a 
leaf node or also a SubqueryAlias

The second change is needed to still forbid weird queries like `SELECT m 
from (SELECT a from t)`, which is the main motivation of 
https://github.com/apache/spark/pull/32017 .

After propagating metadata columns, a problem from 
https://github.com/apache/spark/pull/31666 is exposed: the natural join 
metadata columns may confuse the analyzer and lead to a wrong analyzed plan. 
For example, in `SELECT t1.value FROM t1 LEFT JOIN t2 USING (key) ORDER BY key`, 
how should we resolve `ORDER BY key`? It should be resolved to `t1.key` via the 
rule `ResolveMissingReferences`, since `t1.key` is in the output of the left 
join. However, if `Project` can propagate metadata columns, `ORDER [...]

To solve this problem, this PR only allows qualified access to the metadata 
columns of a natural join. This is not a breaking change, as previously people 
could only use qualified access for natural join metadata columns anyway, in 
the `Project` right after the `Join`. It actually enables more use cases, as 
people can now access natural join metadata columns in ORDER BY. I've added a 
test for it.

fix a regression

For the SQL API, there is no change, as a `SubqueryAlias` always comes with a 
`Project` or `Aggregate`, so we still don't propagate metadata columns through 
a SELECT group.

For the DataFrame API, the behavior becomes more lenient. The only breaking 
case is when an operator that can propagate metadata columns is followed by a 
`SubqueryAlias`, e.g. `df.filter(...).as("t").select("t.metadata_col")`. But 
this is a weird use case and I don't think we should have supported it in the 
first place.

new tests

Closes #37758 from cloud-fan/metadata.

Authored-by: Wenchen Fan 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 99ae1d9a897909990881f14c5ea70a0d1a0bf456)
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/analysis/Analyzer.scala |   8 +-
 .../spark/sql/catalyst/analysis/unresolved.scala   |   2 +-
 .../spark/sql/catalyst/expressions/package.scala   |  13 +-
 .../plans/logical/basicLogicalOperators.scala  |  13 +-
 .../apache/spark/sql/catalyst/util/package.scala   |  15 +-
 .../test/resources/sql-tests/inputs/using-join.sql |   2 +
 .../resources/sql-tests/results/using-join.sql.out |  11 +
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 242 -
 .../spark/sql/connector/MetadataColumnSuite.scala  | 219 +++
 9 files changed, 263 insertions(+), 262 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 37024e15377..3a3997ff9c7 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -967,9 +967,11 @@ class Analyzer(override val catalogManager: CatalogManager)
 private def addMetadataCol(plan: LogicalPlan): LogicalPlan = plan match {
   case s: ExposesMetadataColumns => s.withMetadataColumns()
   case p: Project =>
-p.copy(
+val newProj = p.copy(
   projectList = p.metadataOutput ++ p.projectList,
   child = addMetadataCol(p.child))
+newProj.copyTagsFrom(p)
+newProj
   case _ => plan.withNewChildren(plan.children.map(addMetadataCol))
 }
   }
@@ -3475,8 +3477,8 @@ class Analyzer(override val catalogManager: 
CatalogManager)
 val project = Project(projectList, Join(left, right, joinType, 
newCondition, hint))
 project.setTagValue(
   Project.hiddenOutputTag,
-  

[spark] branch master updated (e6c58c1bd6f -> 99ae1d9a897)

2022-09-07 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from e6c58c1bd6f [SPARK-40273][PYTHON][DOCS] Fix the documents 
"Contributing and Maintaining Type Hints"
 add 99ae1d9a897 [SPARK-40149][SQL] Propagate metadata columns through 
Project

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/analysis/Analyzer.scala |   8 +-
 .../spark/sql/catalyst/analysis/unresolved.scala   |   2 +-
 .../spark/sql/catalyst/expressions/package.scala   |  13 +-
 .../plans/logical/basicLogicalOperators.scala  |  13 +-
 .../apache/spark/sql/catalyst/util/package.scala   |  15 +-
 .../test/resources/sql-tests/inputs/using-join.sql |   2 +
 .../resources/sql-tests/results/using-join.sql.out |  11 ++
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 218 
 .../spark/sql/connector/MetadataColumnSuite.scala  | 219 +
 9 files changed, 263 insertions(+), 238 deletions(-)
 create mode 100644 
sql/core/src/test/scala/org/apache/spark/sql/connector/MetadataColumnSuite.scala


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-40273][PYTHON][DOCS] Fix the documents "Contributing and Maintaining Type Hints"

2022-09-07 Thread zero323
This is an automated email from the ASF dual-hosted git repository.

zero323 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e6c58c1bd6f [SPARK-40273][PYTHON][DOCS] Fix the documents 
"Contributing and Maintaining Type Hints"
e6c58c1bd6f is described below

commit e6c58c1bd6f64ebfb337348fa6132c0b230dc932
Author: itholic 
AuthorDate: Wed Sep 7 11:29:45 2022 +0200

[SPARK-40273][PYTHON][DOCS] Fix the documents "Contributing and Maintaining 
Type Hints"

### What changes were proposed in this pull request?

This PR proposes to fix the [Contributing and Maintaining Type 
Hints](https://spark.apache.org/docs/latest/api/python/development/contributing.html#contributing-and-maintaining-type-hints)
 section, since the existing type hints in the stub files have all been ported 
to inline type hints.

### Why are the changes needed?

We no longer use stub files for type hints, so the documentation needs to be 
updated as well.

### Does this PR introduce _any_ user-facing change?

Yes, the documentation change.

### How was this patch tested?

The existing documentation build should pass

Closes #37724 from itholic/SPARK-40273.

Authored-by: itholic 
Signed-off-by: zero323 
---
 python/docs/source/development/contributing.rst | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/python/docs/source/development/contributing.rst 
b/python/docs/source/development/contributing.rst
index 9780d6eca4e..3d388e91012 100644
--- a/python/docs/source/development/contributing.rst
+++ b/python/docs/source/development/contributing.rst
@@ -155,10 +155,7 @@ Now, you can start developing and `running the tests 
`_.
 Contributing and Maintaining Type Hints
 
 
-PySpark type hints are provided using stub files, placed in the same directory 
as the annotated module, with exception to:
-
-* ``# type: ignore`` in modules which don't have their own stubs (tests, 
examples and non-public API). 
-* pandas API on Spark (``pyspark.pandas`` package) where the type hints are 
inlined.
+PySpark type hints are inlined, to take advantage of static type checking.
 
 As a rule of thumb, only public API is annotated.
 
@@ -166,7 +163,7 @@ Annotations should, when possible:
 
 * Reflect expectations of the underlying JVM API, to help avoid type related 
failures outside Python interpreter.
 * In case of conflict between too broad (``Any``) and too narrow argument 
annotations, prefer the latter as one, as long as it is covering most of the 
typical use cases.
-* Indicate nonsensical combinations of arguments using ``@overload``  
annotations. For example, to indicate that ``*Col`` and ``*Cols`` arguments are 
mutually exclusive:
+* Indicate nonsensical combinations of arguments using ``@overload`` 
annotations. For example, to indicate that ``*Col`` and ``*Cols`` arguments are 
mutually exclusive:
 
   .. code-block:: python
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org