[spark] branch master updated (356665799cb -> 0be27b6735d)
This is an automated email from the ASF dual-hosted git repository.

mridulm80 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 356665799cb [SPARK-38909][BUILD][CORE][YARN][FOLLOWUP] Make some code cleanup related to shuffle state db
     add 0be27b6735d [SPARK-40186][CORE][YARN] Ensure `mergedShuffleCleaner` have been shutdown before `db` close

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/network/util/TransportConf.java | 9
 .../network/shuffle/RemoteBlockPushResolver.java | 57 --
 .../network/shuffle/ShuffleTestAccessor.scala | 4 ++
 .../network/yarn/YarnShuffleServiceSuite.scala | 26 ++
 4 files changed, 93 insertions(+), 3 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
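The ordering this commit enforces is: shut down the merged-shuffle cleaner executor and wait for in-flight work before closing the shuffle state DB, so no cleanup task touches a closed database. A minimal sketch of that pattern in Scala, with assumed names and timeout (not the actual RemoteBlockPushResolver code):

import java.util.concurrent.{Executors, TimeUnit}

// Sketch only: the executor, timeout, and close() shape are assumptions.
val mergedShuffleCleaner = Executors.newSingleThreadExecutor()

def closeResolver(db: AutoCloseable): Unit = {
  // 1. Stop accepting new cleanup tasks and wait for running ones to finish.
  mergedShuffleCleaner.shutdown()
  if (!mergedShuffleCleaner.awaitTermination(10, TimeUnit.SECONDS)) {
    mergedShuffleCleaner.shutdownNow()
  }
  // 2. Only after the cleaner is quiesced is it safe to close the state DB.
  db.close()
}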
[spark] branch master updated (32ec7662416 -> 356665799cb)
This is an automated email from the ASF dual-hosted git repository.

mridulm80 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 32ec7662416 [SPARK-40365][BUILD] Bump ANTLR runtime version from 4.8 to 4.9.3
     add 356665799cb [SPARK-38909][BUILD][CORE][YARN][FOLLOWUP] Make some code cleanup related to shuffle state db

No new revisions were added by this update.

Summary of changes:
 common/network-common/pom.xml | 1 -
 .../main/java/org/apache/spark/network/shuffledb/DB.java | 3 +++
 .../java/org/apache/spark/network/shuffledb/DBIterator.java | 3 +++
 .../org/apache/spark/network/shuffledb/LevelDBIterator.java | 5 -
 .../spark/network/shuffle/ExternalShuffleBlockResolver.java | 13 +
 .../spark/network/shuffle/RemoteBlockPushResolver.java | 13 +
 .../org/apache/spark/network/yarn/YarnShuffleService.java | 4 ++--
 .../org/apache/spark/deploy/ExternalShuffleService.scala | 8
 .../spark/deploy/yarn/YarnShuffleIntegrationSuite.scala | 4 ++--
 .../apache/spark/network/yarn/YarnShuffleServiceSuite.scala | 4 +++-
 10 files changed, 27 insertions(+), 31 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (333140fe908 -> 32ec7662416)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 333140fe908 [SPARK-40291][SQL] Improve the message for column not in group by clause error
     add 32ec7662416 [SPARK-40365][BUILD] Bump ANTLR runtime version from 4.8 to 4.9.3

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-2-hive-2.3 | 2 +-
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml | 3 ++-
 3 files changed, 4 insertions(+), 3 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (3ff2def958b -> 333140fe908)
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 3ff2def958b [SPARK-40293][SQL] Make the V2 table error message more meaningful
     add 333140fe908 [SPARK-40291][SQL] Improve the message for column not in group by clause error

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/error-classes.json | 6 ++
 .../sql/catalyst/analysis/CheckAnalysis.scala | 6 +-
 .../spark/sql/errors/QueryCompilationErrors.scala | 7 +++
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 2 +-
 .../sql-tests/results/group-by-filter.sql.out | 16 +--
 .../resources/sql-tests/results/group-by.sql.out | 24 +++---
 .../sql-tests/results/grouping_set.sql.out | 8 +++-
 .../results/postgreSQL/create_view.sql.out | 8 +++-
 .../sql-tests/results/udf/udf-group-by.sql.out | 24 +++---
 .../apache/spark/sql/execution/SQLViewSuite.scala | 10 -
 10 files changed, 90 insertions(+), 21 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (127ccc208aa -> 3ff2def958b)
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 127ccc208aa [SPARK-40295][SQL] Allow v2 functions with literal args in write distribution/ordering
     add 3ff2def958b [SPARK-40293][SQL] Make the V2 table error message more meaningful

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/error-classes.json | 5 ++
 .../spark/sql/errors/QueryCompilationErrors.scala | 13 ++---
 .../catalyst/analysis/ResolveSessionCatalog.scala | 55 +-
 .../datasources/v2/DataSourceV2Strategy.scala | 11 +++--
 .../sql-tests/results/change-column.sql.out | 30 ++--
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 8 +++-
 .../spark/sql/connector/DeleteFromTests.scala | 8 +++-
 .../spark/sql/execution/command/DDLSuite.scala | 8 +++-
 .../execution/command/PlanResolutionSuite.scala | 9 +++-
 9 files changed, 107 insertions(+), 40 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-40295][SQL] Allow v2 functions with literal args in write distribution/ordering
This is an automated email from the ASF dual-hosted git repository. sunchao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 127ccc208aa [SPARK-40295][SQL] Allow v2 functions with literal args in write distribution/ordering 127ccc208aa is described below commit 127ccc208aa8fd03f53dcb926087f1e72531bdbf Author: aokolnychyi AuthorDate: Wed Sep 7 09:15:56 2022 -0700 [SPARK-40295][SQL] Allow v2 functions with literal args in write distribution/ordering ### What changes were proposed in this pull request? This PR adapts `V2ExpressionUtils` to support arbitrary transforms with multiple args that are either references or literals. ### Why are the changes needed? After PR #36995, data sources can request distribution and ordering that reference v2 functions. If a data source needs a transform with multiple input args or a transform where not all args are references, Spark will throw an exception. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR adapts the test added recently in PR #36995. Closes #37749 from aokolnychyi/spark-40295. Lead-authored-by: aokolnychyi Co-authored-by: Anton Okolnychyi Signed-off-by: Chao Sun --- .../catalyst/expressions/V2ExpressionUtils.scala | 17 +--- .../sql/catalyst/plans/physical/partitioning.scala | 20 ++ .../sql/connector/catalog/InMemoryBaseTable.scala | 8 ++ .../datasources/v2/DataSourceV2ScanExecBase.scala | 17 .../connector/KeyGroupedPartitioningSuite.scala| 29 ++-- .../WriteDistributionAndOrderingSuite.scala| 32 -- .../catalog/functions/transformFunctions.scala | 19 + 7 files changed, 117 insertions(+), 25 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala index 64eb307bb9f..06ecf79c58c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala @@ -28,7 +28,7 @@ import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan import org.apache.spark.sql.connector.catalog.{FunctionCatalog, Identifier} import org.apache.spark.sql.connector.catalog.functions._ import org.apache.spark.sql.connector.catalog.functions.ScalarFunction.MAGIC_METHOD_NAME -import org.apache.spark.sql.connector.expressions.{BucketTransform, Expression => V2Expression, FieldReference, IdentityTransform, NamedReference, NamedTransform, NullOrdering => V2NullOrdering, SortDirection => V2SortDirection, SortOrder => V2SortOrder, SortValue, Transform} +import org.apache.spark.sql.connector.expressions.{BucketTransform, Expression => V2Expression, FieldReference, IdentityTransform, Literal => V2Literal, NamedReference, NamedTransform, NullOrdering => V2NullOrdering, SortDirection => V2SortDirection, SortOrder => V2SortOrder, SortValue, Transform} import org.apache.spark.sql.errors.QueryCompilationErrors import org.apache.spark.sql.types._ @@ -75,6 +75,8 @@ object V2ExpressionUtils extends SQLConfHelper with Logging { query: LogicalPlan, funCatalogOpt: Option[FunctionCatalog] = None): Option[Expression] = { expr match { + case l: V2Literal[_] => +Some(Literal.create(l.value, l.dataType)) case t: Transform => toCatalystTransformOpt(t, query, funCatalogOpt) case SortValue(child, direction, nullOrdering) => @@ -105,18 +107,13 @@ 
object V2ExpressionUtils extends SQLConfHelper with Logging { TransformExpression(bound, resolvedRefs, Some(numBuckets)) } } -case NamedTransform(name, refs) -if refs.length == 1 && refs.forall(_.isInstanceOf[NamedReference]) => - val resolvedRefs = refs.map(_.asInstanceOf[NamedReference]).map { r => -resolveRef[NamedExpression](r, query) - } +case NamedTransform(name, args) => + val catalystArgs = args.map(toCatalyst(_, query, funCatalogOpt)) funCatalogOpt.flatMap { catalog => -loadV2FunctionOpt(catalog, name, resolvedRefs).map { bound => - TransformExpression(bound, resolvedRefs) +loadV2FunctionOpt(catalog, name, catalystArgs).map { bound => + TransformExpression(bound, catalystArgs) } } -case _ => - throw new AnalysisException(s"Transform $trans is not currently supported") } private def loadV2FunctionOpt( diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
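The practical effect of SPARK-40295 is that a connector can now ask for a write distribution built from a v2 function transform whose arguments mix column references and literals. A hedged sketch of such a request on the connector side, assuming a FunctionCatalog that resolves a hypothetical truncate(col, width) function (the surrounding Table/WriteBuilder plumbing is omitted):

import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
import org.apache.spark.sql.connector.expressions.{Expression, Expressions, SortOrder}
import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering

// Sketch only: "truncate" is an assumed catalog function, not a Spark built-in.
class TruncateClusteredWrite extends RequiresDistributionAndOrdering {
  override def requiredDistribution(): Distribution = {
    // A transform with one reference arg and one literal arg; before this change
    // Spark only accepted transforms whose args were all references.
    val truncateTransform = Expressions.apply(
      "truncate", Expressions.column("name"), Expressions.literal(4))
    Distributions.clustered(Array[Expression](truncateTransform))
  }
  override def requiredOrdering(): Array[SortOrder] = Array.empty[SortOrder]
  // remaining Write methods keep their interface defaults in this sketch
}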
[spark] branch branch-3.2 updated: [SPARK-40149][SQL][3.2] Propagate metadata columns through Project
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new d566017de44 [SPARK-40149][SQL][3.2] Propagate metadata columns through Project d566017de44 is described below commit d566017de441beebfb62d9d9271defd4041ffdc4 Author: Wenchen Fan AuthorDate: Wed Sep 7 23:44:54 2022 +0800 [SPARK-40149][SQL][3.2] Propagate metadata columns through Project backport https://github.com/apache/spark/pull/37758 to 3.2 ### What changes were proposed in this pull request? This PR fixes a regression caused by https://github.com/apache/spark/pull/32017 . In https://github.com/apache/spark/pull/32017 , we tried to be more conservative and decided to not propagate metadata columns in certain operators, including `Project`. However, the decision was made only considering SQL API, not DataFrame API. In fact, it's very common to chain `Project` operators in DataFrame, e.g. `df.withColumn(...).withColumn(...)...`, and it's very inconvenient if metadata columns are not propagated through `Project`. This PR makes 2 changes: 1. Project should propagate metadata columns 2. SubqueryAlias should only propagate metadata columns if the child is a leaf node or also a SubqueryAlias The second change is needed to still forbid weird queries like `SELECT m from (SELECT a from t)`, which is the main motivation of https://github.com/apache/spark/pull/32017 . After propagating metadata columns, a problem from https://github.com/apache/spark/pull/31666 is exposed: the natural join metadata columns may confuse the analyzer and lead to wrong analyzed plan. For example, `SELECT t1.value FROM t1 LEFT JOIN t2 USING (key) ORDER BY key`, how shall we resolve `ORDER BY key`? It should be resolved to `t1.key` via the rule `ResolveMissingReferences`, which is in the output of the left join. However, if `Project` can propagate metadata columns, `ORDER [...] To solve this problem, this PR only allows qualified access for metadata columns of natural join. This has no breaking change, as people can only do qualified access for natural join metadata columns before, in the `Project` right after `Join`. This actually enables more use cases, as people can now access natural join metadata columns in ORDER BY. I've added a test for it. ### Why are the changes needed? fix a regression ### Does this PR introduce _any_ user-facing change? For SQL API, there is no change, as a `SubqueryAlias` always comes with a `Project` or `Aggregate`, so we still don't propagate metadata columns through a SELECT group. For DataFrame API, the behavior becomes more lenient. The only breaking case is an operator that can propagate metadata columns then follows a `SubqueryAlias`, e.g. `df.filter(...).as("t").select("t.metadata_col")`. But this is a weird use case and I don't think we should support it at the first place. ### How was this patch tested? new tests Closes #37818 from cloud-fan/backport. 
Authored-by: Wenchen Fan Signed-off-by: Wenchen Fan --- .../spark/sql/catalyst/analysis/Analyzer.scala | 8 +- .../spark/sql/catalyst/analysis/unresolved.scala | 2 +- .../spark/sql/catalyst/expressions/package.scala | 13 +- .../plans/logical/basicLogicalOperators.scala | 13 +- .../apache/spark/sql/catalyst/util/package.scala | 15 +- .../test/resources/sql-tests/inputs/using-join.sql | 2 + .../resources/sql-tests/results/using-join.sql.out | 11 ++ .../spark/sql/connector/DataSourceV2SQLSuite.scala | 218 .../spark/sql/connector/MetadataColumnSuite.scala | 219 + 9 files changed, 263 insertions(+), 238 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index 305184d..8d6261a7847 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -1043,9 +1043,11 @@ class Analyzer(override val catalogManager: CatalogManager) private def addMetadataCol(plan: LogicalPlan): LogicalPlan = plan match { case r: DataSourceV2Relation => r.withMetadataColumns() case p: Project => -p.copy( +val newProj = p.copy( projectList = p.metadataOutput ++ p.projectList, child = addMetadataCol(p.child)) +newProj.copyTagsFrom(p) +newProj case _ => plan.withNewChildren(plan.children.map(addMetadataCol)) } } @@ -3480,8 +3482,8 @@ class Analyzer(override val catalogManager: CatalogManager) val
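The regression this backport fixes is most visible from the DataFrame API, where every withColumn call adds a Project on top of the plan. A hedged sketch of the usage pattern the fix re-enables (it assumes an active SparkSession `spark` and a v2 table whose source exposes a metadata column named metadata_col; both names are illustrative):

import org.apache.spark.sql.functions.col

// Sketch only: table and metadata column names are assumptions.
val df = spark.table("testcat.ns.t")
  .withColumn("doubled", col("value") * 2)    // adds a Project
  .withColumn("positive", col("value") > 0)   // adds another Project

// Before the fix the chained Projects dropped the metadata output and this
// reference failed to resolve; with the fix the metadata column stays reachable.
df.select(col("metadata_col"), col("doubled")).show()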
[GitHub] [spark-website] srowen closed pull request #411: Fix dead link in documentation.html and third-party-projects.html
srowen closed pull request #411: Fix dead link in documentation.html and third-party-projects.html
URL: https://github.com/apache/spark-website/pull/411

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark-website] branch asf-site updated: Fix dead link in documentation.html and third-party-projects.html
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/spark-website.git The following commit(s) were added to refs/heads/asf-site by this push: new b8b47eeff Fix dead link in documentation.html and third-party-projects.html b8b47eeff is described below commit b8b47eeffb0c1167b1eb9ef4b331dcd7e223d167 Author: yangjie01 AuthorDate: Wed Sep 7 08:30:23 2022 -0500 Fix dead link in documentation.html and third-party-projects.html This pr is the first part of SPARK-40322, there are a total of 5 pages with dead links: - documentation.html - third-party-projects.html - release-process.html - news - powered-by.html This pr fix `documentation.html` and `third-party-projects.html` Author: yangjie01 Closes #411 from LuciferYang/deadlink-p1. --- documentation.md | 3 +-- site/documentation.html| 3 +-- site/third-party-projects.html | 4 +--- third-party-projects.md| 4 +--- 4 files changed, 4 insertions(+), 10 deletions(-) diff --git a/documentation.md b/documentation.md index 4b71656c4..f75f9dcf2 100644 --- a/documentation.md +++ b/documentation.md @@ -77,7 +77,7 @@ navigation: The documentation linked to above covers getting started with Spark, as well the built-in components MLlib, -Spark Streaming, and GraphX. +Spark Streaming, and GraphX. In addition, this page lists other resources for learning Spark. @@ -176,7 +176,6 @@ Slides, videos and EC2-based exercises from each of these are available online: External Tutorials, Blog Posts, and Talks - http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark;>Using Parquet and Scrooge with Spark Scala-friendly Parquet and Avro usage tutorial from Ooyala's Evan Chan http://codeforhire.com/2014/02/18/using-spark-with-mongodb/;>Using Spark with MongoDB by Sampo Niskanen from Wellmo https://spark-summit.org/2013;>Spark Summit 2013 contained 30 talks about Spark use cases, available as slides and videos http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/;>A Powerful Big Data Trio: Spark, Parquet and Avro Using Parquet in Spark by Matt Massie diff --git a/site/documentation.html b/site/documentation.html index ef0095ce6..cd0fc8cdf 100644 --- a/site/documentation.html +++ b/site/documentation.html @@ -192,7 +192,7 @@ The documentation linked to above covers getting started with Spark, as well the built-in components MLlib, -Spark Streaming, and GraphX. +Spark Streaming, and GraphX. In addition, this page lists other resources for learning Spark. @@ -290,7 +290,6 @@ Slides, videos and EC2-based exercises from each of these are available online: External Tutorials, Blog Posts, and Talks - http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark;>Using Parquet and Scrooge with Spark Scala-friendly Parquet and Avro usage tutorial from Ooyala's Evan Chan http://codeforhire.com/2014/02/18/using-spark-with-mongodb/;>Using Spark with MongoDB by Sampo Niskanen from Wellmo https://spark-summit.org/2013;>Spark Summit 2013 contained 30 talks about Spark use cases, available as slides and videos http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/;>A Powerful Big Data Trio: Spark, Parquet and Avro Using Parquet in Spark by Matt Massie diff --git a/site/third-party-projects.html b/site/third-party-projects.html index bb07e2e8e..fedefe081 100644 --- a/site/third-party-projects.html +++ b/site/third-party-projects.html @@ -170,14 +170,12 @@ against Spark, and data scientists to use Javascript in Jupyter notebooks. 
Mahout has switched to using Spark as the backend https://wiki.apache.org/mrql/;>Apache MRQL - A query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark - http://blinkdb.org/;>BlinkDB - a massively parallel, approximate query engine built + https://github.com/sameeragarwal/blinkdb;>BlinkDB - a massively parallel, approximate query engine built on top of Shark and Spark https://github.com/adobe-research/spindle;>Spindle - Spark/Parquet-based web analytics query engine https://github.com/thunderain-project/thunderain;>Thunderain - a framework for combining stream processing with historical data, think Lambda architecture - https://github.com/AyasdiOpenSource/df;>DF from Ayasdi - a Pandas-like data frame -implementation for Spark https://github.com/OryxProject/oryx;>Oryx - Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning https://github.com/bigdatagenomics/adam;>ADAM - A framework and CLI for loading, diff --git a/third-party-projects.md b/third-party-projects.md index b08c18e13..1db600c35 100644 --- a/third-party-projects.md +++ b/third-party-projects.md @@ -52,14 +52,12 @@ against Spark, and data
[spark] branch master updated: [SPARK-40185][SQL] Remove column suggestion when the candidate list is empty
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 32567e94b8a [SPARK-40185][SQL] Remove column suggestion when the candidate list is empty 32567e94b8a is described below commit 32567e94b8ad2550d8b0b4d73e2dfd441d426ecc Author: Vitalii Li AuthorDate: Wed Sep 7 19:50:33 2022 +0800 [SPARK-40185][SQL] Remove column suggestion when the candidate list is empty ### What changes were proposed in this pull request? 1. Remove column, attribute or map key suggestion from `UNRESOLVED_*` error if candidate list is empty. 2. Sort suggested columns by closeness to unresolved column 3. Limit number of candidates to 5. Previously entire list of existing columns were shown as suggestions. ### Why are the changes needed? When the list of candidates is empty the error message looks incomplete: `[UNRESOLVED_COLUMN] A column or function parameter with name 'YrMo' cannot be resolved. Did you mean one of the following? []` This PR is to introduce `GENERIC` error subclass without suggestion and `WITH_SUGGESTION` subclass where error message includes suggested fields/columns: `[UNRESOLVED_COLUMN.GENERIC] A column or function parameter with name 'YrMo' cannot be resolved.` OR `[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name 'YrMo' cannot be resolved. Did you mean one of the following? ['YearAndMonth', 'Year', 'Month']` In addition suggested column names are sorted by Levenstein distance and capped to 5. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #37621 from vitaliili-db/SC-108622. Authored-by: Vitalii Li Signed-off-by: Wenchen Fan --- core/src/main/resources/error/error-classes.json | 42 +- .../org/apache/spark/SparkThrowableSuite.scala | 16 ++-- .../spark/sql/catalyst/analysis/Analyzer.scala | 11 ++- .../plans/logical/basicLogicalOperators.scala | 5 +- .../spark/sql/errors/QueryCompilationErrors.scala | 33 +--- .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 21 - .../sql/catalyst/analysis/AnalysisSuite.scala | 12 ++- .../spark/sql/catalyst/analysis/AnalysisTest.scala | 23 +- .../catalyst/analysis/ResolveSubquerySuite.scala | 24 +++--- .../catalyst/analysis/V2WriteAnalysisSuite.scala | 14 +++- .../results/columnresolution-negative.sql.out | 12 ++- .../resources/sql-tests/results/group-by.sql.out | 6 +- .../sql-tests/results/join-lateral.sql.out | 15 ++-- .../sql-tests/results/natural-join.sql.out | 3 +- .../test/resources/sql-tests/results/pivot.sql.out | 6 +- .../results/postgreSQL/aggregates_part1.sql.out| 3 +- .../results/postgreSQL/create_view.sql.out | 4 +- .../sql-tests/results/postgreSQL/join.sql.out | 28 --- .../results/postgreSQL/select_having.sql.out | 3 +- .../results/postgreSQL/select_implicit.sql.out | 6 +- .../sql-tests/results/postgreSQL/union.sql.out | 3 +- .../sql-tests/results/query_regex_column.sql.out | 24 -- .../negative-cases/invalid-correlation.sql.out | 3 +- .../sql-tests/results/table-aliases.sql.out| 3 +- .../udf/postgreSQL/udf-aggregates_part1.sql.out| 3 +- .../results/udf/postgreSQL/udf-join.sql.out| 28 --- .../udf/postgreSQL/udf-select_having.sql.out | 3 +- .../udf/postgreSQL/udf-select_implicit.sql.out | 6 +- .../sql-tests/results/udf/udf-group-by.sql.out | 3 +- .../sql-tests/results/udf/udf-pivot.sql.out| 6 +- .../apache/spark/sql/DataFrameFunctionsSuite.scala | 89 -- 
.../apache/spark/sql/DataFrameSelfJoinSuite.scala | 3 +- .../apache/spark/sql/DataFrameToSchemaSuite.scala | 4 +- .../spark/sql/DataFrameWindowFunctionsSuite.scala | 9 ++- .../scala/org/apache/spark/sql/DatasetSuite.scala | 44 ++- .../org/apache/spark/sql/DatasetUnpivotSuite.scala | 9 ++- .../org/apache/spark/sql/SQLInsertTestSuite.scala | 10 ++- .../scala/org/apache/spark/sql/SubquerySuite.scala | 11 ++- .../test/scala/org/apache/spark/sql/UDFSuite.scala | 11 ++- .../spark/sql/connector/DataSourceV2SQLSuite.scala | 19 +++-- .../sql/errors/QueryCompilationErrorsSuite.scala | 11 ++- .../apache/spark/sql/execution/SQLViewSuite.scala | 6 +- .../execution/command/v2/DescribeTableSuite.scala | 6 +- .../org/apache/spark/sql/sources/InsertSuite.scala | 11 +-- .../apache/spark/sql/hive/HiveParquetSuite.scala | 9 ++- 45 files changed, 423 insertions(+), 198 deletions(-) diff --git a/core/src/main/resources/error/error-classes.json b/core/src/main/resources/error/error-classes.json index b923d5a39e0..f39ee465768 100644 ---
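The selection logic described above is: when candidate columns exist, order them by edit distance to the unresolved name and keep at most five; when none exist, raise the GENERIC subclass with no suggestion list. A small self-contained Scala sketch of that logic (not the actual QueryCompilationErrors code):

// Sketch only: a plain Levenshtein distance plus the sort-and-cap step.
def levenshtein(a: String, b: String): Int = {
  val d = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    d(i)(j) = math.min(math.min(d(i - 1)(j) + 1, d(i)(j - 1) + 1), d(i - 1)(j - 1) + cost)
  }
  d(a.length)(b.length)
}

def suggestions(unresolved: String, candidates: Seq[String]): Seq[String] =
  candidates.sortBy(c => levenshtein(unresolved.toLowerCase, c.toLowerCase)).take(5)

// suggestions("YrMo", Seq("YearAndMonth", "Year", "Month")) -> closest names first, at most five
// suggestions("YrMo", Nil) -> Nil, so the error omits the "Did you mean ..." clause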
[spark] branch branch-3.3 updated: add back a mistakenly removed test case
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new 81cb08b7b3a add back a mistakenly removed test case 81cb08b7b3a is described below commit 81cb08b7b3ae6a4ccfa9787ec39a6041fae8143f Author: Wenchen Fan AuthorDate: Wed Sep 7 19:29:39 2022 +0800 add back a mistakenly removed test case --- .../spark/sql/connector/DataSourceV2SQLSuite.scala | 24 ++ 1 file changed, 24 insertions(+) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala index 304c77fd003..44f97f55713 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala @@ -2288,6 +2288,30 @@ class DataSourceV2SQLSuite } } + test("SPARK-34561: drop/add columns to a dataset of `DESCRIBE TABLE`") { +val tbl = s"${catalogAndNamespace}tbl" +withTable(tbl) { + sql(s"CREATE TABLE $tbl (c0 INT) USING $v2Format") + val description = sql(s"DESCRIBE TABLE $tbl") + val noCommentDataset = description.drop("comment") + val expectedSchema = new StructType() +.add( + name = "col_name", + dataType = StringType, + nullable = false, + metadata = new MetadataBuilder().putString("comment", "name of the column").build()) +.add( + name = "data_type", + dataType = StringType, + nullable = false, + metadata = new MetadataBuilder().putString("comment", "data type of the column").build()) + assert(noCommentDataset.schema === expectedSchema) + val isNullDataset = noCommentDataset +.withColumn("is_null", noCommentDataset("col_name").isNull) + assert(isNullDataset.schema === expectedSchema.add("is_null", BooleanType, false)) +} + } + test("SPARK-34576: drop/add columns to a dataset of `DESCRIBE COLUMN`") { val tbl = s"${catalogAndNamespace}tbl" withTable(tbl) { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.3 updated: [SPARK-40149][SQL] Propagate metadata columns through Project
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.3 by this push: new 433469f284e [SPARK-40149][SQL] Propagate metadata columns through Project 433469f284e is described below commit 433469f284ee24150f6cff4005d39a70e91cc4d9 Author: Wenchen Fan AuthorDate: Wed Sep 7 18:45:20 2022 +0800 [SPARK-40149][SQL] Propagate metadata columns through Project This PR fixes a regression caused by https://github.com/apache/spark/pull/32017 . In https://github.com/apache/spark/pull/32017 , we tried to be more conservative and decided to not propagate metadata columns in certain operators, including `Project`. However, the decision was made only considering SQL API, not DataFrame API. In fact, it's very common to chain `Project` operators in DataFrame, e.g. `df.withColumn(...).withColumn(...)...`, and it's very inconvenient if metadata columns are not propagated through `Project`. This PR makes 2 changes: 1. Project should propagate metadata columns 2. SubqueryAlias should only propagate metadata columns if the child is a leaf node or also a SubqueryAlias The second change is needed to still forbid weird queries like `SELECT m from (SELECT a from t)`, which is the main motivation of https://github.com/apache/spark/pull/32017 . After propagating metadata columns, a problem from https://github.com/apache/spark/pull/31666 is exposed: the natural join metadata columns may confuse the analyzer and lead to wrong analyzed plan. For example, `SELECT t1.value FROM t1 LEFT JOIN t2 USING (key) ORDER BY key`, how shall we resolve `ORDER BY key`? It should be resolved to `t1.key` via the rule `ResolveMissingReferences`, which is in the output of the left join. However, if `Project` can propagate metadata columns, `ORDER [...] To solve this problem, this PR only allows qualified access for metadata columns of natural join. This has no breaking change, as people can only do qualified access for natural join metadata columns before, in the `Project` right after `Join`. This actually enables more use cases, as people can now access natural join metadata columns in ORDER BY. I've added a test for it. fix a regression For SQL API, there is no change, as a `SubqueryAlias` always comes with a `Project` or `Aggregate`, so we still don't propagate metadata columns through a SELECT group. For DataFrame API, the behavior becomes more lenient. The only breaking case is an operator that can propagate metadata columns then follows a `SubqueryAlias`, e.g. `df.filter(...).as("t").select("t.metadata_col")`. But this is a weird use case and I don't think we should support it at the first place. new tests Closes #37758 from cloud-fan/metadata. 
Authored-by: Wenchen Fan Signed-off-by: Wenchen Fan (cherry picked from commit 99ae1d9a897909990881f14c5ea70a0d1a0bf456) Signed-off-by: Wenchen Fan --- .../spark/sql/catalyst/analysis/Analyzer.scala | 8 +- .../spark/sql/catalyst/analysis/unresolved.scala | 2 +- .../spark/sql/catalyst/expressions/package.scala | 13 +- .../plans/logical/basicLogicalOperators.scala | 13 +- .../apache/spark/sql/catalyst/util/package.scala | 15 +- .../test/resources/sql-tests/inputs/using-join.sql | 2 + .../resources/sql-tests/results/using-join.sql.out | 11 + .../spark/sql/connector/DataSourceV2SQLSuite.scala | 242 - .../spark/sql/connector/MetadataColumnSuite.scala | 219 +++ 9 files changed, 263 insertions(+), 262 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index 37024e15377..3a3997ff9c7 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -967,9 +967,11 @@ class Analyzer(override val catalogManager: CatalogManager) private def addMetadataCol(plan: LogicalPlan): LogicalPlan = plan match { case s: ExposesMetadataColumns => s.withMetadataColumns() case p: Project => -p.copy( +val newProj = p.copy( projectList = p.metadataOutput ++ p.projectList, child = addMetadataCol(p.child)) +newProj.copyTagsFrom(p) +newProj case _ => plan.withNewChildren(plan.children.map(addMetadataCol)) } } @@ -3475,8 +3477,8 @@ class Analyzer(override val catalogManager: CatalogManager) val project = Project(projectList, Join(left, right, joinType, newCondition, hint)) project.setTagValue( Project.hiddenOutputTag, -
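One concrete use case the description above calls out: the hidden key columns of a USING/NATURAL join can now be referenced with a qualifier beyond the Project that immediately follows the join, for example in ORDER BY. A hedged sketch of the behavior described in the commit message (it assumes existing tables t1(key, value) and t2(key, other) and an active SparkSession `spark`):

// Sketch only: table names are assumptions; the queries mirror the commit description.
// Unqualified `key` keeps resolving to t1.key via ResolveMissingReferences, as before.
spark.sql("SELECT t1.value FROM t1 LEFT JOIN t2 USING (key) ORDER BY key").show()

// Qualified access to the join's hidden (metadata) key columns in ORDER BY is the
// newly enabled case.
spark.sql("SELECT t1.value FROM t1 LEFT JOIN t2 USING (key) ORDER BY t2.key").show()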
[spark] branch master updated (e6c58c1bd6f -> 99ae1d9a897)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from e6c58c1bd6f [SPARK-40273][PYTHON][DOCS] Fix the documents "Contributing and Maintaining Type Hints"
     add 99ae1d9a897 [SPARK-40149][SQL] Propagate metadata columns through Project

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/analysis/Analyzer.scala | 8 +-
 .../spark/sql/catalyst/analysis/unresolved.scala | 2 +-
 .../spark/sql/catalyst/expressions/package.scala | 13 +-
 .../plans/logical/basicLogicalOperators.scala | 13 +-
 .../apache/spark/sql/catalyst/util/package.scala | 15 +-
 .../test/resources/sql-tests/inputs/using-join.sql | 2 +
 .../resources/sql-tests/results/using-join.sql.out | 11 ++
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 218
 .../spark/sql/connector/MetadataColumnSuite.scala | 219 +
 9 files changed, 263 insertions(+), 238 deletions(-)
 create mode 100644 sql/core/src/test/scala/org/apache/spark/sql/connector/MetadataColumnSuite.scala

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-40273][PYTHON][DOCS] Fix the documents "Contributing and Maintaining Type Hints"
This is an automated email from the ASF dual-hosted git repository. zero323 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e6c58c1bd6f [SPARK-40273][PYTHON][DOCS] Fix the documents "Contributing and Maintaining Type Hints" e6c58c1bd6f is described below commit e6c58c1bd6f64ebfb337348fa6132c0b230dc932 Author: itholic AuthorDate: Wed Sep 7 11:29:45 2022 +0200 [SPARK-40273][PYTHON][DOCS] Fix the documents "Contributing and Maintaining Type Hints" ### What changes were proposed in this pull request? This PR proposes to fix the [Contributing and Maintaining Type Hints](https://spark.apache.org/docs/latest/api/python/development/contributing.html#contributing-and-maintaining-type-hints) since the existing type hints in the stub files are all ported into inline type hints. ### Why are the changes needed? We no longer use the stub files for type hinting, so we might need to change the documents as well. ### Does this PR introduce _any_ user-facing change? Yes, the documentation change. ### How was this patch tested? The existing documentation build should pass Closes #37724 from itholic/SPARK-40273. Authored-by: itholic Signed-off-by: zero323 --- python/docs/source/development/contributing.rst | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/python/docs/source/development/contributing.rst b/python/docs/source/development/contributing.rst index 9780d6eca4e..3d388e91012 100644 --- a/python/docs/source/development/contributing.rst +++ b/python/docs/source/development/contributing.rst @@ -155,10 +155,7 @@ Now, you can start developing and `running the tests `_. Contributing and Maintaining Type Hints -PySpark type hints are provided using stub files, placed in the same directory as the annotated module, with exception to: - -* ``# type: ignore`` in modules which don't have their own stubs (tests, examples and non-public API). -* pandas API on Spark (``pyspark.pandas`` package) where the type hints are inlined. +PySpark type hints are inlined, to take advantage of static type checking. As a rule of thumb, only public API is annotated. @@ -166,7 +163,7 @@ Annotations should, when possible: * Reflect expectations of the underlying JVM API, to help avoid type related failures outside Python interpreter. * In case of conflict between too broad (``Any``) and too narrow argument annotations, prefer the latter as one, as long as it is covering most of the typical use cases. -* Indicate nonsensical combinations of arguments using ``@overload`` annotations. For example, to indicate that ``*Col`` and ``*Cols`` arguments are mutually exclusive: +* Indicate nonsensical combinations of arguments using ``@overload`` annotations. For example, to indicate that ``*Col`` and ``*Cols`` arguments are mutually exclusive: .. code-block:: python - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org