[spark] branch branch-3.3 updated: [SPARK-39896][SQL] UnwrapCastInBinaryComparison should work when the literal of In/InSet downcast failed
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new c04aa36b771 [SPARK-39896][SQL] UnwrapCastInBinaryComparison should work when the literal of In/InSet downcast failed

c04aa36b771 is described below

commit c04aa36b7713a7ebaf368fc2ad4065478e264d85
Author: Fu Chen
AuthorDate: Wed Aug 31 13:32:17 2022 +0800

    [SPARK-39896][SQL] UnwrapCastInBinaryComparison should work when the literal of In/InSet downcast failed

    ### Why are the changes needed?

    This PR aims to fix the following case:

    ```scala
    sql("create table t1(a decimal(3, 0)) using parquet")
    sql("insert into t1 values(100), (10), (1)")
    sql("select * from t1 where a in(10, 1.00)").show
    ```

    ```
    java.lang.RuntimeException: After applying rule
    org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch
    Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.
        at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325)
    ```

    1. The rule `UnwrapCastInBinaryComparison` transforms the `In` expression into a disjunction of `EqualTo`:

       ```
       CAST(a as decimal(12,2)) IN (10.00, 1.00)
       =>
       OR(
         CAST(a as decimal(12,2)) = 10.00,
         CAST(a as decimal(12,2)) = 1.00
       )
       ```

    2. It then uses `UnwrapCastInBinaryComparison.unwrapCast()` to optimize each `EqualTo`:

       ```
       // Expression1
       CAST(a as decimal(12,2)) = 10.00   =>   CAST(a as decimal(12,2)) = 10.00
       // Expression2
       CAST(a as decimal(12,2)) = 1.00    =>   a = 1
       ```

    3. Finally, it returns a new unwrapped cast expression `In`:

       ```
       a IN (10.00, 1.00)
       ```

    Before this PR: the method `UnwrapCastInBinaryComparison.unwrapCast()` returns the original expression when the downcast to the decimal type fails (`Expression1`), but returns the unwrapped expression when the downcast succeeds (`Expression2`). The two expressions therefore have different data types, which breaks the structural integrity of the plan:

    ```
    a IN (10.00, 1.00)
         |      |
         |      decimal(3, 0)
         decimal(12, 2)
    ```

    After this PR: the expression whose literal fails to downcast is rewritten to `falseIfNotNull(fromExp)`:

    ```
    (isnull(a) AND null) OR a IN (1.00)
    ```

    ### Does this PR introduce _any_ user-facing change?

    No, only a bug fix.

    ### How was this patch tested?

    Unit test.

    Closes #37439 from cfmcgrady/SPARK-39896.

    Authored-by: Fu Chen
    Signed-off-by: Wenchen Fan
    (cherry picked from commit 6e62b93f3d1ef7e2d6be0a3bb729ab9b2d55a36d)
    Signed-off-by: Wenchen Fan
---
 .../optimizer/UnwrapCastInBinaryComparison.scala | 131 ++---
 .../UnwrapCastInBinaryComparisonSuite.scala      |  68 ---
 2 files changed, 113 insertions(+), 86 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala
index 94e27379b74..f4a92760d22 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala
@@ -17,7 +17,6 @@
 package org.apache.spark.sql.catalyst.optimizer
 
-import scala.collection.immutable.HashSet
 import scala.collection.mutable.ArrayBuffer
 
 import org.apache.spark.sql.catalyst.expressions._
@@ -145,80 +144,28 @@ object UnwrapCastInBinaryComparison extends Rule[LogicalPlan] {
     case in @ In(Cast(fromExp, toType: NumericType, _, _), list @ Seq(firstLit, _*))
       if canImplicitlyCast(fromExp, toType, firstLit.dataType) && in.inSetConvertible =>
-      // There are 3 kinds of literals in the list:
-      // 1. null literals
-      // 2. The literals that can cast to fromExp.dataType
-      // 3. The literals that cannot cast to fromExp.dataType
-      // null literals is special as we can cast null literals to any data type.
-      val (nullList, canCastList, cannotCastList) =
-        (ArrayBuffer[Literal](), ArrayBuffer[Literal](), ArrayBuffer[Expression]())
-      list.foreach {
-        case lit @ Literal(null, _) => nullList += lit
-        case lit @ NonNullLiteral(_, _) =>
-          unwrapCast(EqualTo(in.value, lit)) match {
-            case EqualTo(_, unwrapLit: Literal) => canCastList += unwrapLit
-            case e @ And(IsNull(_), Literal(null, BooleanType)) => cannotCastList += e
-            case _ => throw new
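The three-way split of the `In` list performed in the hunk above can be modeled in plain Python. This is an illustrative sketch, not Spark code: `downcast` is a hypothetical stand-in for the literal unwrapping done by `UnwrapCastInBinaryComparison.unwrapCast()` (a simplified model of narrowing to `decimal(3, 0)`), and it does not reproduce Spark's exact cast semantics.

```python
from decimal import Decimal

def downcast(lit, precision=3, scale=0):
    """Return the literal narrowed to decimal(precision, scale), or None on failure.

    Simplified model: the downcast succeeds only if no fractional digits
    are lost and the value fits in the target precision.
    """
    exp = Decimal(1).scaleb(-scale)       # smallest representable step, e.g. 1 for scale 0
    if lit != lit.quantize(exp):
        return None                       # fractional digits would be truncated
    if abs(lit) >= Decimal(10) ** (precision - scale):
        return None                       # too many integral digits for the precision
    return lit.quantize(exp)              # e.g. 1.00 -> 1

def partition_in_list(literals):
    """Split an IN-list the way the rule does: nulls, castable, non-castable."""
    null_list, can_cast, cannot_cast = [], [], []
    for lit in literals:
        if lit is None:
            null_list.append(lit)             # null casts to any data type
        elif (unwrapped := downcast(lit)) is not None:
            can_cast.append(unwrapped)        # e.g. CAST(a) = 1.00  =>  a = 1
        else:
            cannot_cast.append(lit)           # fix rewrites these to falseIfNotNull(a)
    return null_list, can_cast, cannot_cast

nulls, ok, bad = partition_in_list([Decimal("1.50"), Decimal("1.00"), None])
```

Before the fix, the entries that land in `cannot_cast` kept their original (wider) type while the rest were narrowed, producing the mixed-type `In` list the commit message describes.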
[spark] branch master updated: [SPARK-39896][SQL] UnwrapCastInBinaryComparison should work when the literal of In/InSet downcast failed
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 6e62b93f3d1 [SPARK-39896][SQL] UnwrapCastInBinaryComparison should work when the literal of In/InSet downcast failed

6e62b93f3d1 is described below

commit 6e62b93f3d1ef7e2d6be0a3bb729ab9b2d55a36d
Author: Fu Chen
AuthorDate: Wed Aug 31 13:32:17 2022 +0800

    [SPARK-39896][SQL] UnwrapCastInBinaryComparison should work when the literal of In/InSet downcast failed

    ### Why are the changes needed?

    This PR aims to fix the following case:

    ```scala
    sql("create table t1(a decimal(3, 0)) using parquet")
    sql("insert into t1 values(100), (10), (1)")
    sql("select * from t1 where a in(10, 1.00)").show
    ```

    ```
    java.lang.RuntimeException: After applying rule
    org.apache.spark.sql.catalyst.optimizer.UnwrapCastInBinaryComparison in batch
    Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.
        at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1325)
    ```

    1. The rule `UnwrapCastInBinaryComparison` transforms the `In` expression into a disjunction of `EqualTo`:

       ```
       CAST(a as decimal(12,2)) IN (10.00, 1.00)
       =>
       OR(
         CAST(a as decimal(12,2)) = 10.00,
         CAST(a as decimal(12,2)) = 1.00
       )
       ```

    2. It then uses `UnwrapCastInBinaryComparison.unwrapCast()` to optimize each `EqualTo`:

       ```
       // Expression1
       CAST(a as decimal(12,2)) = 10.00   =>   CAST(a as decimal(12,2)) = 10.00
       // Expression2
       CAST(a as decimal(12,2)) = 1.00    =>   a = 1
       ```

    3. Finally, it returns a new unwrapped cast expression `In`:

       ```
       a IN (10.00, 1.00)
       ```

    Before this PR: the method `UnwrapCastInBinaryComparison.unwrapCast()` returns the original expression when the downcast to the decimal type fails (`Expression1`), but returns the unwrapped expression when the downcast succeeds (`Expression2`). The two expressions therefore have different data types, which breaks the structural integrity of the plan:

    ```
    a IN (10.00, 1.00)
         |      |
         |      decimal(3, 0)
         decimal(12, 2)
    ```

    After this PR: the expression whose literal fails to downcast is rewritten to `falseIfNotNull(fromExp)`:

    ```
    (isnull(a) AND null) OR a IN (1.00)
    ```

    ### Does this PR introduce _any_ user-facing change?

    No, only a bug fix.

    ### How was this patch tested?

    Unit test.

    Closes #37439 from cfmcgrady/SPARK-39896.

    Authored-by: Fu Chen
    Signed-off-by: Wenchen Fan
---
 .../optimizer/UnwrapCastInBinaryComparison.scala | 131 ++---
 .../UnwrapCastInBinaryComparisonSuite.scala      |  68 ---
 2 files changed, 113 insertions(+), 86 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala
index 94e27379b74..f4a92760d22 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala
@@ -17,7 +17,6 @@
 package org.apache.spark.sql.catalyst.optimizer
 
-import scala.collection.immutable.HashSet
 import scala.collection.mutable.ArrayBuffer
 
 import org.apache.spark.sql.catalyst.expressions._
@@ -145,80 +144,28 @@ object UnwrapCastInBinaryComparison extends Rule[LogicalPlan] {
     case in @ In(Cast(fromExp, toType: NumericType, _, _), list @ Seq(firstLit, _*))
       if canImplicitlyCast(fromExp, toType, firstLit.dataType) && in.inSetConvertible =>
-      // There are 3 kinds of literals in the list:
-      // 1. null literals
-      // 2. The literals that can cast to fromExp.dataType
-      // 3. The literals that cannot cast to fromExp.dataType
-      // null literals is special as we can cast null literals to any data type.
-      val (nullList, canCastList, cannotCastList) =
-        (ArrayBuffer[Literal](), ArrayBuffer[Literal](), ArrayBuffer[Expression]())
-      list.foreach {
-        case lit @ Literal(null, _) => nullList += lit
-        case lit @ NonNullLiteral(_, _) =>
-          unwrapCast(EqualTo(in.value, lit)) match {
-            case EqualTo(_, unwrapLit: Literal) => canCastList += unwrapLit
-            case e @ And(IsNull(_), Literal(null, BooleanType)) => cannotCastList += e
-            case _ => throw new IllegalStateException("Illegal unwrap cast result found.")
-          }
-        case _ => throw new IllegalStateException("Illegal value
[spark] branch master updated: [SPARK-40271][PYTHON] Support list type for `pyspark.sql.functions.lit`
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 65d89f8e897 [SPARK-40271][PYTHON] Support list type for `pyspark.sql.functions.lit`

65d89f8e897 is described below

commit 65d89f8e897449f7f026144a76328ff545fecde2
Author: itholic
AuthorDate: Wed Aug 31 11:37:20 2022 +0800

    [SPARK-40271][PYTHON] Support list type for `pyspark.sql.functions.lit`

    ### What changes were proposed in this pull request?

    This PR proposes to support the `list` type for `pyspark.sql.functions.lit`.

    ### Why are the changes needed?

    To improve the API usability.

    ### Does this PR introduce _any_ user-facing change?

    Yes, the `list` type is now available for `pyspark.sql.functions.lit` as below:

    - Before

    ```python
    >>> spark.range(3).withColumn("c", lit([1,2,3])).show()
    Traceback (most recent call last):
    ...
    : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] The feature is not supported: Literal for '[1, 2, 3]' of class java.util.ArrayList.
        at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:302)
        at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100)
        at org.apache.spark.sql.functions$.lit(functions.scala:125)
        at org.apache.spark.sql.functions.lit(functions.scala)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        at java.base/java.lang.reflect.Method.invoke(Method.java:577)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.base/java.lang.Thread.run(Thread.java:833)
    ```

    - After

    ```python
    >>> spark.range(3).withColumn("c", lit([1,2,3])).show()
    +---+---------+
    | id|        c|
    +---+---------+
    |  0|[1, 2, 3]|
    |  1|[1, 2, 3]|
    |  2|[1, 2, 3]|
    +---+---------+
    ```

    ### How was this patch tested?

    Added doctest & unit test.

    Closes #37722 from itholic/SPARK-40271.

    Authored-by: itholic
    Signed-off-by: Ruifeng Zheng
---
 python/pyspark/sql/functions.py            | 23 +-
 python/pyspark/sql/tests/test_functions.py | 26 ++
 2 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 03c16db602f..e7a7a1b37f3 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -131,10 +131,13 @@ def lit(col: Any) -> Column:
 
     Parameters
     ----------
-    col : :class:`~pyspark.sql.Column` or Python primitive type.
+    col : :class:`~pyspark.sql.Column`, str, int, float, bool or list.
         the value to make it as a PySpark literal. If a column is passed,
         it returns the column as is.
 
+        .. versionchanged:: 3.4.0
+            Since 3.4.0, it supports the list type.
+
     Returns
    -------
     :class:`~pyspark.sql.Column`
@@ -149,8 +152,24 @@ def lit(col: Any) -> Column:
     +------+---+
     |     5|  0|
     +------+---+
+
+    Create a literal from a list.
+
+    >>> spark.range(1).select(lit([1, 2, 3])).show()
+    +--------------+
+    |array(1, 2, 3)|
+    +--------------+
+    |     [1, 2, 3]|
+    +--------------+
     """
-    return col if isinstance(col, Column) else _invoke_function("lit", col)
+    if isinstance(col, Column):
+        return col
+    elif isinstance(col, list):
+        if any(isinstance(c, Column) for c in col):
+            raise ValueError("lit does not allow for list of Columns")
+        return array(*[lit(item) for item in col])
+    else:
+        return _invoke_function("lit", col)
 
 
 def col(col: str) -> Column:
diff --git a/python/pyspark/sql/tests/test_functions.py b/python/pyspark/sql/tests/test_functions.py
index 102ebef8317..1d02a540558 100644
--- a/python/pyspark/sql/tests/test_functions.py
+++ b/python/pyspark/sql/tests/test_functions.py
@@ -962,6 +962,32 @@ class FunctionsTests(ReusedSQLTestCase):
         actual = self.spark.range(1).select(lit(td)).first()[0]
         self.assertEqual(actual, td)
 
+    def test_lit_list(self):
+        # SPARK-40271: added list type supporting
+        test_list = [1, 2, 3]
+        expected =
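The new dispatch in `lit` can be sketched without a running Spark cluster. In this self-contained model, `Column`, `array`, and `_invoke_function` are hypothetical stand-ins for the PySpark objects of the same names; only the branching logic mirrors the diff above.

```python
class Column:
    """Stand-in for pyspark.sql.Column: wraps a textual expression description."""
    def __init__(self, expr):
        self.expr = expr

def _invoke_function(name, value):
    # Stand-in for the py4j call into org.apache.spark.sql.functions
    return Column(f"{name}({value!r})")

def array(*cols):
    # Stand-in for pyspark.sql.functions.array
    return Column("array(" + ", ".join(c.expr for c in cols) + ")")

def lit(col):
    """Mirrors the new dispatch: Column pass-through, list -> array(lit(...))."""
    if isinstance(col, Column):
        return col                                      # a Column is returned as is
    elif isinstance(col, list):
        if any(isinstance(c, Column) for c in col):
            raise ValueError("lit does not allow for list of Columns")
        return array(*[lit(item) for item in col])      # recurse: handles nested lists
    else:
        return _invoke_function("lit", col)

print(lit([1, 2, 3]).expr)  # -> array(lit(1), lit(2), lit(3))
```

The recursion is the interesting design choice: because `lit` calls itself on each element, nested lists fall out of the same three branches as flat ones, while a `Column` hidden inside a list is rejected explicitly rather than silently mis-serialized.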
[spark] branch master updated (296fe49ec85 -> c2d4c4ed4c5)
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 296fe49ec85 [SPARK-40260][SQL] Use error classes in the compilation errors of GROUP BY a position
     add c2d4c4ed4c5 [SPARK-40256][BUILD][K8S] Switch base image from openjdk to eclipse-temurin

No new revisions were added by this update.

Summary of changes:
 bin/docker-image-tool.sh                                    | 4 ++--
 .../kubernetes/docker/src/main/dockerfiles/spark/Dockerfile | 5 ++---
 .../integration-tests/scripts/setup-integration-test-env.sh | 2 +-
 3 files changed, 5 insertions(+), 6 deletions(-)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-40260][SQL] Use error classes in the compilation errors of GROUP BY a position
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 296fe49ec85 [SPARK-40260][SQL] Use error classes in the compilation errors of GROUP BY a position

296fe49ec85 is described below

commit 296fe49ec855ac8c15c080e7bab6d519fe504bd3
Author: Max Gekk
AuthorDate: Tue Aug 30 20:43:19 2022 +0300

    [SPARK-40260][SQL] Use error classes in the compilation errors of GROUP BY a position

    ### What changes were proposed in this pull request?

    In the PR, I propose the following new error classes:
    - GROUP_BY_POS_OUT_OF_RANGE
    - GROUP_BY_POS_REFERS_AGG_EXPR

    and migrate 2 compilation exceptions related to GROUP BY a position onto them.

    ### Why are the changes needed?

    The migration onto error classes makes the errors searchable in docs, and allows editing error text messages w/o modifying the source code.

    ### Does this PR introduce _any_ user-facing change?

    Yes, in some sense, because it modifies user-facing error messages.

    ### How was this patch tested?

    By running the affected test suites:
    ```
    $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
    $ build/sbt "core/testOnly *SparkThrowableSuite"
    ```

    Closes #37712 from MaxGekk/group-ref-agg-error.

    Lead-authored-by: Max Gekk
    Co-authored-by: Maxim Gekk
    Signed-off-by: Max Gekk
---
 core/src/main/resources/error/error-classes.json   | 12
 .../org/apache/spark/sql/AnalysisException.scala   |  2 +-
 .../spark/sql/errors/QueryCompilationErrors.scala  | 11 +--
 .../sql-tests/results/group-by-ordinal.sql.out     | 81 +++---
 .../results/postgreSQL/select_implicit.sql.out     |  9 ++-
 .../udf/postgreSQL/udf-select_implicit.sql.out     |  9 ++-
 6 files changed, 107 insertions(+), 17 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json b/core/src/main/resources/error/error-classes.json
index 816df79e508..df0f887a63c 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -136,6 +136,18 @@
     "message" : [
       "Grouping sets size cannot be greater than <maxSize>"
     ]
   },
+  "GROUP_BY_POS_OUT_OF_RANGE" : {
+    "message" : [
+      "GROUP BY position <index> is not in select list (valid range is [1, <size>])."
+    ],
+    "sqlState" : "42000"
+  },
+  "GROUP_BY_POS_REFERS_AGG_EXPR" : {
+    "message" : [
+      "GROUP BY <index> refers to an expression <aggExpr> that contains an aggregate function. Aggregate functions are not allowed in GROUP BY."
+    ],
+    "sqlState" : "42000"
+  },
   "INCOMPARABLE_PIVOT_COLUMN" : {
     "message" : [
       "Invalid pivot column <columnName>. Pivot columns must be comparable."
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala
index 9ab0b223e11..48e1f91990b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala
@@ -100,7 +100,7 @@ class AnalysisException protected[sql] (
       line = origin.line,
       startPosition = origin.startPosition,
       errorClass = Some(errorClass),
-      errorSubClass = Some(errorSubClass),
+      errorSubClass = Option(errorSubClass),
       messageParameters = messageParameters)
 
   def copy(
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
index 20c3c81b250..7458e201be2 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
@@ -366,14 +366,15 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase {
   def groupByPositionRefersToAggregateFunctionError(
       index: Int,
       expr: Expression): Throwable = {
-    new AnalysisException(s"GROUP BY $index refers to an expression that is or contains " +
-      "an aggregate function. Aggregate functions are not allowed in GROUP BY, " +
-      s"but got ${expr.sql}")
+    new AnalysisException(
+      errorClass = "GROUP_BY_POS_REFERS_AGG_EXPR",
+      messageParameters = Array(index.toString, expr.sql))
   }
 
   def groupByPositionRangeError(index: Int, size: Int): Throwable = {
-    new AnalysisException(s"GROUP BY position $index is not in select list " +
-      s"(valid range is [1, $size])")
+    new AnalysisException(
+      errorClass = "GROUP_BY_POS_OUT_OF_RANGE",
+      messageParameters = Array(index.toString, size.toString))
   }
 
   def generatorNotExpectedError(name: FunctionIdentifier, classCanonicalName: String): Throwable = {
diff --git
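The mechanism this commit migrates to, a JSON template whose named `<placeholder>` tags are filled from a positional parameter array, can be sketched as follows. This is an illustrative model: the class name and message text mirror the diff above, but `render_error` is a hypothetical helper, not Spark's actual `SparkThrowableHelper`.

```python
import re

# Mirrors the shape of core/src/main/resources/error/error-classes.json
ERROR_CLASSES = {
    "GROUP_BY_POS_OUT_OF_RANGE": {
        "message": ["GROUP BY position <index> is not in select list "
                    "(valid range is [1, <size>])."],
        "sqlState": "42000",
    },
}

def render_error(error_class, parameters):
    """Fill the <placeholder> tags of a message template from positional parameters."""
    template = "".join(ERROR_CLASSES[error_class]["message"])
    params = iter(parameters)
    # Each <tag> is replaced by the next parameter, in order, matching how
    # messageParameters = Array(index.toString, size.toString) is consumed.
    body = re.sub(r"<[a-zA-Z]+>", lambda _: next(params), template)
    return f"[{error_class}] {body}"

msg = render_error("GROUP_BY_POS_OUT_OF_RANGE", ["5", "3"])
# -> "[GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position 5 is not in select list (valid range is [1, 3])."
```

Keeping the text in a data file rather than in Scala string interpolation is what makes the messages searchable and editable without touching source code, as the commit message notes.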
[spark] branch master updated (64dd81d97da -> cee30cb4994)
This is an automated email from the ASF dual-hosted git repository.

huaxingao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 64dd81d97da [SPARK-40056][BUILD] Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9
     add cee30cb4994 [SPARK-40113][SQL] Reactor ParquetScanBuilder DataSourceV2 interface implementations

No new revisions were added by this update.

Summary of changes:
 .../v2/parquet/ParquetScanBuilder.scala | 26 --
 1 file changed, 9 insertions(+), 17 deletions(-)
[spark] branch master updated (c5e93eb6631 -> 64dd81d97da)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from c5e93eb6631 [SPARK-40207][SQL] Specify the column name when the data type is not supported by datasource
     add 64dd81d97da [SPARK-40056][BUILD] Upgrade mvn-scalafmt from 1.0.4 to 1.1.1640084764.9f463a9

No new revisions were added by this update.

Summary of changes:
 dev/.scalafmt.conf | 11 +--
 dev/scalafmt       |  2 +-
 pom.xml            |  2 +-
 3 files changed, 11 insertions(+), 4 deletions(-)
[spark] branch master updated: [SPARK-40207][SQL] Specify the column name when the data type is not supported by datasource
This is an automated email from the ASF dual-hosted git repository.

yumwang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new c5e93eb6631 [SPARK-40207][SQL] Specify the column name when the data type is not supported by datasource

c5e93eb6631 is described below

commit c5e93eb6631b2620908cef0fa3b100f6e3569d68
Author: yikf
AuthorDate: Tue Aug 30 19:05:23 2022 +0800

    [SPARK-40207][SQL] Specify the column name when the data type is not supported by datasource

    ### What changes were proposed in this pull request?

    Currently, if a data type is not supported by the data source, the exception message thrown does not contain the column name, which makes the problem harder to locate. This is a minor fix that specifies the column name when the data type is not supported by the datasource.

    ### Why are the changes needed?

    More explicit error messages.

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    GA

    Closes #37574 from Yikf/void-type.

    Authored-by: yikf
    Signed-off-by: Yuming Wang
---
 .../org/apache/spark/sql/avro/AvroSuite.scala      |  6 +-
 .../spark/sql/errors/QueryCompilationErrors.scala  |  5 +-
 .../spark/sql/FileBasedDataSourceSuite.scala       | 68 +-
 .../spark/sql/hive/execution/HiveDDLSuite.scala    |  4 +-
 .../spark/sql/hive/orc/HiveOrcSourceSuite.scala    | 16 +++--
 5 files changed, 60 insertions(+), 39 deletions(-)

diff --git a/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala b/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
index bfdeb11fd8a..5e161948932 100644
--- a/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
+++ b/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
@@ -1181,14 +1181,16 @@ abstract class AvroSuite
         sql("select interval 1 days").write.format("avro").mode("overwrite").save(tempDir)
       }.getMessage
       assert(msg.contains("Cannot save interval data type into external storage.") ||
-        msg.contains("AVRO data source does not support interval data type."))
+        msg.contains("Column `INTERVAL '1' DAY` has a data type of interval day, " +
+          "which is not supported by Avro."))
 
       msg = intercept[AnalysisException] {
         spark.udf.register("testType", () => new IntervalData())
         sql("select testType()").write.format("avro").mode("overwrite").save(tempDir)
       }.getMessage
       assert(msg.toLowerCase(Locale.ROOT)
-        .contains(s"avro data source does not support interval data type."))
+        .contains("column `testtype()` has a data type of interval, " +
+          "which is not supported by avro."))
     }
   }
 }
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
index 834e0e6b214..20c3c81b250 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
@@ -1190,8 +1190,9 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase {
   }
 
   def dataTypeUnsupportedByDataSourceError(format: String, field: StructField): Throwable = {
-    new AnalysisException(
-      s"$format data source does not support ${field.dataType.catalogString} data type.")
+    new AnalysisException(s"Column `${field.name}` has a data type of " +
+      s"${field.dataType.catalogString}, which is not supported by $format."
+    )
   }
 
   def failToResolveDataSourceForTableError(table: CatalogTable, key: String): Throwable = {
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
index 15d367dba88..98cb54ccbbc 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
@@ -257,38 +257,44 @@ class FileBasedDataSourceSuite extends QueryTest
   // Text file format only supports string type
   test("SPARK-24691 error handling for unsupported types - text") {
     withTempDir { dir =>
+      def validateErrorMessage(msg: String, column: String, dt: String, format: String): Unit = {
+        val excepted = s"Column `$column` has a data type of $dt, " +
+          s"which is not supported by $format."
+        assert(msg.contains(excepted))
+      }
+
       // write path
       val textDir = new File(dir, "text").getCanonicalPath
       var msg = intercept[AnalysisException] {
         Seq(1).toDF.write.text(textDir)
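The before/after difference in the error text can be shown side by side in a small sketch. The function names here are illustrative, not Spark's API; only the two message shapes come from the diff above.

```python
def old_message(format_name, catalog_string):
    # Pre-SPARK-40207: no column name, so the offending column is hard to locate
    return f"{format_name} data source does not support {catalog_string} data type."

def new_message(format_name, column, catalog_string):
    # Post-SPARK-40207: the column name is included in backticks
    return (f"Column `{column}` has a data type of {catalog_string}, "
            f"which is not supported by {format_name}.")

print(old_message("Text", "int"))
# -> Text data source does not support int data type.
print(new_message("Text", "value", "int"))
# -> Column `value` has a data type of int, which is not supported by Text.
```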
[spark] branch master updated: [SPARK-40156][SQL][TESTS][FOLLOW-UP] `url_decode()` should return an error class
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 7a5ce219d0d [SPARK-40156][SQL][TESTS][FOLLOW-UP] `url_decode()` should return an error class

7a5ce219d0d is described below

commit 7a5ce219d0d755235931bb1e36e0195884f58588
Author: zhiming she <505306...@qq.com>
AuthorDate: Tue Aug 30 11:14:30 2022 +0300

    [SPARK-40156][SQL][TESTS][FOLLOW-UP] `url_decode()` should return an error class

    ### What changes were proposed in this pull request?

    `url_decode()` should return an error class when the input parameter is invalid, e.g.:

    ```
    spark.sql("SELECT url_decode('http%3A%2F%2spark.apache.org')").show
    ```

    output:

    ```
    org.apache.spark.SparkIllegalArgumentException: [CANNOT_DECODE_URL] Cannot decode url : http%3A%2F%2spark.apache.org.
    ```

    ### Why are the changes needed?

    To improve the user experience w/ Spark SQL.

    ### Does this PR introduce _any_ user-facing change?

    no

    ### How was this patch tested?

    ```
    $ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z url-functions.sql"
    ```

    Closes #37709 from ming95/SPARK-40156-test.

    Authored-by: zhiming she <505306...@qq.com>
    Signed-off-by: Max Gekk
---
 .../main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
index d8d6139e919..8cb31f45c25 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
@@ -325,8 +325,7 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase {
       s"If necessary set ${SQLConf.ANSI_ENABLED.key} to false to bypass this error.", e)
   }
 
-  def illegalUrlError(url: UTF8String):
-    Throwable with SparkThrowable = {
+  def illegalUrlError(url: UTF8String): Throwable = {
     new SparkIllegalArgumentException(errorClass = "CANNOT_DECODE_URL",
       messageParameters = Array(url.toString)
     )
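Why does `'http%3A%2F%2spark.apache.org'` fail to decode? The third escape is `%2s`, and `s` is not a hex digit, so a strict percent-decoder (like Java's `URLDecoder`, which backs `url_decode()`) must reject it. A minimal pure-Python sketch of that behavior, written from the rules of percent-encoding rather than Spark's implementation:

```python
def strict_url_decode(s):
    """Decode percent-escapes, raising on malformed ones.

    A '%' must be followed by exactly two hex digits; anything else is an error,
    which is what produces the CANNOT_DECODE_URL message in the example above.
    """
    out, i = [], 0
    while i < len(s):
        c = s[i]
        if c == "%":
            hex_part = s[i + 1:i + 3]
            if len(hex_part) != 2 or not all(h in "0123456789abcdefABCDEF" for h in hex_part):
                raise ValueError(f"Cannot decode url : {s}")
            out.append(chr(int(hex_part, 16)))
            i += 3
        elif c == "+":
            out.append(" ")  # x-www-form-urlencoded treats '+' as a space
            i += 1
        else:
            out.append(c)
            i += 1
    return "".join(out)

print(strict_url_decode("http%3A%2F%2Fspark.apache.org"))  # http://spark.apache.org

try:
    strict_url_decode("http%3A%2F%2spark.apache.org")  # '%2s' is malformed
except ValueError as e:
    print(e)  # Cannot decode url : http%3A%2F%2spark.apache.org
```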
[spark] branch master updated (6deb8aa386a -> 44213742315)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 6deb8aa386a [SPARK-39616][FOLLOWUP] Delete the license of fommil-netlib
     add 44213742315 [SPARK-40266][DOCS] Datatype Integer instead of Long in quick-start docs

No new revisions were added by this update.

Summary of changes:
 docs/quick-start.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated: [SPARK-39616][FOLLOWUP] Delete the license of fommil-netlib
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 6deb8aa386a [SPARK-39616][FOLLOWUP] Delete the license of fommil-netlib

6deb8aa386a is described below

commit 6deb8aa386a9c3e16abd58a189976709675c2341
Author: Ruifeng Zheng
AuthorDate: Tue Aug 30 16:35:06 2022 +0900

    [SPARK-39616][FOLLOWUP] Delete the license of fommil-netlib

    ### What changes were proposed in this pull request?

    Delete the license of fommil-netlib.

    ### Why are the changes needed?

    After upgrading breeze to 2.0 in https://github.com/apache/spark/pull/37002, fommil-netlib no longer exists in the Spark build.

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    Existing test suites.

    Closes #37717 from zhengruifeng/del_netlib_license.

    Authored-by: Ruifeng Zheng
    Signed-off-by: Hyukjin Kwon
---
 licenses-binary/LICENSE-netlib.txt | 49 --------------------------------------
 1 file changed, 49 deletions(-)

diff --git a/licenses-binary/LICENSE-netlib.txt b/licenses-binary/LICENSE-netlib.txt
deleted file mode 100644
index 75783ed6bc3..000
--- a/licenses-binary/LICENSE-netlib.txt
+++ /dev/null
@@ -1,49 +0,0 @@
-Copyright (c) 2013 Samuel Halliday
-Copyright (c) 1992-2011 The University of Tennessee and The University
-                        of Tennessee Research Foundation. All rights reserved.
-Copyright (c) 2000-2011 The University of California Berkeley. All rights reserved.
-Copyright (c) 2006-2011 The University of Colorado Denver. All rights reserved.
-
-$COPYRIGHT$
-
-Additional copyrights may follow
-
-$HEADER$
-
-Redistribution and use in source and binary forms, with or without
-modification, are permitted provided that the following conditions are
-met:
-
-- Redistributions of source code must retain the above copyright
-  notice, this list of conditions and the following disclaimer.
-
-- Redistributions in binary form must reproduce the above copyright
-  notice, this list of conditions and the following disclaimer listed
-  in this license in the documentation and/or other materials
-  provided with the distribution.
-
-- Neither the name of the copyright holders nor the names of its
-  contributors may be used to endorse or promote products derived from
-  this software without specific prior written permission.
-
-The copyright holders provide no reassurances that the source code
-provided does not infringe any patent, copyright, or any other
-intellectual property rights of third parties. The copyright holders
-disclaim any liability to any recipient for claims brought against
-recipient by any third party for infringement of that parties
-intellectual property rights.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
-"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
-LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
-A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
-OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
-SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
-LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
-DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
-THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
-OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
\ No newline at end of file
[spark] branch branch-3.2 updated: [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 58375a86e6f [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style

58375a86e6f is described below

commit 58375a86e6ff49c5bcee49939fbd98eb848ae59f
Author: Hyukjin Kwon
AuthorDate: Tue Aug 30 16:25:26 2022 +0900

    [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style

    ### What changes were proposed in this pull request?

    This PR makes the `compute.max_rows` option work when set to `None` in `DataFrame.style`, as expected instead of throwing an exception, by collecting everything into a pandas DataFrame.

    ### Why are the changes needed?

    To make the configuration work as expected.

    ### Does this PR introduce _any_ user-facing change?

    Yes.

    ```python
    import pyspark.pandas as ps

    ps.set_option("compute.max_rows", None)
    ps.get_option("compute.max_rows")
    ps.range(1).style
    ```

    **Before:**

    ```
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../spark/python/pyspark/pandas/frame.py", line 3656, in style
        pdf = self.head(max_results + 1)._to_internal_pandas()
    TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
    ```

    **After:**

    ```
    ```

    ### How was this patch tested?

    Manually tested, and a unit test was added.

    Closes #37718 from HyukjinKwon/SPARK-40270.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 0f0e8cc26b6c80cc179368e3009d4d6c88103a64)
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/pandas/frame.py                | 16 +++++++++-------
 python/pyspark/pandas/tests/test_dataframe.py | 23 +++++++++++++++++++++++
 2 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index efc677b33ce..fd112357bdd 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -3503,19 +3503,21 @@ defaultdict(, {'col..., 'col...})]
         Property returning a Styler object containing methods for
         building a styled HTML representation for the DataFrame.
 
-        .. note:: currently it collects top 1000 rows and return its
-            pandas `pandas.io.formats.style.Styler` instance.
-
         Examples
         >>> ps.range(1001).style  # doctest: +SKIP
         """
         max_results = get_option("compute.max_rows")
-        pdf = self.head(max_results + 1)._to_internal_pandas()
-        if len(pdf) > max_results:
-            warnings.warn("'style' property will only use top %s rows." % max_results, UserWarning)
-        return pdf.head(max_results).style
+        if max_results is not None:
+            pdf = self.head(max_results + 1)._to_internal_pandas()
+            if len(pdf) > max_results:
+                warnings.warn(
+                    "'style' property will only use top %s rows." % max_results, UserWarning
+                )
+            return pdf.head(max_results).style
+        else:
+            return self._to_internal_pandas().style
 
     def set_index(
         self,
diff --git a/python/pyspark/pandas/tests/test_dataframe.py b/python/pyspark/pandas/tests/test_dataframe.py
index 27c670026d0..b4187d59ae7 100644
--- a/python/pyspark/pandas/tests/test_dataframe.py
+++ b/python/pyspark/pandas/tests/test_dataframe.py
@@ -5774,6 +5774,29 @@ class DataFrameTest(PandasOnSparkTestCase, SQLTestUtils):
         for value_psdf, value_pdf in zip(psdf, pdf):
             self.assert_eq(value_psdf, value_pdf)
 
+    def test_style(self):
+        # Currently, the `style` function returns a pandas object `Styler` as it is,
+        # processing only the number of rows declared in `compute.max_rows`.
+        # So it's a bit vague to test, but we are doing minimal tests instead of not testing at all.
+        pdf = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])
+        psdf = ps.from_pandas(pdf)
+
+        def style_negative(v, props=""):
+            return props if v < 0 else None
+
+        def check_style():
+            # If the value is negative, the text color will be displayed as red.
+            pdf_style = pdf.style.applymap(style_negative, props="color:red;")
+            psdf_style = psdf.style.applymap(style_negative, props="color:red;")
+
+            # Test whether the same shape as pandas table is created including the color.
+            self.assert_eq(pdf_style.to_latex(), psdf_style.to_latex())
+
+        check_style()
+
+        with ps.option_context("compute.max_rows", None):
+            check_style()
+
 if __name__ == "__main__":
     from pyspark.pandas.tests.test_dataframe import *  # noqa: F401

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.3 updated: [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new e46d2e2d476 [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style

e46d2e2d476 is described below

commit e46d2e2d476e85024f1c53fdaf446fdd2e293d28
Author: Hyukjin Kwon
AuthorDate: Tue Aug 30 16:25:26 2022 +0900

    [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style

    ### What changes were proposed in this pull request?

    This PR makes the `compute.max_rows` option work when set to `None` in `DataFrame.style`, as expected instead of throwing an exception, by collecting everything into a pandas DataFrame.

    ### Why are the changes needed?

    To make the configuration work as expected.

    ### Does this PR introduce _any_ user-facing change?

    Yes.

    ```python
    import pyspark.pandas as ps

    ps.set_option("compute.max_rows", None)
    ps.get_option("compute.max_rows")
    ps.range(1).style
    ```

    **Before:**

    ```
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../spark/python/pyspark/pandas/frame.py", line 3656, in style
        pdf = self.head(max_results + 1)._to_internal_pandas()
    TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
    ```

    **After:**

    ```
    ```

    ### How was this patch tested?

    Manually tested, and a unit test was added.

    Closes #37718 from HyukjinKwon/SPARK-40270.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 0f0e8cc26b6c80cc179368e3009d4d6c88103a64)
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/pandas/frame.py                | 16 +++++++++-------
 python/pyspark/pandas/tests/test_dataframe.py | 23 +++++++++++++++++++++++
 2 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index 6e8f69ad6e7..e9c5cbb9c1e 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -3459,19 +3459,21 @@ defaultdict(, {'col..., 'col...})]
         Property returning a Styler object containing methods for
         building a styled HTML representation for the DataFrame.
 
-        .. note:: currently it collects top 1000 rows and return its
-            pandas `pandas.io.formats.style.Styler` instance.
-
         Examples
         >>> ps.range(1001).style  # doctest: +SKIP
         """
         max_results = get_option("compute.max_rows")
-        pdf = self.head(max_results + 1)._to_internal_pandas()
-        if len(pdf) > max_results:
-            warnings.warn("'style' property will only use top %s rows." % max_results, UserWarning)
-        return pdf.head(max_results).style
+        if max_results is not None:
+            pdf = self.head(max_results + 1)._to_internal_pandas()
+            if len(pdf) > max_results:
+                warnings.warn(
+                    "'style' property will only use top %s rows." % max_results, UserWarning
+                )
+            return pdf.head(max_results).style
+        else:
+            return self._to_internal_pandas().style
 
     def set_index(
         self,
diff --git a/python/pyspark/pandas/tests/test_dataframe.py b/python/pyspark/pandas/tests/test_dataframe.py
index 1cc03bf06f8..0a7eda77564 100644
--- a/python/pyspark/pandas/tests/test_dataframe.py
+++ b/python/pyspark/pandas/tests/test_dataframe.py
@@ -6375,6 +6375,29 @@ class DataFrameTest(ComparisonTestBase, SQLTestUtils):
         psdf = ps.from_pandas(pdf)
         self.assert_eq(pdf.cov(), psdf.cov())
 
+    def test_style(self):
+        # Currently, the `style` function returns a pandas object `Styler` as it is,
+        # processing only the number of rows declared in `compute.max_rows`.
+        # So it's a bit vague to test, but we are doing minimal tests instead of not testing at all.
+        pdf = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])
+        psdf = ps.from_pandas(pdf)
+
+        def style_negative(v, props=""):
+            return props if v < 0 else None
+
+        def check_style():
+            # If the value is negative, the text color will be displayed as red.
+            pdf_style = pdf.style.applymap(style_negative, props="color:red;")
+            psdf_style = psdf.style.applymap(style_negative, props="color:red;")
+
+            # Test whether the same shape as pandas table is created including the color.
+            self.assert_eq(pdf_style.to_latex(), psdf_style.to_latex())
+
+        check_style()
+
+        with ps.option_context("compute.max_rows", None):
+            check_style()
+
 if __name__ == "__main__":
     from pyspark.pandas.tests.test_dataframe import *  # noqa: F401
[spark] branch master updated: [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0f0e8cc26b6 [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style

0f0e8cc26b6 is described below

commit 0f0e8cc26b6c80cc179368e3009d4d6c88103a64
Author: Hyukjin Kwon
AuthorDate: Tue Aug 30 16:25:26 2022 +0900

    [SPARK-40270][PS] Make 'compute.max_rows' as None working in DataFrame.style

    ### What changes were proposed in this pull request?

    This PR makes the `compute.max_rows` option work when set to `None` in `DataFrame.style`, as expected instead of throwing an exception, by collecting everything into a pandas DataFrame.

    ### Why are the changes needed?

    To make the configuration work as expected.

    ### Does this PR introduce _any_ user-facing change?

    Yes.

    ```python
    import pyspark.pandas as ps

    ps.set_option("compute.max_rows", None)
    ps.get_option("compute.max_rows")
    ps.range(1).style
    ```

    **Before:**

    ```
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../spark/python/pyspark/pandas/frame.py", line 3656, in style
        pdf = self.head(max_results + 1)._to_internal_pandas()
    TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
    ```

    **After:**

    ```
    ```

    ### How was this patch tested?

    Manually tested, and a unit test was added.

    Closes #37718 from HyukjinKwon/SPARK-40270.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/pandas/frame.py                | 16 +++++++++-------
 python/pyspark/pandas/tests/test_dataframe.py | 16 +++++++++++-----
 2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index ba5df94c86c..8fc425e88a3 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -3754,19 +3754,21 @@ defaultdict(, {'col..., 'col...})]
         Property returning a Styler object containing methods for
         building a styled HTML representation for the DataFrame.
 
-        .. note:: currently it collects top 1000 rows and return its
-            pandas `pandas.io.formats.style.Styler` instance.
-
         Examples
         >>> ps.range(1001).style  # doctest: +SKIP
         """
         max_results = get_option("compute.max_rows")
-        pdf = self.head(max_results + 1)._to_internal_pandas()
-        if len(pdf) > max_results:
-            warnings.warn("'style' property will only use top %s rows." % max_results, UserWarning)
-        return pdf.head(max_results).style
+        if max_results is not None:
+            pdf = self.head(max_results + 1)._to_internal_pandas()
+            if len(pdf) > max_results:
+                warnings.warn(
+                    "'style' property will only use top %s rows." % max_results, UserWarning
+                )
+            return pdf.head(max_results).style
+        else:
+            return self._to_internal_pandas().style
 
     def set_index(
         self,
diff --git a/python/pyspark/pandas/tests/test_dataframe.py b/python/pyspark/pandas/tests/test_dataframe.py
index 2ab908fed00..34480152f8c 100644
--- a/python/pyspark/pandas/tests/test_dataframe.py
+++ b/python/pyspark/pandas/tests/test_dataframe.py
@@ -6904,12 +6904,18 @@ class DataFrameTest(ComparisonTestBase, SQLTestUtils):
         def style_negative(v, props=""):
             return props if v < 0 else None
 
-        # If the value is negative, the text color will be displayed as red.
-        pdf_style = pdf.style.applymap(style_negative, props="color:red;")
-        psdf_style = psdf.style.applymap(style_negative, props="color:red;")
+        def check_style():
+            # If the value is negative, the text color will be displayed as red.
+            pdf_style = pdf.style.applymap(style_negative, props="color:red;")
+            psdf_style = psdf.style.applymap(style_negative, props="color:red;")
 
-        # Test whether the same shape as pandas table is created including the color.
-        self.assert_eq(pdf_style.to_latex(), psdf_style.to_latex())
+            # Test whether the same shape as pandas table is created including the color.
+            self.assert_eq(pdf_style.to_latex(), psdf_style.to_latex())
+
+        check_style()
+
+        with ps.option_context("compute.max_rows", None):
+            check_style()
 
 if __name__ == "__main__":
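Stripped of the pandas-on-Spark plumbing, the patched `style` logic reduces to the row-selection sketch below. This is an illustration only: `rows_for_style` and the plain-list input are stand-ins for the real property, which calls `self.head(max_results + 1)._to_internal_pandas()` and returns `pdf.style`.

```python
import warnings


def rows_for_style(rows, max_results):
    """Pick the rows the `style` property would render.

    `max_results` models the 'compute.max_rows' option; None means unlimited.
    """
    if max_results is not None:
        # Fetch one extra row so we can tell whether truncation happened.
        head = rows[: max_results + 1]
        if len(head) > max_results:
            warnings.warn(
                "'style' property will only use top %s rows." % max_results,
                UserWarning,
            )
            return head[:max_results]
        return head
    # The fix: with None, use everything instead of computing None + 1,
    # which is what raised the TypeError before this commit.
    return rows


all_rows = list(range(5))
assert rows_for_style(all_rows, None) == all_rows  # previously a TypeError
```

The key point is simply that the `max_results + 1` arithmetic is now guarded by the `is not None` check, so the unlimited case takes a separate branch.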
[spark] branch master updated: [SPARK-39915][SQL] Dataset.repartition(N) may not create N partitions Non-AQE part
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new ff7ab34a695 [SPARK-39915][SQL] Dataset.repartition(N) may not create N partitions Non-AQE part

ff7ab34a695 is described below

commit ff7ab34a6957965d74504ed1d36de21d1ad7319c
Author: ulysses-you
AuthorDate: Tue Aug 30 14:30:41 2022 +0800

    [SPARK-39915][SQL] Dataset.repartition(N) may not create N partitions Non-AQE part

    ### What changes were proposed in this pull request?

    Skip optimizing the root user-specified repartition in `PropagateEmptyRelation`.

    ### Why are the changes needed?

    Spark should preserve the final repartition, which is user-specified and determines the final output partitioning. For example:

    ```scala
    spark.sql("select * from values(1) where 1 < rand()").repartition(1)

    // before:
    == Optimized Logical Plan ==
    LocalTableScan <empty>, [col1#0]

    // after:
    == Optimized Logical Plan ==
    Repartition 1, true
    +- LocalRelation <empty>, [col1#0]
    ```

    ### Does this PR introduce _any_ user-facing change?

    Yes, the optimized plan for an empty relation may change.

    ### How was this patch tested?

    Added a test.

    Closes #37706 from ulysses-you/empty.

    Authored-by: ulysses-you
    Signed-off-by: Wenchen Fan
---
 .../apache/spark/sql/catalyst/dsl/package.scala    |  3 +++
 .../optimizer/PropagateEmptyRelation.scala         | 42 +++++++++++++++++++++++++++++++++++++++---
 .../optimizer/PropagateEmptyRelationSuite.scala    | 38 ++++++++++++++++++++++++++++++++++++++
 .../adaptive/AQEPropagateEmptyRelation.scala       |  2 +-
 .../org/apache/spark/sql/DataFrameSuite.scala      |  7 +++++++
 5 files changed, 88 insertions(+), 4 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
index 118d3e85b71..86d85abc6f3 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
@@ -501,6 +501,9 @@ package object dsl {
     def repartition(num: Integer): LogicalPlan =
       Repartition(num, shuffle = true, logicalPlan)
 
+    def repartition(): LogicalPlan =
+      RepartitionByExpression(Seq.empty, logicalPlan, None)
+
     def distribute(exprs: Expression*)(n: Int): LogicalPlan =
       RepartitionByExpression(exprs, logicalPlan, numPartitions = n)
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala
index 18c344f10f6..9e864d036ef 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala
@@ -23,6 +23,7 @@ import org.apache.spark.sql.catalyst.expressions.Literal.FalseLiteral
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.trees.TreeNodeTag
 import org.apache.spark.sql.catalyst.trees.TreePattern.{LOCAL_RELATION, TRUE_OR_FALSE_LITERAL}
 
 /**
@@ -44,6 +45,9 @@ import org.apache.spark.sql.catalyst.trees.TreePattern.{LOCAL_RELATION, TRUE_OR_FALSE_LITERAL}
  * - Generate(Explode) with all empty children. Others like Hive UDTF may return results.
  */
 abstract class PropagateEmptyRelationBase extends Rule[LogicalPlan] with CastSupport {
+  // This tag is used to mark a repartition as a root repartition which is user-specified
+  private[sql] val ROOT_REPARTITION = TreeNodeTag[Unit]("ROOT_REPARTITION")
+
   protected def isEmpty(plan: LogicalPlan): Boolean = plan match {
     case p: LocalRelation => p.data.isEmpty
     case _ => false
@@ -137,8 +141,13 @@ abstract class PropagateEmptyRelationBase extends Rule[LogicalPlan] with CastSupport {
       case _: GlobalLimit if !p.isStreaming => empty(p)
       case _: LocalLimit if !p.isStreaming => empty(p)
       case _: Offset => empty(p)
-      case _: Repartition => empty(p)
-      case _: RepartitionByExpression => empty(p)
+      case _: RepartitionOperation =>
+        if (p.getTagValue(ROOT_REPARTITION).isEmpty) {
+          empty(p)
+        } else {
+          p.unsetTagValue(ROOT_REPARTITION)
+          p
+        }
       case _: RebalancePartitions => empty(p)
       // An aggregate with non-empty group expression will return one output row per group when the
       // input to the aggregate is not empty. If the input to the aggregate is empty then all groups
@@ -162,13 +171,40 @@ abstract class PropagateEmptyRelationBase extends Rule[LogicalPlan] with CastSupport
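The control flow of the changed `RepartitionOperation` case can be modeled with a toy plan tree. The sketch below is an illustration of the tagging idea only, not Catalyst code: `Plan`, `tag_root_repartition`, and `propagate_empty` are simplified stand-ins for `LogicalPlan`, the optimizer's tagging pass, and the `PropagateEmptyRelation` rule.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    name: str
    children: list = field(default_factory=list)
    tags: set = field(default_factory=set)


EMPTY = "LocalRelation <empty>"


def tag_root_repartition(plan: Plan) -> None:
    # Walk down from the root, tagging user-specified repartitions so the
    # empty-relation rule below leaves them in place.
    node = plan
    while node.name == "Repartition":
        node.tags.add("ROOT_REPARTITION")
        node = node.children[0]


def propagate_empty(plan: Plan) -> Plan:
    plan.children = [propagate_empty(c) for c in plan.children]
    if plan.children and all(c.name == EMPTY for c in plan.children):
        if plan.name == "Repartition" and "ROOT_REPARTITION" in plan.tags:
            return plan  # preserve the user-specified output partitioning
        return Plan(EMPTY)  # otherwise collapse to an empty relation
    return plan


# A filter under a user-specified repartition collapses to an empty
# relation, but the root repartition itself survives.
root = Plan("Repartition", [Plan("Filter", [Plan(EMPTY)])])
tag_root_repartition(root)
optimized = propagate_empty(root)
```

The design point mirrors the real patch: rather than special-casing plan shapes inside the rule's match, the root repartition is marked ahead of time with a tree-node tag, and the rule only has to check (and clear) that tag when it reaches a repartition over an empty child.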
[spark] branch master updated (ca17135b288 -> e0cb2eb104e)
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

    from ca17135b288 [SPARK-34777][UI] Fix StagePage input size/records not show when records greater than zero
     add e0cb2eb104e [SPARK-40135][PS] Support `data` mixed with `index` in DataFrame creation

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/frame.py                | 177 +++++++++++++++++------
 python/pyspark/pandas/tests/test_dataframe.py | 239 +++++++++++++++++++++++++-
 2 files changed, 377 insertions(+), 39 deletions(-)