[spark] branch branch-3.5 updated: [SPARK-43744][CONNECT] Fix class loading problem caused by stub user classes not found on the server classpath
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new 408c3df01d2  [SPARK-43744][CONNECT] Fix class loading problem caused by stub user classes not found on the server classpath

408c3df01d2 is described below

commit 408c3df01d20bb539929bdfdb236f899c428a12e
Author: Zhen Li
AuthorDate: Fri Jul 28 22:59:07 2023 -0400

    [SPARK-43744][CONNECT] Fix class loading problem caused by stub user classes not found on the server classpath

    ### What changes were proposed in this pull request?
    This PR introduces a stub class loader for unpacking Scala UDFs on the driver and the executors. When it encounters user classes that are not found on the server session classpath, the stub class loader tries to stub the class. This solves the problem that, when serializing UDFs, the Java serializer might include unnecessary user code, e.g. user classes that appear in lambda definition signatures in the same class where the UDF is defined. If the user code is actually needed to execute the UDF, we return an error message suggesting that the user add the missing classes using the `addArtifact` method.

    ### Why are the changes needed?
    To enhance the user experience of UDFs. This PR should be merged to master and 3.5.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Added tests for both Scala 2.12 and 2.13.

    Four tests in SparkSessionE2ESuite still fail to run with Maven after the fix, because the client test jar is installed on the system classpath (added using --jar at server start), while the stub class loader can only stub classes missing from the session classpath (added using `session.addArtifact`). Moving the test jar to the session classpath causes failures in tests for `flatMapGroupsWithState` (SPARK-44576). Finish moving the test jar to the session classpath once the `flatMapGroupsWithState` test failures are fixed.

    Closes #42069 from zhenlineo/ref-spark-result.

    Authored-by: Zhen Li
    Signed-off-by: Herman van Hovell
    (cherry picked from commit 6d0fed9a18ff87e73fdf1ee46b6b0d2df8dd5a1b)
    Signed-off-by: Herman van Hovell
---
 .../scala/org/apache/spark/sql/SparkSession.scala  |   2 +-
 .../sql/expressions/UserDefinedFunction.scala      |   2 +-
 .../jvm/src/test/resources/StubClassDummyUdf.scala |  56 +
 .../connect/client/jvm/src/test/resources/udf2.12 | Bin 0 -> 1520 bytes
 .../client/jvm/src/test/resources/udf2.12.jar     | Bin 0 -> 5332 bytes
 .../connect/client/jvm/src/test/resources/udf2.13 | Bin 0 -> 1630 bytes
 .../client/jvm/src/test/resources/udf2.13.jar     | Bin 0 -> 5674 bytes
 .../connect/client/UDFClassLoadingE2ESuite.scala  |  83 +
 .../connect/client/util/IntegrationTestUtils.scala |   2 +-
 .../connect/client/util/RemoteSparkSession.scala  |   2 +-
 .../artifact/SparkConnectArtifactManager.scala    |  17 ++-
 .../sql/connect/planner/SparkConnectPlanner.scala |  23 +++-
 connector/connect/server/src/test/resources/udf   | Bin 0 -> 973 bytes
 .../connect/server/src/test/resources/udf_noA.jar | Bin 0 -> 5545 bytes
 .../connect/artifact/StubClassLoaderSuite.scala   | 132 +
 .../spark/util/ChildFirstURLClassLoader.java      |   9 ++
 .../scala/org/apache/spark/executor/Executor.scala |  86 +++---
 .../org/apache/spark/internal/config/package.scala |  14 +++
 .../org/apache/spark/util/StubClassLoader.scala   |  79
 19 files changed, 480 insertions(+), 27 deletions(-)

diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
index d1832e65f3e..4b3de91b56f 100644
--- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
+++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
@@ -554,7 +554,7 @@ class SparkSession private[sql] (
     val command = proto.Command.newBuilder().setRegisterFunction(udf).build()
     val plan = proto.Plan.newBuilder().setCommand(command).build()
-    client.execute(plan)
+    client.execute(plan).asScala.foreach(_ => ())
   }

   @DeveloperApi

diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
index 18aef8a2e4c..e5c89d90c19 100644
--- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
+++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
@@ -92,7 +92,7 @@ sealed abstract class UserDefinedFunction {
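The one-line `SparkSession` change in the diff above replaces a bare `client.execute(plan)` with a call that drains the returned response iterator: if the iterator is simply dropped, the register-function command may not run to completion on the server. The drain pattern can be sketched generically (hypothetical helper, not Spark's client code):

```java
import java.util.Iterator;
import java.util.List;

public class DrainDemo {
    // Consume every element of a response iterator, discarding the payloads.
    // Pulling the stream to exhaustion ensures the remote command actually
    // finished before the caller returns.
    static <T> int drain(Iterator<T> responses) {
        int n = 0;
        while (responses.hasNext()) {
            responses.next();
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Stand-in for a server response stream.
        Iterator<String> responses = List.of("ack-1", "ack-2", "result").iterator();
        System.out.println(drain(responses)); // prints 3
    }
}
```

In the Scala client this is exactly `client.execute(plan).asScala.foreach(_ => ())`: iterate, ignore each response.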
[spark] branch master updated: [SPARK-43744][CONNECT] Fix class loading problem caused by stub user classes not found on the server classpath
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 6d0fed9a18f  [SPARK-43744][CONNECT] Fix class loading problem caused by stub user classes not found on the server classpath

6d0fed9a18f is described below

commit 6d0fed9a18ff87e73fdf1ee46b6b0d2df8dd5a1b
Author: Zhen Li
AuthorDate: Fri Jul 28 22:59:07 2023 -0400

    [SPARK-43744][CONNECT] Fix class loading problem caused by stub user classes not found on the server classpath

    ### What changes were proposed in this pull request?
    This PR introduces a stub class loader for unpacking Scala UDFs on the driver and the executors. When it encounters user classes that are not found on the server session classpath, the stub class loader tries to stub the class. This solves the problem that, when serializing UDFs, the Java serializer might include unnecessary user code, e.g. user classes that appear in lambda definition signatures in the same class where the UDF is defined. If the user code is actually needed to execute the UDF, we return an error message suggesting that the user add the missing classes using the `addArtifact` method.

    ### Why are the changes needed?
    To enhance the user experience of UDFs. This PR should be merged to master and 3.5.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Added tests for both Scala 2.12 and 2.13.

    Four tests in SparkSessionE2ESuite still fail to run with Maven after the fix, because the client test jar is installed on the system classpath (added using --jar at server start), while the stub class loader can only stub classes missing from the session classpath (added using `session.addArtifact`). Moving the test jar to the session classpath causes failures in tests for `flatMapGroupsWithState` (SPARK-44576). Finish moving the test jar to the session classpath once the `flatMapGroupsWithState` test failures are fixed.

    Closes #42069 from zhenlineo/ref-spark-result.

    Authored-by: Zhen Li
    Signed-off-by: Herman van Hovell
---
 .../scala/org/apache/spark/sql/SparkSession.scala  |   2 +-
 .../sql/expressions/UserDefinedFunction.scala      |   2 +-
 .../jvm/src/test/resources/StubClassDummyUdf.scala |  56 +
 .../connect/client/jvm/src/test/resources/udf2.12 | Bin 0 -> 1520 bytes
 .../client/jvm/src/test/resources/udf2.12.jar     | Bin 0 -> 5332 bytes
 .../connect/client/jvm/src/test/resources/udf2.13 | Bin 0 -> 1630 bytes
 .../client/jvm/src/test/resources/udf2.13.jar     | Bin 0 -> 5674 bytes
 .../connect/client/UDFClassLoadingE2ESuite.scala  |  83 +
 .../connect/client/util/IntegrationTestUtils.scala |   2 +-
 .../connect/client/util/RemoteSparkSession.scala  |   2 +-
 .../artifact/SparkConnectArtifactManager.scala    |  17 ++-
 .../sql/connect/planner/SparkConnectPlanner.scala |  23 +++-
 connector/connect/server/src/test/resources/udf   | Bin 0 -> 973 bytes
 .../connect/server/src/test/resources/udf_noA.jar | Bin 0 -> 5545 bytes
 .../connect/artifact/StubClassLoaderSuite.scala   | 132 +
 .../spark/util/ChildFirstURLClassLoader.java      |   9 ++
 .../scala/org/apache/spark/executor/Executor.scala |  86 +++---
 .../org/apache/spark/internal/config/package.scala |  14 +++
 .../org/apache/spark/util/StubClassLoader.scala   |  79
 19 files changed, 480 insertions(+), 27 deletions(-)

diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
index d1832e65f3e..4b3de91b56f 100644
--- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
+++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
@@ -554,7 +554,7 @@ class SparkSession private[sql] (
     val command = proto.Command.newBuilder().setRegisterFunction(udf).build()
     val plan = proto.Plan.newBuilder().setCommand(command).build()
-    client.execute(plan)
+    client.execute(plan).asScala.foreach(_ => ())
   }

   @DeveloperApi

diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
index 18aef8a2e4c..e5c89d90c19 100644
--- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
+++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
@@ -92,7 +92,7 @@ sealed abstract class UserDefinedFunction {
 /**
  * Holder class for a scalar user-defined function and it's input/output encoder(s).
  */
-case class
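The core idea of the commit — fall back to a stub when a class referenced during UDF deserialization is missing from the session classpath — can be illustrated with a deliberately simplified class loader. This is a toy sketch, not Spark's `StubClassLoader`: the real implementation generates a distinct stub class per missing name via bytecode emission (and raises a helpful error suggesting `addArtifact` if a stubbed class is actually invoked), whereas here a single fixed marker class stands in for every missing name.

```java
import java.util.HashSet;
import java.util.Set;

public class StubDemo {
    // Fixed placeholder type; the real implementation generates a fresh
    // stub class for each missing class name instead.
    static class StubMarker {}

    // Class loader that delegates to its parent first and, when the parent
    // cannot resolve a name, records the name and hands back a stub rather
    // than failing the whole deserialization.
    static class StubClassLoader extends ClassLoader {
        final Set<String> stubbed = new HashSet<>();

        StubClassLoader(ClassLoader parent) { super(parent); }

        @Override
        protected Class<?> findClass(String name) {
            // Reached only after parent delegation has already failed.
            stubbed.add(name);
            return StubMarker.class;
        }
    }

    // Small helper so callers need not handle the checked exception
    // (our findClass never throws it).
    static Class<?> loadOrStub(StubClassLoader loader, String name) {
        try {
            return loader.loadClass(name);
        } catch (ClassNotFoundException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        StubClassLoader scl = new StubClassLoader(StubDemo.class.getClassLoader());
        System.out.println(loadOrStub(scl, "java.lang.String") == String.class);       // resolved by parent
        System.out.println(loadOrStub(scl, "com.example.Missing") == StubMarker.class); // stubbed
    }
}
```

Classes that resolve normally are untouched; only the genuinely missing ones are replaced, which is why serialized UDFs that merely *mention* absent user classes (e.g. in lambda signatures) can still be unpacked.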
[spark] branch branch-3.5 updated: [SPARK-44574][SQL][CONNECT] Errors that moved into sql/api should also use AnalysisException
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new d01821e0a99  [SPARK-44574][SQL][CONNECT] Errors that moved into sql/api should also use AnalysisException

d01821e0a99 is described below

commit d01821e0a99f527b1c151c19b0f44c26121ad57d
Author: Rui Wang
AuthorDate: Fri Jul 28 22:47:00 2023 -0400

    [SPARK-44574][SQL][CONNECT] Errors that moved into sql/api should also use AnalysisException

    ### What changes were proposed in this pull request?
    https://github.com/apache/spark/pull/42130 moved AnalysisException to sql/api, so we can use AnalysisException again for those errors that were moved into sql/api but had been changed to SparkException.

    ### Why are the changes needed?
    Restores the previous behavior.

    ### Does this PR introduce _any_ user-facing change?
    No. This PR restores past behavior: users will again see AnalysisException in many cases.

    ### How was this patch tested?
    Existing UT.

    Closes #42190 from amaliujia/convert_back_errors.

    Authored-by: Rui Wang
    Signed-off-by: Herman van Hovell
    (cherry picked from commit 3761b7d65bd1e21b5a7c5966c2d03a0fe4a0b982)
    Signed-off-by: Herman van Hovell
---
 .../apache/spark/sql/errors/DataTypeErrors.scala  | 53 ++
 .../spark/sql/errors/QueryCompilationErrors.scala |  7 +--
 .../apache/spark/sql/types/StructTypeSuite.scala  | 26 +--
 .../apache/spark/sql/CharVarcharTestSuite.scala   | 28 +++-
 .../org/apache/spark/sql/DataFrameSuite.scala     |  2 +-
 .../org/apache/spark/sql/JsonFunctionsSuite.scala |  6 +--
 .../spark/sql/connector/AlterTableTests.scala     |  2 +-
 .../sql/errors/QueryCompilationErrorsSuite.scala  |  4 +-
 .../org/apache/spark/sql/jdbc/JDBCSuite.scala     |  4 +-
 .../sql/test/DataFrameReaderWriterSuite.scala     | 12 ++---
 10 files changed, 68 insertions(+), 76 deletions(-)

diff --git a/sql/api/src/main/scala/org/apache/spark/sql/errors/DataTypeErrors.scala b/sql/api/src/main/scala/org/apache/spark/sql/errors/DataTypeErrors.scala
index fcc3086b573..7a34a386cd8 100644
--- a/sql/api/src/main/scala/org/apache/spark/sql/errors/DataTypeErrors.scala
+++ b/sql/api/src/main/scala/org/apache/spark/sql/errors/DataTypeErrors.scala
@@ -17,8 +17,10 @@ package org.apache.spark.sql.errors

 import org.apache.spark.{SparkArithmeticException, SparkException, SparkIllegalArgumentException, SparkNumberFormatException, SparkRuntimeException, SparkUnsupportedOperationException}
+import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.catalyst.trees.{Origin, SQLQueryContext}
 import org.apache.spark.sql.catalyst.util.QuotingUtils
+import org.apache.spark.sql.catalyst.util.QuotingUtils.toSQLSchema
 import org.apache.spark.sql.types.{DataType, Decimal, StringType}
 import org.apache.spark.unsafe.types.UTF8String
@@ -97,49 +99,45 @@ private[sql] object DataTypeErrors extends DataTypeErrorsBase {
   }

   def schemaFailToParseError(schema: String, e: Throwable): Throwable = {
-    new SparkException(
+    new AnalysisException(
       errorClass = "INVALID_SCHEMA.PARSE_ERROR",
       messageParameters = Map(
-        "inputSchema" -> QuotingUtils.toSQLSchema(schema),
+        "inputSchema" -> toSQLSchema(schema),
         "reason" -> e.getMessage
       ),
-      cause = e)
+      cause = Some(e))
   }

   def invalidDayTimeIntervalType(startFieldName: String, endFieldName: String): Throwable = {
-    new SparkException(
+    new AnalysisException(
       errorClass = "_LEGACY_ERROR_TEMP_1224",
       messageParameters = Map(
         "startFieldName" -> startFieldName,
-        "endFieldName" -> endFieldName),
-      cause = null)
+        "endFieldName" -> endFieldName))
   }

   def invalidDayTimeField(field: Byte, supportedIds: Seq[String]): Throwable = {
-    new SparkException(
+    new AnalysisException(
       errorClass = "_LEGACY_ERROR_TEMP_1223",
       messageParameters = Map(
         "field" -> field.toString,
-        "supportedIds" -> supportedIds.mkString(", ")),
-      cause = null)
+        "supportedIds" -> supportedIds.mkString(", ")))
   }

   def invalidYearMonthField(field: Byte, supportedIds: Seq[String]): Throwable = {
-    new SparkException(
+    new AnalysisException(
       errorClass = "_LEGACY_ERROR_TEMP_1225",
       messageParameters = Map(
         "field" -> field.toString,
-        "supportedIds" -> supportedIds.mkString(", ")),
-      cause = null)
+        "supportedIds" -> supportedIds.mkString(", ")))
   }

   def decimalCannotGreaterThanPrecisionError(scale: Int, precision: Int): Throwable = {
-    new SparkException(
+    new AnalysisException(
       errorClass = "_LEGACY_ERROR_TEMP_1228",
       messageParameters = Map(
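The pattern in the diff above — surfacing a failure as an analysis-time exception that carries a stable error class and the original cause, instead of a generic `SparkException` — can be mirrored in a self-contained sketch. The class and helper below are hypothetical stand-ins, not Spark's actual `AnalysisException` API:

```java
public class ErrorDemo {
    // Toy analysis exception: an error-class identifier plus the usual
    // message/cause chain, so callers can match on the class programmatically.
    static class AnalysisException extends RuntimeException {
        final String errorClass;

        AnalysisException(String errorClass, String message, Throwable cause) {
            super(message, cause);
            this.errorClass = errorClass;
        }
    }

    // Mirrors the shape of schemaFailToParseError in the diff: wrap the
    // low-level parse failure rather than discarding it.
    static AnalysisException schemaFailToParseError(String schema, Throwable e) {
        return new AnalysisException(
            "INVALID_SCHEMA.PARSE_ERROR",
            "Cannot parse the schema " + schema + ": " + e.getMessage(),
            e);
    }

    public static void main(String[] args) {
        AnalysisException ex =
            schemaFailToParseError("struct<", new IllegalArgumentException("unclosed struct"));
        System.out.println(ex.errorClass);              // INVALID_SCHEMA.PARSE_ERROR
        System.out.println(ex.getCause().getMessage()); // unclosed struct
    }
}
```

Keeping the exception type stable across module moves is the whole point of the commit: callers that catch `AnalysisException` keep working even after the error helpers migrated to sql/api.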
[spark] branch master updated (85a4d1e56e8 -> 3761b7d65bd): [SPARK-44574][SQL][CONNECT] Errors that moved into sql/api should also use AnalysisException
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 85a4d1e56e8  [SPARK-41400][CONNECT] Remove Connect Client Catalyst Dependency
     add 3761b7d65bd  [SPARK-44574][SQL][CONNECT] Errors that moved into sql/api should also use AnalysisException

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/errors/DataTypeErrors.scala  | 53 ++
 .../spark/sql/errors/QueryCompilationErrors.scala |  7 +--
 .../apache/spark/sql/types/StructTypeSuite.scala  | 26 +--
 .../apache/spark/sql/CharVarcharTestSuite.scala   | 28 +++-
 .../org/apache/spark/sql/DataFrameSuite.scala     |  2 +-
 .../org/apache/spark/sql/JsonFunctionsSuite.scala |  6 +--
 .../spark/sql/connector/AlterTableTests.scala     |  2 +-
 .../sql/errors/QueryCompilationErrorsSuite.scala  |  4 +-
 .../org/apache/spark/sql/jdbc/JDBCSuite.scala     |  4 +-
 .../sql/test/DataFrameReaderWriterSuite.scala     | 12 ++---
 10 files changed, 68 insertions(+), 76 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (72af2c0fbc6 -> 85a4d1e56e8)
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 72af2c0fbc6  [SPARK-44585][MLLIB] Fix warning condition in MLLib RankingMetrics ndcgAk
     add 85a4d1e56e8  [SPARK-41400][CONNECT] Remove Connect Client Catalyst Dependency

No new revisions were added by this update.

Summary of changes:
 common/network-common/pom.xml                      |   3 +-
 .../apache/spark/network/sasl/SparkSaslSuite.java  |   5 +-
 .../org/apache/spark/network/util/ByteUnit.java    |   0
 .../org/apache/spark/network/util/JavaUtils.java   |  66 ++---
 .../org/apache/spark/util/SparkErrorUtils.scala    |  53 +++-
 .../org/apache/spark/util/SparkFileUtils.scala     |  80 +-
 .../org/apache/spark/util/SparkSerDeUtils.scala    |   9 +-
 connector/connect/client/jvm/pom.xml               |  14 +-
 .../scala/org/apache/spark/sql/SparkSession.scala  |   6 +-
 .../connect/client/arrow/ArrowDeserializer.scala   |   6 +-
 .../sql/connect/client/arrow/ArrowSerializer.scala |  19 +-
 .../connect/client/arrow/ArrowVectorReader.scala   |  15 +-
 .../org/apache/spark/sql/protobuf/functions.scala  |  15 +-
 .../org/apache/spark/sql/ClientE2ETestSuite.scala  |   7 +-
 .../spark/sql/DataFrameNaFunctionSuite.scala       |   6 +-
 .../apache/spark/sql/PlanGenerationTestSuite.scala |   6 +-
 .../scala/org/apache/spark/sql/SQLHelper.scala     |  11 +-
 .../apache/spark/sql/SQLImplicitsTestSuite.scala   |  31 ++-
 .../apache/spark/sql/SparkSessionE2ESuite.scala    |  16 +-
 .../sql/UserDefinedFunctionE2ETestSuite.scala      |  15 +-
 .../spark/sql/UserDefinedFunctionSuite.scala       |   4 +-
 .../sql/connect/client/ClassFinderSuite.scala      |   6 +-
 .../connect/client/arrow/ArrowEncoderSuite.scala   |   6 +-
 .../connect/client/util/IntegrationTestUtils.scala |  11 +-
 .../spark/sql/connect/client/util/QueryTest.scala  |   4 +-
 .../spark/sql/streaming/StreamingQuerySuite.scala  |   6 +-
 connector/connect/common/pom.xml                   |   8 +-
 .../common/LiteralValueProtoConverter.scala        |  14 +-
 .../main/scala/org/apache/spark/util/Utils.scala   | 132 +-
 .../catalyst/plans/logical/LogicalGroupState.scala |   7 +-
 .../sql/catalyst/util/SparkIntervalUtils.scala     | 287 -
 .../spark/sql/catalyst/util/StringUtils.scala      |  20 ++
 .../spark/sql/errors/CompilationErrors.scala       |  54
 .../org/apache/spark/sql/internal/SqlApiConf.scala |   3 +
 .../org/apache/spark/sql/types/UpCastRule.scala    |  86 ++
 .../sql/catalyst/analysis/AnsiTypeCoercion.scala   |   2 +-
 .../spark/sql/catalyst/analysis/TypeCoercion.scala |  12 +-
 .../spark/sql/catalyst/expressions/Cast.scala      |  47 +---
 .../sql/catalyst/expressions/ToStringBase.scala    |   6 +-
 .../spark/sql/catalyst/plans/logical/object.scala  |   3 -
 .../spark/sql/catalyst/util/IntervalUtils.scala    | 260 ---
 .../apache/spark/sql/catalyst/util/package.scala   |  21 +-
 .../spark/sql/errors/QueryCompilationErrors.scala  |  33 +--
 .../org/apache/spark/sql/internal/SQLConf.scala    |   6 +-
 .../sql/catalyst/expressions/CastSuiteBase.scala   |   2 +-
 45 files changed, 751 insertions(+), 672 deletions(-)
 rename common/{network-common => utils}/src/main/java/org/apache/spark/network/util/ByteUnit.java (100%)
 rename common/{network-common => utils}/src/main/java/org/apache/spark/network/util/JavaUtils.java (89%)
 copy connector/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/package-info.java => sql/api/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalGroupState.scala (86%)
 create mode 100644 sql/api/src/main/scala/org/apache/spark/sql/errors/CompilationErrors.scala
 create mode 100644 sql/api/src/main/scala/org/apache/spark/sql/types/UpCastRule.scala
[spark] branch branch-3.5 updated: [SPARK-41400][CONNECT] Remove Connect Client Catalyst Dependency
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new 5412fb0590e  [SPARK-41400][CONNECT] Remove Connect Client Catalyst Dependency

5412fb0590e is described below

commit 5412fb0590e55d635e9e31887ec5c72d10011899
Author: Herman van Hovell
AuthorDate: Fri Jul 28 21:30:51 2023 -0400

    [SPARK-41400][CONNECT] Remove Connect Client Catalyst Dependency

    ### What changes were proposed in this pull request?
    This PR decouples the Spark Connect Scala Client from Catalyst; it now uses the SQL API module instead. There were quite a few changes we still needed to make:
    - For testing we needed a bunch of utilities. I have moved these to common-utils.
    - I have moved bits and pieces of IntervalUtils to SparkIntervalUtils.
    - A lot of small fixes.

    ### Why are the changes needed?
    This reduces the client's dependency tree from ~300 MB of deps to ~30 MB. This makes it easier to use the client when you are developing Connect applications. On top of this, the reduced dependency graph also means folks will be less affected by the client's classpath.

    ### Does this PR introduce _any_ user-facing change?
    Yes. It changes the classpath exposed by the Spark Connect Scala Client.

    ### How was this patch tested?
    Existing tests.

    Closes #42184 from hvanhovell/SPARK-41400-v1.

    Authored-by: Herman van Hovell
    Signed-off-by: Herman van Hovell
    (cherry picked from commit 85a4d1e56e80d85dad9b8945c67287927eb379f6)
    Signed-off-by: Herman van Hovell
---
 common/network-common/pom.xml                      |   3 +-
 .../apache/spark/network/sasl/SparkSaslSuite.java  |   5 +-
 .../org/apache/spark/network/util/ByteUnit.java    |   0
 .../org/apache/spark/network/util/JavaUtils.java   |  66 ++---
 .../org/apache/spark/util/SparkErrorUtils.scala    |  53 +++-
 .../org/apache/spark/util/SparkFileUtils.scala     |  80 +-
 .../org/apache/spark/util/SparkSerDeUtils.scala    |   9 +-
 connector/connect/client/jvm/pom.xml               |  14 +-
 .../scala/org/apache/spark/sql/SparkSession.scala  |   6 +-
 .../connect/client/arrow/ArrowDeserializer.scala   |   6 +-
 .../sql/connect/client/arrow/ArrowSerializer.scala |  19 +-
 .../connect/client/arrow/ArrowVectorReader.scala   |  15 +-
 .../org/apache/spark/sql/protobuf/functions.scala  |  15 +-
 .../org/apache/spark/sql/ClientE2ETestSuite.scala  |   7 +-
 .../spark/sql/DataFrameNaFunctionSuite.scala       |   6 +-
 .../apache/spark/sql/PlanGenerationTestSuite.scala |   6 +-
 .../scala/org/apache/spark/sql/SQLHelper.scala     |  11 +-
 .../apache/spark/sql/SQLImplicitsTestSuite.scala   |  31 ++-
 .../apache/spark/sql/SparkSessionE2ESuite.scala    |  16 +-
 .../sql/UserDefinedFunctionE2ETestSuite.scala      |  15 +-
 .../spark/sql/UserDefinedFunctionSuite.scala       |   4 +-
 .../sql/connect/client/ClassFinderSuite.scala      |   6 +-
 .../connect/client/arrow/ArrowEncoderSuite.scala   |   6 +-
 .../connect/client/util/IntegrationTestUtils.scala |  11 +-
 .../spark/sql/connect/client/util/QueryTest.scala  |   4 +-
 .../spark/sql/streaming/StreamingQuerySuite.scala  |   6 +-
 connector/connect/common/pom.xml                   |   8 +-
 .../common/LiteralValueProtoConverter.scala        |  14 +-
 .../main/scala/org/apache/spark/util/Utils.scala   | 132 +-
 .../catalyst/plans/logical/LogicalGroupState.scala |  16 +-
 .../sql/catalyst/util/SparkIntervalUtils.scala     | 287 -
 .../spark/sql/catalyst/util/StringUtils.scala      |  20 ++
 .../spark/sql/errors/CompilationErrors.scala       |  54
 .../org/apache/spark/sql/internal/SqlApiConf.scala |   3 +
 .../org/apache/spark/sql/types/UpCastRule.scala    |  86 ++
 .../sql/catalyst/analysis/AnsiTypeCoercion.scala   |   2 +-
 .../spark/sql/catalyst/analysis/TypeCoercion.scala |  12 +-
 .../spark/sql/catalyst/expressions/Cast.scala      |  47 +---
 .../sql/catalyst/expressions/ToStringBase.scala    |   6 +-
 .../spark/sql/catalyst/plans/logical/object.scala  |   3 -
 .../spark/sql/catalyst/util/IntervalUtils.scala    | 260 ---
 .../apache/spark/sql/catalyst/util/package.scala   |  21 +-
 .../spark/sql/errors/QueryCompilationErrors.scala  |  33 +--
 .../org/apache/spark/sql/internal/SQLConf.scala    |   6 +-
 .../sql/catalyst/expressions/CastSuiteBase.scala   |   2 +-
 45 files changed, 751 insertions(+), 681 deletions(-)

diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml
index 8a63e999c53..2b43f9ce98a 100644
--- a/common/network-common/pom.xml
+++ b/common/network-common/pom.xml
@@ -150,7 +150,8 @@
     <dependency>
       <groupId>org.apache.spark</groupId>
-      <artifactId>spark-tags_${scala.binary.version}</artifactId>
+      <artifactId>spark-common-utils_${scala.binary.version}</artifactId>
+      <version>${project.version}</version>
[spark] branch branch-3.4 updated: [SPARK-44585][MLLIB] Fix warning condition in MLLib RankingMetrics ndcgAk
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new f19a953b647  [SPARK-44585][MLLIB] Fix warning condition in MLLib RankingMetrics ndcgAk

f19a953b647 is described below

commit f19a953b6471673f89d689bea20e0d53026f7b5b
Author: Guilhem Vuillier <101632595+guilhem-de...@users.noreply.github.com>
AuthorDate: Fri Jul 28 17:29:47 2023 -0500

    [SPARK-44585][MLLIB] Fix warning condition in MLLib RankingMetrics ndcgAk

    ### What changes were proposed in this pull request?
    This PR fixes the condition for raising the following warning in MLLib's RankingMetrics ndcgAk function: "# of ground truth set and # of relevance value set should be equal, check input data"

    The logic for raising the warning is faulty at the moment: it raises a warning if the `rel` input is empty and `lab.size` and `rel.size` are not equal. The logic should be to raise a warning if the `rel` input is **not empty** and `lab.size` and `rel.size` are not equal.

    This warning was added in the following PR: https://github.com/apache/spark/pull/36843

    ### Why are the changes needed?
    With the current logic, RankingMetrics will:
    - raise an incorrect warning when a user is using it in "binary" mode (i.e. no relevance values in the input)
    - not raise a warning (that could be necessary) when the user is using it in "non-binary" mode (i.e. with relevance values in the input)

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    No change made to the test suite for RankingMetrics: https://github.com/uch/spark/blob/a172172329cc78b50f716924f2a344517deb71fc/mllib/src/test/scala/org/apache/spark/mllib/evaluation/RankingMetricsSuite.scala

    Closes #42207 from guilhem-depop/patch-1.

    Authored-by: Guilhem Vuillier <101632595+guilhem-de...@users.noreply.github.com>
    Signed-off-by: Sean Owen
    (cherry picked from commit 72af2c0fbc6673a5e49f1fd6693fe2c90141a84f)
    Signed-off-by: Sean Owen
---
 .../scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala
index 37e57736574..a3316d8a8fa 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala
@@ -140,6 +140,9 @@ class RankingMetrics[T: ClassTag] @Since("1.2.0") (predictionAndLabels: RDD[_ <:
  * and the NDCG is obtained by dividing the DCG value on the ground truth set. In the current
  * implementation, the relevance value is binary if the relevance value is empty.
+ * If the relevance value is not empty but its size doesn't match the ground truth set size,
+ * a log warning is generated.
+ *
  * If a query has an empty ground truth set, zero will be used as ndcg together with
  * a log warning.
  *
@@ -157,7 +160,7 @@ class RankingMetrics[T: ClassTag] @Since("1.2.0") (predictionAndLabels: RDD[_ <:
     val useBinary = rel.isEmpty
     val labSet = lab.toSet
     val relMap = Utils.toMap(lab, rel)
-    if (useBinary && lab.size != rel.size) {
+    if (!useBinary && lab.size != rel.size) {
       logWarning(
         "# of ground truth set and # of relevance value set should be equal, " +
           "check input data")
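The one-character fix above flips the warning predicate from `useBinary && …` to `!useBinary && …`. The corrected logic can be isolated as a small truth-table demo (hypothetical helper, not Spark's code — `labSize`/`relSize` stand in for `lab.size`/`rel.size`):

```java
public class NdcgWarnDemo {
    // Warn only in non-binary mode (relevance values supplied) when the
    // ground truth set and the relevance value set differ in size.
    static boolean shouldWarn(int labSize, int relSize) {
        boolean useBinary = relSize == 0; // empty relevance values => binary mode
        return !useBinary && labSize != relSize;
    }

    public static void main(String[] args) {
        System.out.println(shouldWarn(3, 0)); // false: binary mode (the old `useBinary && ...` code warned here)
        System.out.println(shouldWarn(3, 2)); // true: non-binary, sizes differ
        System.out.println(shouldWarn(3, 3)); // false: non-binary, sizes match
    }
}
```

The old condition could only ever fire in binary mode (where `rel.size` is 0 and the sizes trivially differ), which is exactly the backwards behavior the commit message describes.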
[spark] branch branch-3.5 updated: [SPARK-44585][MLLIB] Fix warning condition in MLLib RankingMetrics ndcgAk
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new d0fa5a75d17  [SPARK-44585][MLLIB] Fix warning condition in MLLib RankingMetrics ndcgAk

d0fa5a75d17 is described below

commit d0fa5a75d17335e60aefbb554adb9b3fce1f97ff
Author: Guilhem Vuillier <101632595+guilhem-de...@users.noreply.github.com>
AuthorDate: Fri Jul 28 17:29:47 2023 -0500

    [SPARK-44585][MLLIB] Fix warning condition in MLLib RankingMetrics ndcgAk

    ### What changes were proposed in this pull request?
    This PR fixes the condition for raising the following warning in MLLib's RankingMetrics ndcgAk function: "# of ground truth set and # of relevance value set should be equal, check input data"

    The logic for raising the warning is faulty at the moment: it raises a warning if the `rel` input is empty and `lab.size` and `rel.size` are not equal. The logic should be to raise a warning if the `rel` input is **not empty** and `lab.size` and `rel.size` are not equal.

    This warning was added in the following PR: https://github.com/apache/spark/pull/36843

    ### Why are the changes needed?
    With the current logic, RankingMetrics will:
    - raise an incorrect warning when a user is using it in "binary" mode (i.e. no relevance values in the input)
    - not raise a warning (that could be necessary) when the user is using it in "non-binary" mode (i.e. with relevance values in the input)

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    No change made to the test suite for RankingMetrics: https://github.com/uch/spark/blob/a172172329cc78b50f716924f2a344517deb71fc/mllib/src/test/scala/org/apache/spark/mllib/evaluation/RankingMetricsSuite.scala

    Closes #42207 from guilhem-depop/patch-1.

    Authored-by: Guilhem Vuillier <101632595+guilhem-de...@users.noreply.github.com>
    Signed-off-by: Sean Owen
    (cherry picked from commit 72af2c0fbc6673a5e49f1fd6693fe2c90141a84f)
    Signed-off-by: Sean Owen
---
 .../scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala
index 37e57736574..a3316d8a8fa 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala
@@ -140,6 +140,9 @@ class RankingMetrics[T: ClassTag] @Since("1.2.0") (predictionAndLabels: RDD[_ <:
  * and the NDCG is obtained by dividing the DCG value on the ground truth set. In the current
  * implementation, the relevance value is binary if the relevance value is empty.
+ * If the relevance value is not empty but its size doesn't match the ground truth set size,
+ * a log warning is generated.
+ *
  * If a query has an empty ground truth set, zero will be used as ndcg together with
  * a log warning.
  *
@@ -157,7 +160,7 @@ class RankingMetrics[T: ClassTag] @Since("1.2.0") (predictionAndLabels: RDD[_ <:
     val useBinary = rel.isEmpty
     val labSet = lab.toSet
     val relMap = Utils.toMap(lab, rel)
-    if (useBinary && lab.size != rel.size) {
+    if (!useBinary && lab.size != rel.size) {
       logWarning(
         "# of ground truth set and # of relevance value set should be equal, " +
           "check input data")
[spark] branch master updated: [SPARK-44585][MLLIB] Fix warning condition in MLLib RankingMetrics ndcgAk
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 72af2c0fbc6 [SPARK-44585][MLLIB] Fix warning condition in MLLib RankingMetrics ndcgAk 72af2c0fbc6 is described below commit 72af2c0fbc6673a5e49f1fd6693fe2c90141a84f Author: Guilhem Vuillier <101632595+guilhem-de...@users.noreply.github.com> AuthorDate: Fri Jul 28 17:29:47 2023 -0500 [SPARK-44585][MLLIB] Fix warning condition in MLLib RankingMetrics ndcgAk ### What changes were proposed in this pull request? This PR fixes the condition to raise the following warning in MLLib's RankingMetrics ndcgAk function: "# of ground truth set and # of relevance value set should be equal, check input data" The logic for raising warnings is faulty at the moment: it raises a warning if the `rel` input is empty and `lab.size` and `rel.size` are not equal. The logic should be to raise a warning if `rel` input is **not empty** and `lab.size` and `rel.size` are not equal. This warning was added in the following PR: https://github.com/apache/spark/pull/36843 ### Why are the changes needed? With the current logic, RankingMetrics will: - raise incorrect warning when a user is using it in the "binary" mode (i.e. no relevance values in the input) - not raise warning (that could be necessary) when the user is using it in the "non-binary" model (i.e. with relevance values in the input) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No change made to the test suite for RankingMetrics: https://github.com/uch/spark/blob/a172172329cc78b50f716924f2a344517deb71fc/mllib/src/test/scala/org/apache/spark/mllib/evaluation/RankingMetricsSuite.scala Closes #42207 from guilhem-depop/patch-1. 
Authored-by: Guilhem Vuillier <101632595+guilhem-de...@users.noreply.github.com>
Signed-off-by: Sean Owen
---
 .../scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala
index 37e57736574..a3316d8a8fa 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala
@@ -140,6 +140,9 @@ class RankingMetrics[T: ClassTag] @Since("1.2.0") (predictionAndLabels: RDD[_ <:
    * and the NDCG is obtained by dividing the DCG value on the ground truth set. In the current
    * implementation, the relevance value is binary if the relevance value is empty.
+   * If the relevance value is not empty but its size doesn't match the ground truth set size,
+   * a log warning is generated.
+   *
    * If a query has an empty ground truth set, zero will be used as ndcg together with
    * a log warning.
    *
@@ -157,7 +160,7 @@ class RankingMetrics[T: ClassTag] @Since("1.2.0") (predictionAndLabels: RDD[_ <:
       val useBinary = rel.isEmpty
       val labSet = lab.toSet
       val relMap = Utils.toMap(lab, rel)
-      if (useBinary && lab.size != rel.size) {
+      if (!useBinary && lab.size != rel.size) {
         logWarning(
           "# of ground truth set and # of relevance value set should be equal, " +
             "check input data")
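The corrected warning condition can be illustrated with a small standalone sketch. This is plain Python, not the actual MLlib Scala code; `ndcg_warning_needed` is a hypothetical helper that mirrors the fixed `!useBinary && lab.size != rel.size` check:

```python
def ndcg_warning_needed(labels, relevances):
    # Mirrors `val useBinary = rel.isEmpty` in RankingMetrics:
    # binary mode is in effect when no relevance values are supplied.
    use_binary = len(relevances) == 0
    # Fixed condition: warn only in non-binary mode when the ground truth
    # set and the relevance value set have different sizes.
    return (not use_binary) and len(labels) != len(relevances)

# Binary mode (no relevances): sizes differ, but no warning should fire.
assert ndcg_warning_needed(["a", "b"], []) is False
# Non-binary mode with mismatched sizes: the warning is now raised.
assert ndcg_warning_needed(["a", "b"], [1.0]) is True
# Non-binary mode with matching sizes: no warning.
assert ndcg_warning_needed(["a"], [1.0]) is False
```

With the pre-fix condition (`use_binary and ...`) the first case would incorrectly warn and the second would stay silent, which is exactly the behavior the PR describes.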
[spark] branch branch-3.5 updated: [SPARK-37377][SQL][FOLLOWUP] Fix the partitioned join of one side test case not match
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new 6879519cafa [SPARK-37377][SQL][FOLLOWUP] Fix the partitioned join of one side test case not match

6879519cafa is described below

commit 6879519cafa913f61a5f6029125e41cf79e80168
Author: Jia Fan
AuthorDate: Fri Jul 28 19:14:56 2023 +0800

    [SPARK-37377][SQL][FOLLOWUP] Fix the partitioned join of one side test case not match

    ### What changes were proposed in this pull request?

    The "partitioned join of one side" test case does not match its description: currently both sides report partitioning, not one.

    ### Why are the changes needed?

    Fix the test case.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Add a new test.

    Closes #42144 from Hisoka-X/SPARK-37377_one_side_test_case.

    Authored-by: Jia Fan
    Signed-off-by: Wenchen Fan
    (cherry picked from commit 20bb6c0c5a84345bb09ac3cab6267a5747b6be05)
    Signed-off-by: Wenchen Fan
---
 .../org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala
index 8be3c6d9e13..880c30ba9f9 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala
@@ -313,9 +313,8 @@ class KeyGroupedPartitioningSuite extends DistributionAndOrderingSuiteBase {

   test("partitioned join: only one side reports partitioning") {
     val customers_partitions = Array(bucket(4, "customer_id"))
-    val orders_partitions = Array(bucket(2, "customer_id"))

-    testWithCustomersAndOrders(customers_partitions, orders_partitions, 2)
+    testWithCustomersAndOrders(customers_partitions, Array.empty, 2)
   }

   private val items: String = "items"
[spark] branch master updated: [SPARK-37377][SQL][FOLLOWUP] Fix the partitioned join of one side test case not match
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 20bb6c0c5a8 [SPARK-37377][SQL][FOLLOWUP] Fix the partitioned join of one side test case not match

20bb6c0c5a8 is described below

commit 20bb6c0c5a84345bb09ac3cab6267a5747b6be05
Author: Jia Fan
AuthorDate: Fri Jul 28 19:14:56 2023 +0800

    [SPARK-37377][SQL][FOLLOWUP] Fix the partitioned join of one side test case not match

    ### What changes were proposed in this pull request?

    The "partitioned join of one side" test case does not match its description: currently both sides report partitioning, not one.

    ### Why are the changes needed?

    Fix the test case.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Add a new test.

    Closes #42144 from Hisoka-X/SPARK-37377_one_side_test_case.

    Authored-by: Jia Fan
    Signed-off-by: Wenchen Fan
---
 .../org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala
index 8be3c6d9e13..880c30ba9f9 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala
@@ -313,9 +313,8 @@ class KeyGroupedPartitioningSuite extends DistributionAndOrderingSuiteBase {

   test("partitioned join: only one side reports partitioning") {
     val customers_partitions = Array(bucket(4, "customer_id"))
-    val orders_partitions = Array(bucket(2, "customer_id"))

-    testWithCustomersAndOrders(customers_partitions, orders_partitions, 2)
+    testWithCustomersAndOrders(customers_partitions, Array.empty, 2)
   }

   private val items: String = "items"
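The intent of the SPARK-37377 follow-up can be sketched with a toy model. This is hypothetical Python, not the Scala test suite; it only models the "how many sides report partitioning" property that the fixed test is supposed to exercise:

```python
def reports_one_sided_partitioning(customers_partitions, orders_partitions):
    # The test's description requires exactly one join side to report
    # partitioning; an empty transform list models Array.empty.
    reporting_sides = sum(
        1 for p in (customers_partitions, orders_partitions) if len(p) > 0
    )
    return reporting_sides == 1

# Before the fix: both sides passed bucket transforms, so two sides reported.
assert not reports_one_sided_partitioning(
    ["bucket(4, customer_id)"], ["bucket(2, customer_id)"]
)
# After the fix: the orders side passes Array.empty, so only one side reports.
assert reports_one_sided_partitioning(["bucket(4, customer_id)"], [])
```

This is why replacing `orders_partitions` with `Array.empty` makes the test match its title "only one side reports partitioning".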
[spark] branch 3.5 created (now 3dcee0ccf7e)
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a change to branch 3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

  at 3dcee0ccf7e [SPARK-44538][CONNECT][SQL] Reinstate Row.jsonValue and friends

No new revisions were added by this update.
[spark] branch branch-3.5 updated: [SPARK-44552][SQL] Remove `private object ParseState` definition from `IntervalUtils`
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new 81a939d375b [SPARK-44552][SQL] Remove `private object ParseState` definition from `IntervalUtils`

81a939d375b is described below

commit 81a939d375bad482333a8e87f00d21dd64048b0e
Author: yangjie01
AuthorDate: Fri Jul 28 18:09:48 2023 +0800

    [SPARK-44552][SQL] Remove `private object ParseState` definition from `IntervalUtils`

    ### What changes were proposed in this pull request?

    SPARK-44326 (https://github.com/apache/spark/pull/41885) moved the relevant code from `IntervalUtils` to `SparkIntervalUtils` (including the `private object ParseState` definition), see https://github.com/apache/spark/blob/6ca45c52b7416e7b3520dc902cb24f060c7c72dd/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkIntervalUtils.scala#L247-L260, but did not delete the definition of `private object ParseState` in `IntervalUtils`. This PR cleans it up.

    ### Why are the changes needed?

    Code cleanup.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Pass GitHub Actions.

    Closes #42165 from LuciferYang/SPARK-44552.
Lead-authored-by: yangjie01
Co-authored-by: YangJie
Signed-off-by: yangjie01
(cherry picked from commit 7ee9e43ec6891f26b36a090a83536d3cae682861)
Signed-off-by: yangjie01
---
 .../apache/spark/sql/catalyst/util/IntervalUtils.scala | 15 ---
 1 file changed, 15 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
index 6ba59b4e730..24620e692a0 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
@@ -739,21 +739,6 @@ object IntervalUtils extends SparkIntervalUtils {
     fromDoubles(interval.months / num, interval.days / num, interval.microseconds / num)
   }

-  private object ParseState extends Enumeration {
-    type ParseState = Value
-
-    val PREFIX,
-    TRIM_BEFORE_SIGN,
-    SIGN,
-    TRIM_BEFORE_VALUE,
-    VALUE,
-    VALUE_FRACTIONAL_PART,
-    TRIM_BEFORE_UNIT,
-    UNIT_BEGIN,
-    UNIT_SUFFIX,
-    UNIT_END = Value
-  }
-
   /**
    * A safe version of `stringToInterval`. It returns null for invalid input string.
    */
[spark] branch master updated: [SPARK-44552][SQL] Remove `private object ParseState` definition from `IntervalUtils`
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 7ee9e43ec68 [SPARK-44552][SQL] Remove `private object ParseState` definition from `IntervalUtils`

7ee9e43ec68 is described below

commit 7ee9e43ec6891f26b36a090a83536d3cae682861
Author: yangjie01
AuthorDate: Fri Jul 28 18:09:48 2023 +0800

    [SPARK-44552][SQL] Remove `private object ParseState` definition from `IntervalUtils`

    ### What changes were proposed in this pull request?

    SPARK-44326 (https://github.com/apache/spark/pull/41885) moved the relevant code from `IntervalUtils` to `SparkIntervalUtils` (including the `private object ParseState` definition), see https://github.com/apache/spark/blob/6ca45c52b7416e7b3520dc902cb24f060c7c72dd/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkIntervalUtils.scala#L247-L260, but did not delete the definition of `private object ParseState` in `IntervalUtils`. This PR cleans it up.

    ### Why are the changes needed?

    Code cleanup.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Pass GitHub Actions.

    Closes #42165 from LuciferYang/SPARK-44552.
Lead-authored-by: yangjie01
Co-authored-by: YangJie
Signed-off-by: yangjie01
---
 .../apache/spark/sql/catalyst/util/IntervalUtils.scala | 15 ---
 1 file changed, 15 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
index 6ba59b4e730..24620e692a0 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
@@ -739,21 +739,6 @@ object IntervalUtils extends SparkIntervalUtils {
     fromDoubles(interval.months / num, interval.days / num, interval.microseconds / num)
   }

-  private object ParseState extends Enumeration {
-    type ParseState = Value
-
-    val PREFIX,
-    TRIM_BEFORE_SIGN,
-    SIGN,
-    TRIM_BEFORE_VALUE,
-    VALUE,
-    VALUE_FRACTIONAL_PART,
-    TRIM_BEFORE_UNIT,
-    UNIT_BEGIN,
-    UNIT_SUFFIX,
-    UNIT_END = Value
-  }
-
   /**
    * A safe version of `stringToInterval`. It returns null for invalid input string.
    */
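For reference, the ten parser states deleted here (and still defined in `SparkIntervalUtils`, which `IntervalUtils` now extends) can be mirrored in a small Python `Enum`. This is only a sketch of the state set that the interval-string parser walks through, not of the parsing logic itself:

```python
from enum import Enum, auto

class ParseState(Enum):
    # Same ten states as the removed Scala Enumeration in IntervalUtils.
    PREFIX = auto()
    TRIM_BEFORE_SIGN = auto()
    SIGN = auto()
    TRIM_BEFORE_VALUE = auto()
    VALUE = auto()
    VALUE_FRACTIONAL_PART = auto()
    TRIM_BEFORE_UNIT = auto()
    UNIT_BEGIN = auto()
    UNIT_SUFFIX = auto()
    UNIT_END = auto()

assert len(ParseState) == 10
```

The removal is safe precisely because this definition was a byte-for-byte duplicate of the one already inherited from `SparkIntervalUtils`.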
[spark] branch branch-3.5 updated: [SPARK-43968][PYTHON][3.5] Improve error messages for Python UDTFs with wrong number of outputs
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new 7059b69e67d [SPARK-43968][PYTHON][3.5] Improve error messages for Python UDTFs with wrong number of outputs

7059b69e67d is described below

commit 7059b69e67db8126dafc3d4b1f3b39e947c4c3ca
Author: allisonwang-db
AuthorDate: Fri Jul 28 15:30:17 2023 +0900

    [SPARK-43968][PYTHON][3.5] Improve error messages for Python UDTFs with wrong number of outputs

    ### What changes were proposed in this pull request?

    This PR cherry-picks 7194ce9263fe1683c039a1aaf9462657b1672a99. It improves the error messages for Python UDTFs when the number of outputs mismatches the number of outputs specified in the return type of the UDTF.

    ### Why are the changes needed?

    To make Python UDTFs more user-friendly.

    ### Does this PR introduce _any_ user-facing change?

    Yes. This PR improves the error messages. Before this change, the error thrown by Spark will be a Java IllegalStateException:
    ```
    java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema
    ```
    After this PR, it will throw a clearer error message with an error class:
    ```
    [UDTF_RETURN_SCHEMA_MISMATCH] The number of columns in the result does not match the specified schema
    ```

    ### How was this patch tested?

    Existing tests and new unit tests.

    Closes #42192 from allisonwang-db/spark-43968-3.5.
Authored-by: allisonwang-db
Signed-off-by: Hyukjin Kwon
---
 python/pyspark/errors/error_classes.py            |   5 +
 python/pyspark/sql/connect/udtf.py                |   4 +-
 .../pyspark/sql/tests/connect/test_parity_udtf.py |  50
 python/pyspark/sql/tests/test_udtf.py             | 133 +++--
 python/pyspark/sql/udtf.py                        |   9 +-
 python/pyspark/worker.py                          |  22 +++-
 6 files changed, 99 insertions(+), 124 deletions(-)

diff --git a/python/pyspark/errors/error_classes.py b/python/pyspark/errors/error_classes.py
index b1bf6b47af9..f6411fac1da 100644
--- a/python/pyspark/errors/error_classes.py
+++ b/python/pyspark/errors/error_classes.py
@@ -320,6 +320,11 @@ ERROR_CLASSES_JSON = """
     "The eval type for the UDTF '' is invalid. It must be one of ."
     ]
   },
+  "INVALID_UDTF_HANDLER_TYPE" : {
+    "message" : [
+      "The UDTF is invalid. The function handler must be a class, but got ''. Please provide a class as the function handler."
+    ]
+  },
   "INVALID_UDTF_NO_EVAL" : {
     "message" : [
       "The UDTF '' is invalid. It does not implement the required 'eval' method. Please implement the 'eval' method in '' and try again."
diff --git a/python/pyspark/sql/connect/udtf.py b/python/pyspark/sql/connect/udtf.py
index 1fe8e1024ee..3747e37459e 100644
--- a/python/pyspark/sql/connect/udtf.py
+++ b/python/pyspark/sql/connect/udtf.py
@@ -124,6 +124,8 @@ class UserDefinedTableFunction:
         evalType: int = PythonEvalType.SQL_TABLE_UDF,
         deterministic: bool = True,
     ) -> None:
+        _validate_udtf_handler(func)
+
         self.func = func
         self.returnType: DataType = (
             UnparsedDataType(returnType) if isinstance(returnType, str) else returnType
@@ -132,8 +134,6 @@ class UserDefinedTableFunction:
         self.evalType = evalType
         self.deterministic = deterministic

-        _validate_udtf_handler(func)
-
     def _build_common_inline_user_defined_table_function(
         self, *cols: "ColumnOrName"
     ) -> CommonInlineUserDefinedTableFunction:

diff --git a/python/pyspark/sql/tests/connect/test_parity_udtf.py b/python/pyspark/sql/tests/connect/test_parity_udtf.py
index e18e116e003..355f5288d2c 100644
--- a/python/pyspark/sql/tests/connect/test_parity_udtf.py
+++ b/python/pyspark/sql/tests/connect/test_parity_udtf.py
@@ -54,56 +54,6 @@ class UDTFParityTests(BaseUDTFTestsMixin, ReusedConnectTestCase):
         ):
             TestUDTF(lit(1)).collect()

-    def test_udtf_with_wrong_num_output(self):
-        err_msg = (
-            "java.lang.IllegalStateException: Input row doesn't have expected number of "
-            + "values required by the schema."
-        )
-
-        @udtf(returnType="a: int, b: int")
-        class TestUDTF:
-            def eval(self, a: int):
-                yield a,
-
-        with self.assertRaisesRegex(SparkConnectGrpcException, err_msg):
-            TestUDTF(lit(1)).collect()

-        @udtf(returnType="a: int")
-        class TestUDTF:
-            def eval(self, a: int):
-                yield a, a + 1

-        with self.assertRaisesRegex(SparkConnectGrpcException, err_msg):
-            TestUDTF(lit(1)).collect()

-    def
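The improved behavior can be sketched as a standalone check. `check_udtf_output` is a hypothetical helper, not the real PySpark code (the actual validation lives in `python/pyspark/worker.py`); it shows the idea of validating each yielded row against the declared schema on the Python side instead of letting the JVM fail later with an `IllegalStateException`:

```python
def check_udtf_output(row, expected_num_cols):
    # Validate a row yielded by a UDTF's eval() against the number of
    # columns declared in the return type, raising a classified error.
    if len(row) != expected_num_cols:
        raise ValueError(
            "[UDTF_RETURN_SCHEMA_MISMATCH] The number of columns in the "
            f"result does not match the specified schema: expected "
            f"{expected_num_cols}, got {len(row)}"
        )
    return row

# A return type of "a: int, b: int" declares two columns, but this
# (toy) eval yields only one.
def eval_udtf(a):
    yield (a,)

raised = False
try:
    for row in eval_udtf(1):
        check_udtf_output(row, expected_num_cols=2)
except ValueError as e:
    raised = "UDTF_RETURN_SCHEMA_MISMATCH" in str(e)
assert raised
```

The point of the PR is that the mismatch now surfaces as a clear, error-classed Python exception at the point of yield, rather than an opaque `java.lang.IllegalStateException` later in the pipeline.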