[GitHub] [spark] LuciferYang commented on a diff in pull request #37721: [SPARK-40272][CORE]Support service port custom with range
LuciferYang commented on code in PR #37721: URL: https://github.com/apache/spark/pull/37721#discussion_r973041128

## core/src/main/scala/org/apache/spark/internal/config/package.scala:
## @@ -2429,4 +2429,18 @@ package object config {
       .version("3.4.0")
       .timeConf(TimeUnit.MILLISECONDS)
       .createWithDefaultString("5s")
+
+  private[spark] val CUSTOM_SERVICE_PORT_ORIGIN =

Review Comment:
> If need limited to (4, 5], the range can be controlled by the conf of spark.port.maxRetries.

I'm not sure whether using `spark.port.maxRetries` to control the port range is a good idea; it looks more like a trick.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
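The retry behavior behind this objection can be sketched: Spark's port-binding logic tries `port, port+1, ...` for up to `spark.port.maxRetries` attempts, so the retry count indirectly bounds the bindable range. A hedged, self-contained sketch; the helper name `candidatePorts` and the wrap-around formula are illustrative of the general approach, not Spark's exact implementation:

```scala
// Candidate ports a service may bind to when retrying, mimicking the
// try-next-port loop controlled by spark.port.maxRetries.
// candidatePorts is a hypothetical stand-in, not a Spark API.
def candidatePorts(startPort: Int, maxRetries: Int): Seq[Int] =
  (0 to maxRetries).map { offset =>
    if (startPort == 0) 0 // port 0: the OS picks an ephemeral port each attempt
    else ((startPort + offset - 1024) % (65536 - 1024)) + 1024 // stay in the non-privileged range
  }

// With startPort = 4041 and spark.port.maxRetries = 4, the service may end up
// anywhere in [4041, 4045] -- the range is only an indirect side effect of the
// retry count, which is the "trick" the review comment objects to.
```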
[GitHub] [spark] AmplabJenkins commented on pull request #37909: [SPARK-40468][SQL] Fix column pruning in CSV when _corrupt_record is selected
AmplabJenkins commented on PR #37909: URL: https://github.com/apache/spark/pull/37909#issuecomment-1249391483 Can one of the admins verify this patch?
[GitHub] [spark] MaxGekk commented on pull request #37909: [SPARK-40468][SQL] Fix column pruning in CSV when _corrupt_record is selected
MaxGekk commented on PR #37909: URL: https://github.com/apache/spark/pull/37909#issuecomment-1249401934 @sadikovi Thanks for the ping. I will look at it soon.
[GitHub] [spark] LuciferYang commented on pull request #37862: [MINOR][SQL] Remove an unnecessary parameter of the PartitionedFileUtil.splitFiles
LuciferYang commented on PR #37862: URL: https://github.com/apache/spark/pull/37862#issuecomment-1249419779

> Seems OK. There's no reason to expect external code would call this method right?

Although this is not a public API, it is still used by third-party projects based on Spark, for example:
- [NVIDIA/spark-rapids](https://github.com/NVIDIA/spark-rapids) https://github.com/NVIDIA/spark-rapids/blob/fb86a1a8042f241b31d29f2e48ef73820be734d7/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileSourceScanExec.scala#L479-L485
- [gluten](https://github.com/oap-project/gluten) https://github.com/oap-project/gluten/blob/2e0f16bbdbba4edd70846123147e24d0b90ce833/jvm/src/main/scala/io/glutenproject/utils/InputPartitionsUtil.scala#L43-L51
[GitHub] [spark] cloud-fan commented on pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database
cloud-fan commented on PR #37679: URL: https://github.com/apache/spark/pull/37679#issuecomment-1249421287

> override val defaultNamespace: Array[String] = Array(SQLConf.get.defaultDatabase)

Yes
[GitHub] [spark] cloud-fan commented on a diff in pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database
cloud-fan commented on code in PR #37679: URL: https://github.com/apache/spark/pull/37679#discussion_r973074675

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:
## @@ -286,7 +284,7 @@ class SessionCatalog(
   def dropDatabase(db: String, ignoreIfNotExists: Boolean, cascade: Boolean): Unit = {
     val dbName = format(db)
-    if (dbName == DEFAULT_DATABASE) {
+    if (dbName == defaultDatabase) {

Review Comment:
This is different. Let's say the current database is `abc` and the default database is `xyz`. Can we drop `xyz`?
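The semantic question can be made concrete with a toy guard, under my assumption (not stated in the PR) that the drop check simply compares the formatted name against the configured default; `format` and `dropAllowed` are illustrative stand-ins, not `SessionCatalog`'s API:

```scala
// Toy model of the changed guard: dropDatabase rejects the *configured
// default* database, regardless of which database is currently selected.
def format(db: String): String = db.toLowerCase

def dropAllowed(db: String, defaultDatabase: String): Boolean =
  format(db) != format(defaultDatabase)

// cloud-fan's scenario: current database "abc", default database "xyz".
// Dropping "xyz" is still rejected, even though it is not the current one.
```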
[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r973076852

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
## @@ -869,26 +873,55 @@ class Analyzer(override val catalogManager: CatalogManager)
   def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning(
     _.containsPattern(UNPIVOT), ruleId) {
-    // once children and ids are resolved, we can determine values, if non were given
-    case up: Unpivot if up.childrenResolved && up.ids.forall(_.resolved) && up.values.isEmpty =>
-      up.copy(values = up.child.output.diff(up.ids))
+    // once children are resolved, we can determine values from ids and vice versa
+    // if only either is given
+    case up: Unpivot if up.childrenResolved &&
+        up.ids.exists(_.forall(_.resolved)) && up.values.isEmpty =>
+      up.copy(values =
+        Some(
+          up.child.output.diff(up.ids.get.flatMap(_.references))

Review Comment:
I mean, how do other databases infer the id columns? Is it `allOutput -- valueColumns` or `allOutput -- valueColumns.flatMap(_.references)`?
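The two inference strategies in the question can be contrasted with a toy model; `Expr` here is an illustrative stand-in for Catalyst expressions, which may reference attributes without being attributes themselves:

```scala
// A value "column" can be an expression (e.g. a function call or alias)
// whose underlying attribute differs from the expression itself.
case class Expr(name: String, references: Set[String])

val allOutput = Seq("id", "year", "earnings")
val valueColumns = Seq(Expr("round(earnings)", Set("earnings")))

// allOutput -- valueColumns: expression-level diff; the expression string
// matches nothing in the output, so "earnings" would wrongly survive.
val byExpr = allOutput.diff(valueColumns.map(_.name))

// allOutput -- valueColumns.flatMap(_.references): attribute-level diff;
// "earnings" is excluded because the value expression references it.
val byRefs = allOutput.diff(valueColumns.flatMap(_.references))
```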
[GitHub] [spark] srowen commented on pull request #37862: [MINOR][SQL] Remove an unnecessary parameter of the PartitionedFileUtil.splitFiles
srowen commented on PR #37862: URL: https://github.com/apache/spark/pull/37862#issuecomment-1249454630

OK, let's leave it if there's any doubt: it's just not worth messing with libraries that use even non-public APIs.
[GitHub] [spark] viirya commented on pull request #37879: [SPARK-40425][SQL] DROP TABLE does not need to do table lookup
viirya commented on PR #37879: URL: https://github.com/apache/spark/pull/37879#issuecomment-1249565075

One pyspark error, although it looks like a real failure, seems unrelated?

```
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_spark_functions.py", line 28, in test_repeat
    self.assertTrue(spark_column_equals(SF.repeat(F.lit(1), 2), F.repeat(F.lit(1), 2)))
AssertionError: False is not true
```
[GitHub] [spark] sunchao commented on pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema
sunchao commented on PR #37881: URL: https://github.com/apache/spark/pull/37881#issuecomment-1249640137 Thanks! merged to master/branch-3.3 (test failure unrelated).
[GitHub] [spark] sunchao closed pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema
sunchao closed pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema URL: https://github.com/apache/spark/pull/37881
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`
HeartSaVioR commented on code in PR #37905: URL: https://github.com/apache/spark/pull/37905#discussion_r973345153

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:
## @@ -590,7 +591,7 @@ class MicroBatchExecution(
     val newBatchesPlan = logicalPlan transform {
       // For v1 sources.
       case StreamingExecutionRelation(source, output, catalogTable) =>
-        newData.get(source).map { dataPlan =>
+        mutableNewData.get(source).map { dataPlan =>

Review Comment:
Maybe we want to check the self-union / self-join case to verify we really didn't break things. This works only when `leaf : source = 1 : 1` holds (otherwise we are overwriting the value in the map), while the code comment of ProgressReporter says there are counterexamples.
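The overwriting concern can be shown in miniature: when a self-union makes two plan leaves share one source, a `Map` keyed by source can hold only one per-leaf value. This is a simplified stand-in for the `newData` map, not the streaming code itself:

```scala
// Self-union: two leaves in the logical plan, but a single Source instance.
val source = new Object // stand-in for a streaming Source
val leaves = Seq(("leftLeafPlan", source), ("rightLeafPlan", source))

// Keying per-leaf data by source collapses the two leaves into one entry;
// the second insertion silently overwrites the first, so leaf : source
// is no longer 1 : 1.
val newData = leaves.map { case (leafPlan, src) => src -> leafPlan }.toMap
```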
[GitHub] [spark] AmplabJenkins commented on pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`
AmplabJenkins commented on PR #37905: URL: https://github.com/apache/spark/pull/37905#issuecomment-1249515682 Can one of the admins verify this patch?
[GitHub] [spark] huanliwang-db opened a new pull request, #37917: [WIP][SPARK-40466][SS] Improve the error message when DSv2 is disabled whi…
huanliwang-db opened a new pull request, #37917: URL: https://github.com/apache/spark/pull/37917

…le DSv1 is not available.

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] MaxGekk commented on a diff in pull request #37916: [SPARK-40473][SQL] Migrate parsing errors onto error classes
MaxGekk commented on code in PR #37916: URL: https://github.com/apache/spark/pull/37916#discussion_r973207994

## sql/core/src/test/resources/sql-tests/results/comments.sql.out:
## @@ -132,20 +132,9 @@ select 1 as a
 struct<>
 -- !query output
 org.apache.spark.sql.catalyst.parser.ParseException
-
-Unclosed bracketed comment(line 3, pos 0)
-
-== SQL ==
-/*abc*/
-select 1 as a
-/*
-^^^
-
-2 as b
-/*abc*/
-, 3 as c
-
-/**/
+{
+  "errorClass" : "_LEGACY_ERROR_TEMP_055"

Review Comment:
Just in case, users will see the same error message by default (not JSON).
[GitHub] [spark] MaxGekk commented on a diff in pull request #37916: [SPARK-40473][SQL] Migrate parsing errors onto error classes
MaxGekk commented on code in PR #37916: URL: https://github.com/apache/spark/pull/37916#discussion_r973207413

## sql/core/src/test/resources/sql-tests/results/comments.sql.out:
## @@ -132,20 +132,9 @@ select 1 as a
 struct<>
 -- !query output
 org.apache.spark.sql.catalyst.parser.ParseException
-
-Unclosed bracketed comment(line 3, pos 0)
-
-== SQL ==
-/*abc*/
-select 1 as a
-/*
-^^^
-
-2 as b
-/*abc*/
-, 3 as c
-
-/**/
+{
+  "errorClass" : "_LEGACY_ERROR_TEMP_055"

Review Comment:
Yes. We don't provide context in many places. This is one of them.
[GitHub] [spark] dongjoon-hyun commented on pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema
dongjoon-hyun commented on PR #37881: URL: https://github.com/apache/spark/pull/37881#issuecomment-1249698632 Thank you, @sunchao and all!
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`
HeartSaVioR commented on code in PR #37905: URL: https://github.com/apache/spark/pull/37905#discussion_r973335538

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:
## @@ -590,7 +591,7 @@ class MicroBatchExecution(
     val newBatchesPlan = logicalPlan transform {
       // For v1 sources.
       case StreamingExecutionRelation(source, output, catalogTable) =>
-        newData.get(source).map { dataPlan =>
+        mutableNewData.get(source).map { dataPlan =>

Review Comment:
Yeah, you're right. I missed that. Btw, it looks like my change (tagging catalogTable into LogicalRelation) will also fall into this bug. Thanks for fixing this.
[GitHub] [spark] Yaohua628 commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`
Yaohua628 commented on code in PR #37905: URL: https://github.com/apache/spark/pull/37905#discussion_r973236942

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:
## @@ -590,7 +591,7 @@ class MicroBatchExecution(
     val newBatchesPlan = logicalPlan transform {
       // For v1 sources.
       case StreamingExecutionRelation(source, output, catalogTable) =>
-        newData.get(source).map { dataPlan =>
+        mutableNewData.get(source).map { dataPlan =>

Review Comment:
Thanks, I initially thought about that, but we need to know the `output` from `StreamingExecutionRelation(source, output, catalogTable)` to resolve `_metadata`, right (L591 ~ L593)?
[GitHub] [spark] huanliwang-db commented on pull request #37917: [SPARK-40466][SS] Improve the error message when DSv2 is disabled whi…
huanliwang-db commented on PR #37917: URL: https://github.com/apache/spark/pull/37917#issuecomment-1249708310 @HeartSaVioR Please review this change
[GitHub] [spark] MaxGekk commented on a diff in pull request #37916: [SPARK-40473][SQL] Migrate parsing errors onto error classes
MaxGekk commented on code in PR #37916: URL: https://github.com/apache/spark/pull/37916#discussion_r973207413

## sql/core/src/test/resources/sql-tests/results/comments.sql.out:
## @@ -132,20 +132,9 @@ select 1 as a
 struct<>
 -- !query output
 org.apache.spark.sql.catalyst.parser.ParseException
-
-Unclosed bracketed comment(line 3, pos 0)
-
-== SQL ==
-/*abc*/
-select 1 as a
-/*
-^^^
-
-2 as b
-/*abc*/
-, 3 as c
-
-/**/
+{
+  "errorClass" : "_LEGACY_ERROR_TEMP_055"

Review Comment:
Yes. We don't provide context in many places (I mean QueryContext, not line/pos). This is one of them.
[GitHub] [spark] dtenedor commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework
dtenedor commented on code in PR #37840: URL: https://github.com/apache/spark/pull/37840#discussion_r973305583

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/package.scala:
## @@ -45,18 +51,101 @@ package object analysis {
       throw new AnalysisException(msg, t.origin.line, t.origin.startPosition)
     }

-    /** Fails the analysis at the point where a specific tree node was parsed. */
+    /** Fails the analysis at the point where a specific tree node was parsed with a given cause. */
     def failAnalysis(msg: String, cause: Throwable): Nothing = {
       throw new AnalysisException(msg, t.origin.line, t.origin.startPosition, cause = Some(cause))
     }

+    /**
+     * Fails the analysis at the point where a specific tree node was parsed using a provided
+     * error class and message parameters.
+     */
     def failAnalysis(errorClass: String, messageParameters: Map[String, String]): Nothing = {
       throw new AnalysisException(
         errorClass = errorClass,
         messageParameters = messageParameters,
         origin = t.origin)
     }

+    /**
+     * Fails the analysis at the point where a specific tree node was parsed using a provided
+     * error class and subclass and message parameters.
+     */
+    def failAnalysis(
+        errorClass: String,
+        errorSubClass: String,
+        messageParameters: Map[String, String] = Map.empty[String, String]): Nothing = {
+      throw new AnalysisException(
+        errorClass = errorClass,
+        errorSubClass = errorSubClass,
+        messageParameters = messageParameters,
+        origin = t.origin)
+    }

+    /**
+     * Fails the analysis at the point where a specific tree node was parsed using a provided
+     * error class and subclass and one message parameter comprising a plan string. The plan string
+     * will be printed in the error message if and only if the corresponding Spark configuration is
+     * enabled.
+     */
+    def failAnalysis(
+        errorClass: String,
+        errorSubClass: String,
+        treeNodes: Seq[TreeNode[_]]): Nothing = {
+      // Normalize expression IDs in the query plan to keep tests deterministic.

Review Comment:
Hi @gengliangwang I looked at this at length some more, and I was able to reuse the existing plan `canonicalize` method after all, using `AnalysisHelper.allowInvokingTransformsInAnalyzer` in one specific place instead. Now there is no need to introduce new code for this purpose. Please take another look.
[GitHub] [spark] AmplabJenkins commented on pull request #37899: [SPARK-40455][CORE]Abort result stage directly when it failed caused by FetchFailedException
AmplabJenkins commented on PR #37899: URL: https://github.com/apache/spark/pull/37899#issuecomment-1249717441 Can one of the admins verify this patch?
[GitHub] [spark] MaxGekk commented on pull request #37916: [SPARK-40473][SQL] Migrate parsing errors onto error classes
MaxGekk commented on PR #37916: URL: https://github.com/apache/spark/pull/37916#issuecomment-1249473715 cc @srielau @anchovYu Could you take a look at the PR, please?
[GitHub] [spark] srielau commented on a diff in pull request #37916: [SPARK-40473][SQL] Migrate parsing errors onto error classes
srielau commented on code in PR #37916: URL: https://github.com/apache/spark/pull/37916#discussion_r973178090

## sql/core/src/test/resources/sql-tests/results/comments.sql.out:
## @@ -132,20 +132,9 @@ select 1 as a
 struct<>
 -- !query output
 org.apache.spark.sql.catalyst.parser.ParseException
-
-Unclosed bracketed comment(line 3, pos 0)
-
-== SQL ==
-/*abc*/
-select 1 as a
-/*
-^^^
-
-2 as b
-/*abc*/
-, 3 as c
-
-/**/
+{
+  "errorClass" : "_LEGACY_ERROR_TEMP_055"

Review Comment:
Did we lose the context?
[GitHub] [spark] gengliangwang commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework
gengliangwang commented on code in PR #37840: URL: https://github.com/apache/spark/pull/37840#discussion_r973308641

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
## @@ -730,6 +729,13 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
     }
   }

+  private def canonicalizeForError(expr: LogicalPlan): String =
+    if (SQLConf.get.getConf(SQLConf.CANONICALIZE_PLANS_IN_ERRORS)) {

Review Comment:
Actually we don't need such a new config. We can just check `Utils.isTesting`.
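The suggestion can be sketched as follows; `isTesting` here mimics Spark's `Utils.isTesting` (which consults system properties), and the `#123 -> #x` normalization is an illustrative example of keeping auto-generated expression IDs deterministic in test output, not the PR's exact code:

```scala
// Gate plan canonicalization on a test-mode check instead of a new SQL config.
def isTesting: Boolean = sys.props.contains("spark.testing")

// Normalize auto-generated expression IDs (e.g. a#123) only under tests,
// so golden-file error messages stay deterministic across runs.
def canonicalizeForError(planString: String, testing: Boolean = isTesting): String =
  if (testing) planString.replaceAll("#\\d+", "#x") else planString
```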
[GitHub] [spark] gengliangwang commented on pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework
gengliangwang commented on PR #37840: URL: https://github.com/apache/spark/pull/37840#issuecomment-1249703749 LGTM except one comment
[GitHub] [spark] Yaohua628 commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`
Yaohua628 commented on code in PR #37905: URL: https://github.com/apache/spark/pull/37905#discussion_r973339562

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:
## @@ -590,7 +591,7 @@ class MicroBatchExecution(
     val newBatchesPlan = logicalPlan transform {
       // For v1 sources.
       case StreamingExecutionRelation(source, output, catalogTable) =>
-        newData.get(source).map { dataPlan =>
+        mutableNewData.get(source).map { dataPlan =>

Review Comment:
Np - an unintentional fix :-) Thanks for helping!
[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax
EnricoMi commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r973273719

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
## @@ -869,26 +873,55 @@ class Analyzer(override val catalogManager: CatalogManager)
   def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning(
     _.containsPattern(UNPIVOT), ruleId) {
-    // once children and ids are resolved, we can determine values, if non were given
-    case up: Unpivot if up.childrenResolved && up.ids.forall(_.resolved) && up.values.isEmpty =>
-      up.copy(values = up.child.output.diff(up.ids))
+    // once children are resolved, we can determine values from ids and vice versa
+    // if only either is given
+    case up: Unpivot if up.childrenResolved &&
+        up.ids.exists(_.forall(_.resolved)) && up.values.isEmpty =>
+      up.copy(values =
+        Some(
+          up.child.output.diff(up.ids.get.flatMap(_.references))

Review Comment:
The former.
[GitHub] [spark] Yaohua628 commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`
Yaohua628 commented on code in PR #37905: URL: https://github.com/apache/spark/pull/37905#discussion_r973485968 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala: ## @@ -590,7 +591,7 @@ class MicroBatchExecution( val newBatchesPlan = logicalPlan transform { // For v1 sources. case StreamingExecutionRelation(source, output, catalogTable) => -newData.get(source).map { dataPlan => +mutableNewData.get(source).map { dataPlan => Review Comment: Got it - could you share an example? In this case, does that mean the `leaf : source = 1 : N`?
[GitHub] [spark] parthchandra commented on pull request #37558: [SPARK-38954][CORE] Implement sharing of cloud credentials among driver and executors
parthchandra commented on PR #37558: URL: https://github.com/apache/spark/pull/37558#issuecomment-1249925342 I like the idea of having an authentication-agnostic credentials manager. I would have done it exactly as you are suggesting, except that my knowledge of Kerberos is not very deep and I have limited ability to test all the implementations that currently depend on HadoopDelegationTokenManager, so I decided not to touch the existing implementation (well, not too much anyway). Also, I am new to the SPIP process, so some guidance would be very welcome. I can start a document, but it will need some homework on my end before it can be discussed publicly. Meanwhile, what would be your recommendation regarding this PR?
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37348: [SPARK-39854][SQL] replaceWithAliases should keep the original children for Generate
dongjoon-hyun commented on code in PR #37348: URL: https://github.com/apache/spark/pull/37348#discussion_r973512461 ## sql/core/src/test/scala/org/apache/spark/sql/execution/SparkPlanSuite.scala: ## @@ -143,6 +143,48 @@ class SparkPlanSuite extends QueryTest with SharedSparkSession { } } } + + test("SPARK-39854: replaceWithAliases should keep the order of Generate children") { Review Comment: Thank you for providing the end-to-end test. Can we have a test case in `NestedColumnAliasingSuite.scala` because we are touching `NestedColumnAliasing`?
[GitHub] [spark] zhengruifeng opened a new pull request, #37918: [SPARK-40476][ML][SQL] Reduce the shuffle size of ALS
zhengruifeng opened a new pull request, #37918: URL: https://github.com/apache/spark/pull/37918 ### What changes were proposed in this pull request? Implement a new expression `CollectTopK`, which uses an `Array` instead of a `BoundedPriorityQueue` in ser/deser. ### Why are the changes needed? To reduce the shuffle size of ALS. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing test suites.
[GitHub] [spark] viirya commented on pull request #37348: [SPARK-39854][SQL] replaceWithAliases should keep the original children for Generate
viirya commented on PR #37348: URL: https://github.com/apache/spark/pull/37348#issuecomment-1249961949 I'll take a look today or tomorrow. Thanks @dongjoon-hyun
[GitHub] [spark] zhengruifeng commented on pull request #37897: [SPARK-40445][PS] Refactor `Resampler` for consistency and simplicity
zhengruifeng commented on PR #37897: URL: https://github.com/apache/spark/pull/37897#issuecomment-1249934673 Merged into master, thank you @itholic @HyukjinKwon
[GitHub] [spark] zhengruifeng closed pull request #37897: [SPARK-40445][PS] Refactor `Resampler` for consistency and simplicity
zhengruifeng closed pull request #37897: [SPARK-40445][PS] Refactor `Resampler` for consistency and simplicity URL: https://github.com/apache/spark/pull/37897
[GitHub] [spark] dongjoon-hyun commented on pull request #37348: [SPARK-39854][SQL] replaceWithAliases should keep the original children for Generate
dongjoon-hyun commented on PR #37348: URL: https://github.com/apache/spark/pull/37348#issuecomment-1249962295 Thank you!
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37879: [SPARK-40425][SQL] DROP TABLE does not need to do table lookup
dongjoon-hyun commented on code in PR #37879: URL: https://github.com/apache/spark/pull/37879#discussion_r973544556 ## core/src/test/scala/org/apache/spark/SparkFunSuite.scala: ## @@ -299,10 +299,15 @@ abstract class SparkFunSuite parameters: Map[String, String] = Map.empty, matchPVals: Boolean = false, queryContext: Array[QueryContext] = Array.empty): Unit = { -assert(exception.getErrorClass === errorClass) +val mainErrorClass :: tail = errorClass.split("\\.").toList +assert(tail.isEmpty || tail.length == 1) +// TODO: remove the `errorSubClass` parameter. Review Comment: Just a nit: if we use an IDed TODO with the JIRA id, a contributor can pick up the item more easily.
[GitHub] [spark] cloud-fan commented on pull request #37896: Revert [SPARK-24544][SQL] Print actual failure cause when look up function failed
cloud-fan commented on PR #37896: URL: https://github.com/apache/spark/pull/37896#issuecomment-1249973445 also cc @viirya
[GitHub] [spark] viirya commented on pull request #37896: Revert [SPARK-24544][SQL] Print actual failure cause when look up function failed
viirya commented on PR #37896: URL: https://github.com/apache/spark/pull/37896#issuecomment-1250005249 Oh, that's right. SPARK-24544 was a long time ago; it is better to have a new JIRA.
[GitHub] [spark] dongjoon-hyun closed pull request #37892: [SPARK-40436][BUILD] Upgrade Scala to 2.12.17
dongjoon-hyun closed pull request #37892: [SPARK-40436][BUILD] Upgrade Scala to 2.12.17 URL: https://github.com/apache/spark/pull/37892
[GitHub] [spark] mridulm commented on pull request #37899: [SPARK-40455][CORE]Abort result stage directly when it failed caused by FetchFailedException
mridulm commented on PR #37899: URL: https://github.com/apache/spark/pull/37899#issuecomment-1249909995 I will take a look at this PR hopefully next week. +CC @Ngone51
[GitHub] [spark] zhengruifeng closed pull request #37913: [SPARK-40447][PS] Implement `kendall` correlation in `DataFrame.corr`
zhengruifeng closed pull request #37913: [SPARK-40447][PS] Implement `kendall` correlation in `DataFrame.corr` URL: https://github.com/apache/spark/pull/37913
[GitHub] [spark] zhengruifeng commented on pull request #37913: [SPARK-40447][PS] Implement `kendall` correlation in `DataFrame.corr`
zhengruifeng commented on PR #37913: URL: https://github.com/apache/spark/pull/37913#issuecomment-1249935182 Merged into master, thank you @HyukjinKwon
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`
HeartSaVioR commented on code in PR #37905: URL: https://github.com/apache/spark/pull/37905#discussion_r973510971 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala: ## @@ -590,7 +591,7 @@ class MicroBatchExecution( val newBatchesPlan = logicalPlan transform { // For v1 sources. case StreamingExecutionRelation(source, output, catalogTable) => -newData.get(source).map { dataPlan => +mutableNewData.get(source).map { dataPlan => Review Comment: The code comment actually doesn't say much, and I'm speculating. Let's just try a best effort: self-union and self-join. `df = spark.readStream... -> df.union(df)` / `df = spark.readStream... -> df.join(df)`
[GitHub] [spark] alex-balikov commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark
alex-balikov commented on code in PR #37893: URL: https://github.com/apache/spark/pull/37893#discussion_r973478091 ## python/pyspark/sql/pandas/group_ops.py: ## @@ -216,6 +218,105 @@ def applyInPandas( jdf = self._jgd.flatMapGroupsInPandas(udf_column._jc.expr()) return DataFrame(jdf, self.session) +def applyInPandasWithState( +self, +func: "PandasGroupedMapFunctionWithState", +outputStructType: Union[StructType, str], +stateStructType: Union[StructType, str], +outputMode: str, +timeoutConf: str, +) -> DataFrame: +""" +Applies the given function to each group of data, while maintaining a user-defined +per-group state. The result Dataset will represent the flattened record returned by the +function. + +For a streaming Dataset, the function will be invoked for each group repeatedly in every +trigger, and updates to each group's state will be saved across invocations. The function +will also be invoked for each timed-out state repeatedly. The sequence of the invocation +will be input data -> state timeout. When the function is invoked for state timeout, there +will be no data being presented. + +The function should takes parameters (key, Iterator[`pandas.DataFrame`], state) and +returns another Iterator[`pandas.DataFrame`]. The grouping key(s) will be passed as a tuple +of numpy data types, e.g., `numpy.int32` and `numpy.float64`. The state will be passed as +:class:`pyspark.sql.streaming.state.GroupStateImpl`. + +For each group, all columns are passed together as `pandas.DataFrame` to the user-function, +and the returned `pandas.DataFrame` across all invocations are combined as a +:class:`DataFrame`. Note that the user function should loop through and process all +elements in the iterator. The user function should not make a guess of the number of +elements in the iterator. + +The `outputStructType` should be a :class:`StructType` describing the schema of all +elements in returned value, `pandas.DataFrame`.
The column labels of all elements in +returned value, `pandas.DataFrame` must either match the field names in the defined +schema if specified as strings, or match the field data types by position if not strings, +e.g. integer indices. + +The `stateStructType` should be :class:`StructType` describing the schema of user-defined +state. The value of state will be presented as a tuple, as well as the update should be +performed with the tuple. User defined types e.g. native Python class types are not +supported. Alternatively, you can pickle the data and produce the data as BinaryType, but Review Comment: 'Alternatively, you can pickle the data ...' - instead say 'For such cases, the user should pickle the data into BinaryType. Note that this approach may be sensitive to backwards and forward compatibility issues of Python pickle, and Spark cannot guarantee compatibility.' Though I think you could drop the note, as that is orthogonal to Spark. ## python/pyspark/sql/pandas/group_ops.py: ## @@ -216,6 +218,105 @@ def applyInPandas( jdf = self._jgd.flatMapGroupsInPandas(udf_column._jc.expr()) return DataFrame(jdf, self.session) +def applyInPandasWithState( +self, +func: "PandasGroupedMapFunctionWithState", +outputStructType: Union[StructType, str], +stateStructType: Union[StructType, str], +outputMode: str, +timeoutConf: str, +) -> DataFrame: +""" +Applies the given function to each group of data, while maintaining a user-defined +per-group state. The result Dataset will represent the flattened record returned by the +function. + +For a streaming Dataset, the function will be invoked for each group repeatedly in every +trigger, and updates to each group's state will be saved across invocations. The function +will also be invoked for each timed-out state repeatedly. The sequence of the invocation +will be input data -> state timeout. When the function is invoked for state timeout, there +will be no data being presented.
+ +The function should takes parameters (key, Iterator[`pandas.DataFrame`], state) and +returns another Iterator[`pandas.DataFrame`]. The grouping key(s) will be passed as a tuple +of numpy data types, e.g., `numpy.int32` and `numpy.float64`. The state will be passed as +:class:`pyspark.sql.streaming.state.GroupStateImpl`. + +For each group, all columns are passed together as `pandas.DataFrame` to the user-function, +and the returned `pandas.DataFrame` across all invocations are combined as a +:class:`DataFrame`. Note that
[GitHub] [spark] viirya commented on a diff in pull request #37896: Revert [SPARK-24544][SQL] Print actual failure cause when look up function failed
viirya commented on code in PR #37896: URL: https://github.com/apache/spark/pull/37896#discussion_r973528716 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala: ## @@ -1588,10 +1587,9 @@ class SessionCatalog( TableFunctionRegistry.builtin.functionExists(name) } - protected[sql] def failFunctionLookup( Review Comment: Hmm, where do we set `cause`? As I see, we always call `failFunctionLookup` without `cause` (except for the test case). ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala: ## @@ -74,19 +74,17 @@ case class NoSuchPartitionException( case class NoSuchPermanentFunctionException(db: String, func: String) extends AnalysisException(s"Function '$func' not found in database '$db'") -case class NoSuchFunctionException( -override val message: String, -override val cause: Option[Throwable]) - extends AnalysisException(message, cause = cause) { +case class NoSuchFunctionException(override val message: String) + extends AnalysisException(message) { - def this(db: String, func: String, cause: Option[Throwable] = None) = { + def this(db: String, func: String) = { this(s"Undefined function: '$func'. " + s"This function is neither a registered temporary function nor " + Review Comment: nit: ```suggestion "This function is neither a registered temporary function nor " + ```
[GitHub] [spark] AmplabJenkins commented on pull request #37887: [SPARK-40360] [WIP] ALREADY_EXISTS and NOT_FOUND exceptions
AmplabJenkins commented on PR #37887: URL: https://github.com/apache/spark/pull/37887#issuecomment-1249988028 Can one of the admins verify this patch?
[GitHub] [spark] mridulm commented on a diff in pull request #37533: [SPARK-40096]Fix finalize shuffle stage slow due to connection creation slow
mridulm commented on code in PR #37533: URL: https://github.com/apache/spark/pull/37533#discussion_r973488467 ## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -4443,36 +4443,115 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti assert(mapStatuses.count(s => s != null && s.location.executorId == "hostB-exec") === 1) } - test("SPARK-40096: Send finalize events even if shuffle merger blocks indefinitely") { + test("SPARK-40096: Send finalize events even if shuffle merger blocks indefinitely " + +"with registerMergeResults is true") { initPushBasedShuffleConfs(conf) +sc.conf.set("spark.shuffle.push.results.timeout", "1s") +val myScheduler = new MyDAGScheduler( + sc, + taskScheduler, + sc.listenerBus, + mapOutputTracker, + blockManagerMaster, + sc.env, + shuffleMergeFinalize = false) + +val mergerLocs = Seq(makeBlockManagerId("hostA"), makeBlockManagerId("hostB")) +val timeoutSecs = 1 +val sendRequestsLatch = new CountDownLatch(mergerLocs.size) +val completeLatch = new CountDownLatch(mergerLocs.size) +val canSendRequestLatch = new CountDownLatch(1) + val blockStoreClient = mock(classOf[ExternalBlockStoreClient]) val blockStoreClientField = classOf[BlockManager].getDeclaredField("blockStoreClient") blockStoreClientField.setAccessible(true) blockStoreClientField.set(sc.env.blockManager, blockStoreClient) + val sentHosts = ArrayBuffer[String]() +var hostAInterrupted = false doAnswer { (invoke: InvocationOnMock) => val host = invoke.getArgument[String](0) - sentHosts += host - // Block FinalizeShuffleMerge rpc for 2 seconds - if (invoke.getArgument[String](0) == "hostA") { -Thread.sleep(2000) + sendRequestsLatch.countDown() + try { +if (host == "hostA") { + canSendRequestLatch.await(timeoutSecs * 2, TimeUnit.SECONDS) +} +sentHosts += host + } catch { +case _: InterruptedException => hostAInterrupted = true + } finally { +completeLatch.countDown() } }.when(blockStoreClient).finalizeShuffleMerge(any(), any(), any(), any(), any()) val shuffleMapRdd = new MyRDD(sc, 1, Nil) val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(2)) -shuffleDep.setMergerLocs(Seq(makeBlockManagerId("hostA"), makeBlockManagerId("hostB"))) -val shuffleStage = scheduler.createShuffleMapStage(shuffleDep, 0) - -Seq(true, false).foreach { registerMergeResults => - sentHosts.clear() - scheduler.finalizeShuffleMerge(shuffleStage, registerMergeResults) - verify(blockStoreClient, times(2)) -.finalizeShuffleMerge(any(), any(), any(), any(), any()) - assert((sentHosts diff Seq("hostA", "hostB")).isEmpty) - reset(blockStoreClient) -} +shuffleDep.setMergerLocs(mergerLocs) +val shuffleStage = myScheduler.createShuffleMapStage(shuffleDep, 0) + +myScheduler.finalizeShuffleMerge(shuffleStage, true) +sendRequestsLatch.await() +verify(blockStoreClient, times(2)) + .finalizeShuffleMerge(any(), any(), any(), any(), any()) +assert(sentHosts === Seq("hostB")) +completeLatch.await() +assert(hostAInterrupted) + } + + test("SPARK-40096: Send finalize events even if shuffle merger blocks indefinitely " + +"with registerMergeResults is false") { Review Comment: Can we merge this test with the previous one? Essentially, something like: ``` Seq(true, false) { registerMergeResults => test("SPARK-40096: Send finalize events even if shuffle merger blocks indefinitely " + s"with registerMergeResults is $registerMergeResults") { myScheduler.finalizeShuffleMerge(shuffleStage, registerMergeResults) ... } ```
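The parameterized-test pattern suggested in this review — one shared test body generated once per flag value instead of two near-duplicate tests — can be sketched outside Spark in plain Python. Everything below is illustrative (the `finalize_shuffle_merge` stand-in is a hypothetical toy, not Spark's scheduler code):

```python
import unittest

def finalize_shuffle_merge(register_merge_results):
    """Toy stand-in for the call under test: pretend hostA hangs and is
    skipped only when we wait to register merge results."""
    hosts = ["hostA", "hostB"]
    return [h for h in hosts if not (register_merge_results and h == "hostA")]

class FinalizeSuite(unittest.TestCase):
    pass

def _make_test(register):
    # Shared test body, parameterized by the flag.
    def test(self):
        sent = finalize_shuffle_merge(register)
        # hostB must always receive the finalize event, regardless of the flag.
        self.assertIn("hostB", sent)
    return test

# Generate one test per flag value, mirroring Seq(true, false).foreach { ... }.
for flag in (True, False):
    setattr(FinalizeSuite, f"test_finalize_register_{flag}", _make_test(flag))
```

In Scala the same shape is simply a `Seq(true, false).foreach { registerMergeResults => test(...) { ... } }` wrapper around the shared body, as the comment proposes.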
[GitHub] [spark] mridulm commented on pull request #37533: [SPARK-40096]Fix finalize shuffle stage slow due to connection creation slow
mridulm commented on PR #37533: URL: https://github.com/apache/spark/pull/37533#issuecomment-1249916683 +CC @otterc, @Ngone51 PTAL
[GitHub] [spark] zhengruifeng commented on pull request #37918: [SPARK-40476][ML][SQL] Reduce the shuffle size of ALS
zhengruifeng commented on PR #37918: URL: https://github.com/apache/spark/pull/37918#issuecomment-1249957261 take the [`ALSExample`](https://github.com/apache/spark/blob/e1ea806b3075d279b5f08a29fe4c1ad6d3c4191a/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala) for example: ``` import org.apache.spark.ml.recommendation._ case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long) def parseRating(str: String): Rating = { val fields = str.split("::") assert(fields.size == 4) Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong) } val ratings = spark.read.textFile("data/mllib/als/sample_movielens_ratings.txt").map(parseRating).toDF() val als = new ALS().setMaxIter(1).setRegParam(0.01).setUserCol("userId").setItemCol("movieId").setRatingCol("rating") val model = als.fit(ratings) model.recommendForAllItems(10).collect() ``` before: https://user-images.githubusercontent.com/7322292/190832964-3f31bcd4-6bfb-445a-a339-b415f82719e3.png after: https://user-images.githubusercontent.com/7322292/190832980-d0ccc2d3-f4d9-4801-9046-9d56f2fbab3c.png the shuffle size in this case was reduced from `298.4 KiB` to `130.3 KiB`
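The idea `CollectTopK` relies on — keep only the K best items per partition in a bounded structure, then ship a plain sorted array rather than a serialized priority-queue object — can be illustrated in plain Python. This is a concept sketch under that reading of the PR, not Spark's implementation:

```python
import heapq

def top_k_partition(scores, k):
    """Map side: keep at most k items using a bounded min-heap."""
    heap = []
    for item in scores:
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # Drop the current minimum; the heap never exceeds k entries.
            heapq.heapreplace(heap, item)
    # Emit a plain descending array: cheaper to serialize than the queue object.
    return sorted(heap, reverse=True)

def merge_top_k(partition_arrays, k):
    """Reduce side: merge the small per-partition arrays into the global top-k."""
    merged = heapq.merge(*partition_arrays, reverse=True)
    return list(merged)[:k]
```

Each shuffled record is then just a short array of the k best `(rating, id)` pairs, which matches the shuffle-size reduction reported above.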
[GitHub] [spark] dongjoon-hyun closed pull request #37914: [SPARK-40471][BUILD] Upgrade RoaringBitmap to 0.9.32
dongjoon-hyun closed pull request #37914: [SPARK-40471][BUILD] Upgrade RoaringBitmap to 0.9.32 URL: https://github.com/apache/spark/pull/37914
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37885: [SPARK-40428][CORE][WIP] Fix shutdown hook in the CoarseGrainedSchedulerBackend
dongjoon-hyun commented on code in PR #37885: URL: https://github.com/apache/spark/pull/37885#discussion_r973545013 ## core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala: ## @@ -971,18 +971,30 @@ private[spark] class TaskSchedulerImpl( } override def stop(): Unit = { -speculationScheduler.shutdown() +Utils.tryLogNonFatalError { + speculationScheduler.shutdown() +} if (backend != null) { - backend.stop() + Utils.tryLogNonFatalError { +backend.stop() + } } if (taskResultGetter != null) { - taskResultGetter.stop() + Utils.tryLogNonFatalError { +taskResultGetter.stop() + } } if (barrierCoordinator != null) { - barrierCoordinator.stop() + Utils.tryLogNonFatalError { +barrierCoordinator.stop() + } +} +Utils.tryLogNonFatalError { + starvationTimer.cancel() +} +Utils.tryLogNonFatalError { + abortTimer.cancel() Review Comment: Does `Timer.cancel()` also require `Utils.tryLogNonFatalError`?
[GitHub] [spark] MaxGekk commented on pull request #37916: [SPARK-40473][SQL] Migrate parsing errors onto error classes
MaxGekk commented on PR #37916: URL: https://github.com/apache/spark/pull/37916#issuecomment-1249570897 > You seem to assume < 1000 of these. But just this one PR consumes close to a hundred slots Some time ago, I counted the total number of exceptions to be ported onto error classes at around 800. I don't see any problem with starting to use 4 digits after 999. > How do we keep the numbering straight for the next n PRs? We have 3 types of errors, actually: parsing, compilation and execution. We could migrate them 1-by-1 using sequential numbers; otherwise the PRs will conflict a lot. > I assume you used some tooling to whip this up so fast. I think it will be faster to manually port the exceptions without any tooling.
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`
HeartSaVioR commented on code in PR #37905: URL: https://github.com/apache/spark/pull/37905#discussion_r972646144

## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala:

@@ -524,10 +525,11 @@ class FileMetadataStructSuite extends QueryTest with SharedSparkSession {
       .select("*", "_metadata")
       .writeStream.format("json")
       .option("checkpointLocation", dir.getCanonicalPath + "/target/checkpoint")
+      .trigger(Trigger.Once())

Review Comment: nit: Please change to `Trigger.AvailableNow()` and see whether this breaks the test or not. We are going to produce a deprecation warning for `Trigger.Once()` from Spark 3.4.0.
[GitHub] [spark] cloud-fan commented on a diff in pull request #37625: [SPARK-40177][SQL] Simplify condition of form (a==b) || (a==null && b==null) to a<=>b
cloud-fan commented on code in PR #37625: URL: https://github.com/apache/spark/pull/37625#discussion_r972659580

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala:

@@ -412,6 +412,16 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper {
     }
   }

+  case Or(EqualTo(l, r), And(IsNull(c1), IsNull(c2)))

Review Comment: Assume that we have a chain of predicates combined by OR: `cond1 OR cond2 OR cond3 OR ... condN`. I think we can merge `condX` and `condY` if they are `EqualTo(l, r)` and `And(IsNull(l), isNull(r))`. This is more general than the current approach.
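The rewrite rests on `a <=> b` (EqualNullSafe) selecting the same rows as `(a = b) OR (a IS NULL AND b IS NULL)` under predicate semantics, where a NULL result filters a row out just like FALSE. A small sketch of SQL's three-valued logic that checks this over a toy domain (the function names below are illustrative helpers, not Spark or Catalyst APIs):

```python
def equal_to(a, b):
    """SQL '=' semantics: the result is NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b

def null_safe_equal(a, b):
    """SQL '<=>' (EqualNullSafe): never NULL; two NULLs compare equal."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

def or_(p, q):
    """Three-valued OR: True dominates, otherwise NULL propagates."""
    if p is True or q is True:
        return True
    if p is None or q is None:
        return None
    return False

def and_(p, q):
    """Three-valued AND: False dominates, otherwise NULL propagates."""
    if p is False or q is False:
        return False
    if p is None or q is None:
        return None
    return True

def is_null(a):
    return a is None

# Under predicate semantics (a row passes only when the result is True),
# (a = b) OR (a IS NULL AND b IS NULL) passes exactly the rows a <=> b passes.
for a in [None, 1, 2]:
    for b in [None, 1, 2]:
        rewritten = or_(equal_to(a, b), and_(is_null(a), is_null(b)))
        assert (rewritten is True) == (null_safe_equal(a, b) is True)
print("rewrite selects the same rows as <=>")
```

Note the two forms are not value-identical: for `a=1, b=NULL` the long form yields NULL while `<=>` yields FALSE, but both reject the row in a filter.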
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver
dongjoon-hyun commented on code in PR #37906: URL: https://github.com/apache/spark/pull/37906#discussion_r972668048

## dev/create-release/spark-rm/Dockerfile:

@@ -53,7 +53,7 @@ ARG GEM_PKGS="bundler:2.2.9"
 # the most current package versions (instead of potentially using old versions cached by docker).
 RUN apt-get clean && apt-get update && $APT_INSTALL gnupg ca-certificates && \
   echo 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/' >> /etc/apt/sources.list && \
-  gpg --keyserver keyserver.ubuntu.com --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9 && \
+  gpg --keyserver hkps://keyserver.ubuntu.com --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9 && \

Review Comment: We have the same code in another place. https://github.com/apache/spark/blob/32fbd7e83252f96df9c78f2f15f3917167821a12/dev/infra/Dockerfile#L52
[GitHub] [spark] wangyum commented on a diff in pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver
wangyum commented on code in PR #37906: URL: https://github.com/apache/spark/pull/37906#discussion_r972670730

## dev/create-release/spark-rm/Dockerfile:

@@ -53,7 +53,7 @@ ARG GEM_PKGS="bundler:2.2.9"
 # the most current package versions (instead of potentially using old versions cached by docker).
 RUN apt-get clean && apt-get update && $APT_INSTALL gnupg ca-certificates && \
   echo 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/' >> /etc/apt/sources.list && \
-  gpg --keyserver keyserver.ubuntu.com --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9 && \
+  gpg --keyserver hkps://keyserver.ubuntu.com --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9 && \

Review Comment: OK.
[GitHub] [spark] sadikovi opened a new pull request, #37911: [SPARK-40470] Handle GetArrayStructFields and GetMapValue in "arrays_zip" function
sadikovi opened a new pull request, #37911: URL: https://github.com/apache/spark/pull/37911

### What changes were proposed in this pull request?
This is a follow-up for https://github.com/apache/spark/pull/37833. The PR fixes column names in the `arrays_zip` function for the cases when `GetArrayStructFields` and `GetMapValue` expressions are used (see unit tests for more details). Before the patch, the column names would be indexes, or an AnalysisException would be thrown in the `GetArrayStructFields` case.

### Why are the changes needed?
Fixes an inconsistency issue in Spark 3.2 and onwards where the fields would be labeled as indexes instead of column names.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
I added unit tests that reproduce the issue and confirmed that the patch fixes them.
[GitHub] [spark] Yikun commented on pull request #37888: [SPARK-40196][PYTHON][PS] Consolidate `lit` function with NumPy scalar in sql and pandas module
Yikun commented on PR #37888: URL: https://github.com/apache/spark/pull/37888#issuecomment-1248987419

```
test_repeat (pyspark.pandas.tests.test_spark_functions.SparkFunctionsTests) ... FAIL (0.052s)
==
FAIL [0.052s]: test_repeat (pyspark.pandas.tests.test_spark_functions.SparkFunctionsTests)
--
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_spark_functions.py", line 28, in test_repeat
    self.assertTrue(spark_column_equals(SF.repeat(F.lit(1), 2), F.repeat(F.lit(1), 2)))
AssertionError: False is not true
--
Ran 1 test in 8.471s
```

Still failing due to `test_repeat`. Initial investigation:

```
F.repeat(F.lit(1), 2).__dict__
Out[3]: {'_jc': JavaObject id=o50}

SF.repeat(F.lit(1), 2).__dict__
Out[4]: {'_jc': JavaObject id=o54}
```

According to https://github.com/apache/spark/pull/37888#discussion_r971408190, it looks like we need to remove this test?

```python
class SparkFunctionsTests(PandasOnSparkTestCase):
    def test_repeat(self):
        # TODO: Placeholder
        pass
```
[GitHub] [spark] zhengruifeng commented on pull request #37888: [SPARK-40196][PYTHON][PS] Consolidate `lit` function with NumPy scalar in sql and pandas module
zhengruifeng commented on PR #37888: URL: https://github.com/apache/spark/pull/37888#issuecomment-1248990425

@Yikun Let's comment out this test for now, to unblock other PRs.
[GitHub] [spark] wangyum commented on pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver
wangyum commented on PR #37906: URL: https://github.com/apache/spark/pull/37906#issuecomment-1248993445

> Oh, @wangyum . It seems that you made an accidental commit on the `master` branch.
>
> * [694cac6](https://github.com/apache/spark/commit/694cac63da3bfa651132eca9fee3278544616dc3)

Sorry. I didn't switch to my branch.
[GitHub] [spark] Yikun opened a new pull request, #37912: [SPARK-40196][PYTHON][PS][FOLLOWUP] SparkFunctionsTests.test_repeat
Yikun opened a new pull request, #37912: URL: https://github.com/apache/spark/pull/37912

### What changes were proposed in this pull request?
Mark `SparkFunctionsTests.test_repeat` as a placeholder.

### Why are the changes needed?
```
test_repeat (pyspark.pandas.tests.test_spark_functions.SparkFunctionsTests) ... FAIL (0.052s)
==
FAIL [0.052s]: test_repeat (pyspark.pandas.tests.test_spark_functions.SparkFunctionsTests)
--
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_spark_functions.py", line 28, in test_repeat
    self.assertTrue(spark_column_equals(SF.repeat(F.lit(1), 2), F.repeat(F.lit(1), 2)))
AssertionError: False is not true
--
Ran 1 test in 8.471s
```
According to https://github.com/apache/spark/pull/37888#discussion_r971408190, we'd better skip it first.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed.
[GitHub] [spark] dongjoon-hyun commented on pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver
dongjoon-hyun commented on PR #37906: URL: https://github.com/apache/spark/pull/37906#issuecomment-1248995872

Ya, accidents happen sometimes. No worry. If CI succeeds, nobody gets hurt.
[GitHub] [spark] zhengruifeng closed pull request #37912: [SPARK-40196][PYTHON][PS][FOLLOWUP] Skip SparkFunctionsTests.test_repeat
zhengruifeng closed pull request #37912: [SPARK-40196][PYTHON][PS][FOLLOWUP] Skip SparkFunctionsTests.test_repeat URL: https://github.com/apache/spark/pull/37912
[GitHub] [spark] wangyum commented on pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver
wangyum commented on PR #37906: URL: https://github.com/apache/spark/pull/37906#issuecomment-1249000489

Could we force push to overwrite that commit?
[GitHub] [spark] HyukjinKwon closed pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver
HyukjinKwon closed pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver URL: https://github.com/apache/spark/pull/37906
[GitHub] [spark] HyukjinKwon commented on pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver
HyukjinKwon commented on PR #37906: URL: https://github.com/apache/spark/pull/37906#issuecomment-1249003622

Merged to master.
[GitHub] [spark] dongjoon-hyun commented on pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver
dongjoon-hyun commented on PR #37906: URL: https://github.com/apache/spark/pull/37906#issuecomment-1249003476

Yep, reverting is possible. I'll leave it to you, @HyukjinKwon.
[GitHub] [spark] zhengruifeng opened a new pull request, #37913: [SPARK-40447][PS] Implement `kendall` correlation in `DataFrame.corr`
zhengruifeng opened a new pull request, #37913: URL: https://github.com/apache/spark/pull/37913

### What changes were proposed in this pull request?
Implement `kendall` correlation in `DataFrame.corr`.

### Why are the changes needed?
For API coverage.

### Does this PR introduce _any_ user-facing change?
Yes, a new correlation option:

```
In [1]: import pyspark.pandas as ps

In [2]: df = ps.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)], columns=['dogs', 'cats'])

In [3]: df.corr('kendall')
          dogs      cats
dogs      1.00 -0.912871
cats -0.912871      1.00

In [4]: df.to_pandas().corr('kendall')
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas DataFrame is expected to be small.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
Out[4]:
          dogs      cats
dogs      1.00 -0.912871
cats -0.912871      1.00
```

### How was this patch tested?
Added UT.
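For reference, the `-0.912871` in the example above can be reproduced with a naive tie-corrected Kendall tau-b over all pairs. This is a from-scratch sketch of the statistic, not the pandas-on-Spark implementation:

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(xs, ys):
    """Kendall's tau-b (tie-corrected), computed naively over all O(n^2) pairs."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:      # tied in both coordinates
            ties_x += 1
            ties_y += 1
        elif dx == 0:                # tied in x only
            ties_x += 1
        elif dy == 0:                # tied in y only
            ties_y += 1
        elif dx * dy > 0:            # same ordering in x and y
            concordant += 1
        else:                        # opposite ordering
            discordant += 1
    n0 = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / sqrt((n0 - ties_x) * (n0 - ties_y))

dogs = [0.2, 0.0, 0.6, 0.2]
cats = [0.3, 0.6, 0.0, 0.1]
print(round(kendall_tau_b(dogs, cats), 6))  # -0.912871
```

Here there are 0 concordant pairs, 5 discordant pairs, and one tie in `dogs`, so tau-b is -5/sqrt(5*6) ≈ -0.912871, matching both the pyspark.pandas and the pandas output in the PR description.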
[GitHub] [spark] Yikun commented on a diff in pull request #37843: [SPARK-40398][CORE][SQL] Use Loop instead of Arrays.stream api
Yikun commented on code in PR #37843: URL: https://github.com/apache/spark/pull/37843#discussion_r972651836

## sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Expression.java:

@@ -44,7 +46,12 @@ public interface Expression {
    * List of fields or columns that are referenced by this expression.
    */
   default NamedReference[] references() {
-    return Arrays.stream(children()).map(e -> e.references())
-      .flatMap(Arrays::stream).distinct().toArray(NamedReference[]::new);
+    // SPARK-40398: Replace `Arrays.stream()...distinct()`
+    // with this for a perf gain; the result order is not important.
+    Set<NamedReference> set = new HashSet<>();
+    for (Expression e : children()) {
+      Collections.addAll(set, e.references());
+    }
+    return set.toArray(new NamedReference[0]);

Review Comment: https://github.com/apache/spark/commit/254bd80278843b3bc13584ca2f04391a770a78c7
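The trade-off in the snippet above is that the stream pipeline's `distinct()` preserves first-seen order while the `HashSet` loop does not; the two produce the same set of elements. A small Python analogue of the two shapes (assuming, as the code comment says, that callers don't rely on order):

```python
def references_stream_style(children):
    """Mimics Arrays.stream(...).flatMap(...).distinct(): dedups, keeps first-seen order."""
    seen, out = set(), []
    for child_refs in children:
        for ref in child_refs:
            if ref not in seen:
                seen.add(ref)
                out.append(ref)
    return out

def references_set_style(children):
    """Mimics the HashSet loop: same elements, single pass, no order guarantee."""
    refs = set()
    for child_refs in children:
        refs.update(child_refs)
    return refs

children = [["a", "b"], ["b", "c"], ["a"]]
print(references_stream_style(children))        # ['a', 'b', 'c']
print(references_set_style(children) == {"a", "b", "c"})  # True
```

The set-based version avoids the stream machinery and the ordered dedup bookkeeping, which is where the reported perf gain comes from.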
[GitHub] [spark] wangyum opened a new pull request, #37910: [SPARK-40469][CORE] Avoid creating directory failures
wangyum opened a new pull request, #37910: URL: https://github.com/apache/spark/pull/37910

### What changes were proposed in this pull request?
This PR replaces `Files.createDirectory` with `Files.createDirectories`.

### Why are the changes needed?
To avoid directory-creation failures when the parent directory has been removed by YARN:
```
java.nio.file.NoSuchFileException: /hadoop/3/yarn/local/usercache//appcache/application_1654776504115_37917/blockmgr-e18b484f-8c49-4c7d-b649-710439b0e4c3/3c
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
	at java.nio.file.Files.createDirectory(Files.java:674)
	at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:123)
	at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:146)
	at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:147)
	at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:853)
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:253)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:250)
	at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:245)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1383)
	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:245)
	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:109)
	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:132)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:487)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1417)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:490)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.
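The failure mode is easy to reproduce outside Spark. In Python terms the change is roughly `os.mkdir` → `os.makedirs(exist_ok=True)`: the recursive variant tolerates both a missing parent and a pre-existing target (a sketch of the principle, not the Spark code path):

```python
import os
import tempfile

base = tempfile.mkdtemp()
target = os.path.join(base, "blockmgr", "3c")  # parent "blockmgr" does not exist yet

# Single-level creation fails when the parent is missing,
# mirroring the Files.createDirectory NoSuchFileException in the stack trace.
try:
    os.mkdir(target)
    raise AssertionError("expected FileNotFoundError")
except FileNotFoundError:
    pass

# Recursive creation makes the missing parents and is idempotent,
# mirroring Files.createDirectories.
os.makedirs(target, exist_ok=True)
os.makedirs(target, exist_ok=True)  # second call is a no-op, not an error
print(os.path.isdir(target))  # True
```

`Files.createDirectories` behaves the same way on the JVM: it creates any nonexistent parents and does not throw if the directory already exists, so a parent removed out from under the block manager no longer aborts the task.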
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver
dongjoon-hyun commented on code in PR #37906: URL: https://github.com/apache/spark/pull/37906#discussion_r972669095

## dev/create-release/spark-rm/Dockerfile:

@@ -53,7 +53,7 @@ ARG GEM_PKGS="bundler:2.2.9"
 # the most current package versions (instead of potentially using old versions cached by docker).
 RUN apt-get clean && apt-get update && $APT_INSTALL gnupg ca-certificates && \
   echo 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/' >> /etc/apt/sources.list && \
-  gpg --keyserver keyserver.ubuntu.com --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9 && \
+  gpg --keyserver hkps://keyserver.ubuntu.com --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9 && \

Review Comment: Both images are based on `Ubuntu 20.04` (Focal Fossa). If we need to change it, we had better be consistent.
[GitHub] [spark] dongjoon-hyun commented on pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver
dongjoon-hyun commented on PR #37906: URL: https://github.com/apache/spark/pull/37906#issuecomment-1248991651

Oh, @wangyum . It seems that you made an accidental commit on the `master` branch.

- https://github.com/apache/spark/commit/694cac63da3bfa651132eca9fee3278544616dc3
[GitHub] [spark] dongjoon-hyun commented on pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver
dongjoon-hyun commented on PR #37906: URL: https://github.com/apache/spark/pull/37906#issuecomment-1249001715

No force-push, @wangyum . We already have another commit on top of yours.
[GitHub] [spark] dongjoon-hyun commented on pull request #37912: [SPARK-40196][PYTHON][PS][FOLLOWUP] Skip SparkFunctionsTests.test_repeat
dongjoon-hyun commented on PR #37912: URL: https://github.com/apache/spark/pull/37912#issuecomment-1249001096

+1 for the swift decision.
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`
HeartSaVioR commented on code in PR #37905: URL: https://github.com/apache/spark/pull/37905#discussion_r972722394

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:

@@ -590,7 +591,7 @@ class MicroBatchExecution(
     val newBatchesPlan = logicalPlan transform {
       // For v1 sources.
       case StreamingExecutionRelation(source, output, catalogTable) =>
-        newData.get(source).map { dataPlan =>
+        mutableNewData.get(source).map { dataPlan =>
           val hasFileMetadata = output.exists {

Review Comment: This requires the Source to indicate the request for metadata columns and produce the logical plan accordingly when getBatch is called. My understanding is that a DSv1 source does not have an interface to receive the information of which columns will be referred to in the actual query.
[GitHub] [spark] LuciferYang commented on a diff in pull request #37843: [SPARK-40398][CORE][SQL] Use Loop instead of Arrays.stream api
LuciferYang commented on code in PR #37843: URL: https://github.com/apache/spark/pull/37843#discussion_r972648962

## sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Expression.java:

@@ -44,7 +46,12 @@ public interface Expression {
    * List of fields or columns that are referenced by this expression.
    */
   default NamedReference[] references() {
-    return Arrays.stream(children()).map(e -> e.references())
-      .flatMap(Arrays::stream).distinct().toArray(NamedReference[]::new);
+    // SPARK-40398: Replace `Arrays.stream()...distinct()`
+    // to this for perf gain, the result order is not important.
+    Set<NamedReference> set = new HashSet<>();
+    for (Expression e : children()) {
+      Collections.addAll(set, e.references());
+    }
+    return set.toArray(new NamedReference[0]);

Review Comment: Change to Python linter check failed... https://github.com/LuciferYang/spark/actions/runs/3065275580/jobs/4949221030

```
starting mypy annotations test...
annotations failed mypy checks:
python/pyspark/pandas/window.py:112: error: Module has no attribute "lit"  [attr-defined]
Found 1 error in 1 file (checked 340 source files)
1
```
[GitHub] [spark] cloud-fan commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`
cloud-fan commented on code in PR #37905: URL: https://github.com/apache/spark/pull/37905#discussion_r972655128

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:

@@ -590,7 +591,7 @@ class MicroBatchExecution(
     val newBatchesPlan = logicalPlan transform {
       // For v1 sources.
       case StreamingExecutionRelation(source, output, catalogTable) =>
-        newData.get(source).map { dataPlan =>
+        mutableNewData.get(source).map { dataPlan =>
           val hasFileMetadata = output.exists {

Review Comment: Looking at the code, it seems the problem is that we resolve the metadata columns in every micro-batch. Shouldn't we only resolve them once?
[GitHub] [spark] wangyum commented on pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver
wangyum commented on PR #37906: URL: https://github.com/apache/spark/pull/37906#issuecomment-1249002542

OK
[GitHub] [spark] HyukjinKwon commented on pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver
HyukjinKwon commented on PR #37906: URL: https://github.com/apache/spark/pull/37906#issuecomment-1249002932

Let me just revert and merge this PR in (just for the sake of trackability).
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`
HeartSaVioR commented on code in PR #37905: URL: https://github.com/apache/spark/pull/37905#discussion_r972711729

## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:

@@ -590,7 +591,7 @@ class MicroBatchExecution(
     val newBatchesPlan = logicalPlan transform {
       // For v1 sources.
       case StreamingExecutionRelation(source, output, catalogTable) =>
-        newData.get(source).map { dataPlan =>
+        mutableNewData.get(source).map { dataPlan =>

Review Comment: While we are here, probably a less intrusive change would be moving (L594 ~ L610) to L567.
[GitHub] [spark] HyukjinKwon commented on pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark
HyukjinKwon commented on PR #37893: URL: https://github.com/apache/spark/pull/37893#issuecomment-1249226867 Will take a close look next Monday in KST.
[GitHub] [spark] HyukjinKwon commented on pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies
HyukjinKwon commented on PR #37710: URL: https://github.com/apache/spark/pull/37710#issuecomment-1249269885 I am thinking about merging this PR without major changes if no major issues are found, and I myself will take a look at the important/urgent items very soon according to the plan described above. I would like to be transparent here. The frank reasons for thinking this are as follows:
- Multiple people will work intensively together on this component, but each works in a different timezone, which makes it difficult to work within @martin-g's branch.
- It is difficult to manage credit. The whole body of work would be very large, and I would like to avoid sharing the same credit across all coauthors. A different person will sign off and be the author of each individual change.
- I would like to speed up by fully leveraging each individual fork's GitHub Actions resources. Currently, @martin-g's GitHub resources are a bottleneck.
Hope this plan and thinking make sense to the other committers too. Are you okay with this? @dongjoon-hyun @viirya @mridulm @srowen @wangyum @sunchao @huaxingao (derived from SPIP voting)
[GitHub] [spark] cloud-fan commented on pull request #37743: [SPARK-40294][SQL] Fix repeat calls to `PartitionReader.hasNext` timing out
cloud-fan commented on PR #37743: URL: https://github.com/apache/spark/pull/37743#issuecomment-1249181442 ping @richardc-db
[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax
EnricoMi commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r972869981 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -869,26 +873,55 @@ class Analyzer(override val catalogManager: CatalogManager) def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning( _.containsPattern(UNPIVOT), ruleId) { - // once children and ids are resolved, we can determine values, if non were given - case up: Unpivot if up.childrenResolved && up.ids.forall(_.resolved) && up.values.isEmpty => -up.copy(values = up.child.output.diff(up.ids)) + // once children are resolved, we can determine values from ids and vice versa + // if only either is given + case up: Unpivot if up.childrenResolved && +up.ids.exists(_.forall(_.resolved)) && up.values.isEmpty => +up.copy(values = + Some( +up.child.output.diff(up.ids.get.flatMap(_.references)) Review Comment: The Scala and Python APIs both require ids. The SQL API does not allow specifying ids, which matches other SQL databases.
[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax
EnricoMi commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r972871859 ## python/pyspark/sql/dataframe.py: ## @@ -3064,7 +3064,7 @@ def cube(self, *cols: "ColumnOrName") -> "GroupedData": # type: ignore[misc] def unpivot( self, -ids: Optional[Union["ColumnOrName", List["ColumnOrName"], Tuple["ColumnOrName", ...]]], +ids: Union["ColumnOrName", List["ColumnOrName"], Tuple["ColumnOrName", ...]], Review Comment: However, `pyspark.pandas.frame.melt` allows `None` for `ids`, with the meaning of `[]`, while `values` being `None` means magically take all non-id columns:
```python
def melt(
    self,
    id_vars: Optional[Union[Name, List[Name]]] = None,
    value_vars: Optional[Union[Name, List[Name]]] = None,
    var_name: Optional[Union[str, List[str]]] = None,
    value_name: str = "value",
) -> "DataFrame":
    """
    ...

    Parameters
    ----------
    frame : DataFrame
    id_vars : tuple, list, or ndarray, optional
        Column(s) to use as identifier variables.
    value_vars : tuple, list, or ndarray, optional
        Column(s) to unpivot. If not specified, uses all columns that
        are not set as `id_vars`.
    """
```
Should the Python API be consistent with the Scala API or the PySpark Pandas API?
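To make the two conventions under discussion concrete, here is a minimal pure-Python sketch (not Spark's or pandas' implementation; the function name and row/column representation are invented for illustration) of the melt semantics described above: `id_vars=None` behaves like `[]`, and `value_vars=None` means "all columns that are not id columns".

```python
# Pure-Python sketch of melt/unpivot semantics: rows are tuples, columns is the
# list of column names. id_vars=None acts like [], and value_vars=None defaults
# to every column not used as an id column.
def melt(rows, columns, id_vars=None, value_vars=None,
         var_name="variable", value_name="value"):
    id_vars = list(id_vars) if id_vars is not None else []
    if value_vars is None:
        # default: all non-id columns become value columns
        value_vars = [c for c in columns if c not in id_vars]
    out = []
    for row in rows:
        record = dict(zip(columns, row))
        for v in value_vars:
            # one output row per (input row, value column) pair
            out.append(tuple(record[c] for c in id_vars) + (v, record[v]))
    return [*id_vars, var_name, value_name], out
```

For example, `melt([(1, 10, 20)], ["id", "a", "b"], id_vars=["id"])` yields the header `["id", "variable", "value"]` and rows `[(1, "a", 10), (1, "b", 20)]`, while omitting `id_vars` entirely unpivots every column.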
[GitHub] [spark] eejbyfeldt commented on a diff in pull request #37837: [SPARK-40385][SQL] Fix interpreted path for companion object constructor
eejbyfeldt commented on code in PR #37837: URL: https://github.com/apache/spark/pull/37837#discussion_r971579395 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ObjectExpressionsSuite.scala: ## @@ -423,7 +423,7 @@ class ObjectExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper { inputTypes = Nil, propagateNull = false, dataType = ObjectType(classOf[ScroogeLikeExample]), - outerPointer = Some(() => outerObj)) + outerPointer = None) Review Comment: So normal inner classes (that use an outer pointer) are tested here in ExpressionEncoderSuite: https://github.com/apache/spark/blob/ae08787f5c50e485ef4432a0c2da8b3b7290d725/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala#L229-L233 and here in ObjectExpressionsSuite: https://github.com/apache/spark/blob/ae08787f5c50e485ef4432a0c2da8b3b7290d725/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ObjectExpressionsSuite.scala#L408-L417 so such cases are covered. Are you saying that we should add a test case for a class that only has a companion `apply` constructor? Trying to do something like that will not work on either this branch or master, because companion objects of inner classes are not singletons and codegen will fail with `"MODULE$" is neither a method, a field` because of this. Such a class would also behave slightly differently, as the apply-method constructor would not take an outerPointer: the companion object already has an outer pointer, and that will be used when creating the inner class object. Maybe it would be possible to add support for such cases, but it would require more changes and is probably out of scope for this PR. But just to be clear: *both* the test and the code were wrong before this PR, and they were wrong in such a way that they cancelled each other out.
And the new spec in ExpressionEncoderSuite in this PR, which tests at a "higher level", also shows that the previous code was wrong, as that test case fails on master.
[GitHub] [spark] HyukjinKwon commented on pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies
HyukjinKwon commented on PR #37710: URL: https://github.com/apache/spark/pull/37710#issuecomment-1249261232 This is ready for a look now. Since the whole feature and code base would be very large, we (explicitly I, @martin-g, @amaliujia, and @cloud-fan) discussed offline and decided to propose splitting this. This PR is basically the minimal working version; note that most of the code lines here were generated from the protobuf. SPARK-39375 is the parent JIRA, and we described the current action items there. More JIRAs will be filed according to the plan below:
### High-level plan and design:
- [High-Level Design Doc for Spark Connect](https://docs.google.com/document/d/17X6-P5H2522SnE-gF1BVwyildp_PDX8oXD-4l9vqQmA/edit?usp=sharing)
- [Spark Connect API Testing Plan](https://docs.google.com/document/d/1n6EgS5vcmbwJUs5KGX4PzjKZVcSKd0qf0gLNZ6NFvOE/edit?usp=sharing)
### Low-level plan:
**Short-term**
- Extend test coverage for SparkConnectPlanner (right now at 76% line coverage)
- Extend test coverage for the Spark Connect Python client
- Type annotations for the Spark Connect Python client to re-enable mypy
- Clean up documentation in PySpark code for Spark Connect
- Documentation for PySpark in README and doctests
- Proto validation in server and/or client
  - Validation:
    - Syntactic -> Parsing
    - Semantic -> Analysis
  - Alternatively, only return an error class to clients upon failures.
- Initial DSL framework for protobuf testing
- Restructure the build to match other components
  - Maven
  - SBT
**Long-term**
- Testing with custom DSL
- `LocalRelation`
- Better error handling for semantic failures
- Spark and Session configurations
- Scala client
- SBT incremental build and testing environment
- DataSources
- UDFs
- Packaging / Releasing
[GitHub] [spark] LuciferYang opened a new pull request, #37914: [SPARK-40471][BUILD] Upgrade RoaringBitmap to 0.9.32
LuciferYang opened a new pull request, #37914: URL: https://github.com/apache/spark/pull/37914 ### What changes were proposed in this pull request? This PR aims to upgrade RoaringBitmap to 0.9.32. ### Why are the changes needed? This is a bug fix version: - https://github.com/RoaringBitmap/RoaringBitmap/issues/575 - https://github.com/RoaringBitmap/RoaringBitmap/pull/578 Other changes are as follows: - https://github.com/RoaringBitmap/RoaringBitmap/compare/0.9.31...0.9.32 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions
[GitHub] [spark] LuciferYang commented on pull request #37914: [SPARK-40471][BUILD] Upgrade RoaringBitmap to 0.9.32
LuciferYang commented on PR #37914: URL: https://github.com/apache/spark/pull/37914#issuecomment-1249044491 Will check the MapStatusesConvertBenchmark result later.
[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r972808194 ## python/pyspark/sql/dataframe.py: ## @@ -3091,12 +3098,12 @@ def unpivot( Parameters -- -ids : str, Column, tuple, list, optional +ids : str, Column, tuple, list Column(s) to use as identifiers. Can be a single column or column name, or a list or tuple for multiple columns. values : str, Column, tuple, list, optional Column(s) to unpivot. Can be a single column or column name, or a list or tuple -for multiple columns. If not specified or empty, uses all columns that +for multiple columns. Must not be empty. If None, uses all columns that Review Comment: ```suggestion for multiple columns. If specified, must not be empty. If not specified, uses all columns that ```
[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r972833735 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -869,26 +873,55 @@ class Analyzer(override val catalogManager: CatalogManager) def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning( _.containsPattern(UNPIVOT), ruleId) { - // once children and ids are resolved, we can determine values, if non were given - case up: Unpivot if up.childrenResolved && up.ids.forall(_.resolved) && up.values.isEmpty => -up.copy(values = up.child.output.diff(up.ids)) + // once children are resolved, we can determine values from ids and vice versa + // if only either is given + case up: Unpivot if up.childrenResolved && +up.ids.exists(_.forall(_.resolved)) && up.values.isEmpty => +up.copy(values = + Some( +up.child.output.diff(up.ids.get.flatMap(_.references)) Review Comment: Oh I see, it's to match the behavior when id columns are not specified. It's better to make them consistent, but it's also important to match both the Pandas behavior and the SQL behavior.
[GitHub] [spark] HeartSaVioR closed pull request #37907: [SPARK-40467][SS] Split FlatMapGroupsWithState down to multiple test suites
HeartSaVioR closed pull request #37907: [SPARK-40467][SS] Split FlatMapGroupsWithState down to multiple test suites URL: https://github.com/apache/spark/pull/37907
[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax
EnricoMi commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r972844049 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -869,26 +873,55 @@ class Analyzer(override val catalogManager: CatalogManager) def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning( _.containsPattern(UNPIVOT), ruleId) { - // once children and ids are resolved, we can determine values, if non were given - case up: Unpivot if up.childrenResolved && up.ids.forall(_.resolved) && up.values.isEmpty => -up.copy(values = up.child.output.diff(up.ids)) + // once children are resolved, we can determine values from ids and vice versa + // if only either is given + case up: Unpivot if up.childrenResolved && +up.ids.exists(_.forall(_.resolved)) && up.values.isEmpty => +up.copy(values = + Some( +up.child.output.diff(up.ids.get.flatMap(_.references)) Review Comment: Yes, this implements `.unpivot(Array, String, String)` when no values are given, similar to when no ids are given in SQL. This is the implementation of
```
 * Note: A column that is referenced by an id column expression is considered an id column itself.
 * For instance `$"id" * 2` references column `id`, so `id` is considered an id column and not a
 * value column
```
I thought the following would be annoying:
```
scala> val df = spark.range(5).select($"id", ($"id"*10).as("val"))
scala> df.show()
+---+---+
| id|val|
+---+---+
|  0|  0|
|  1| 10|
|  2| 20|
|  3| 30|
|  4| 40|
+---+---+

df.unpivot(Array($"id" * 2), "col", "val").show()
+--------+---+---+
|(id * 2)|col|val|
+--------+---+---+
|       0| id|  0|
|       0|val|  0|
|       2| id|  1|
|       2|val| 10|
|       4| id|  2|
|       4|val| 20|
|       6| id|  3|
|       6|val| 30|
|       8| id|  4|
|       8|val| 40|
+--------+---+---+
```
Of course, that id manipulation can be done before `unpivot` to "materialize" it as a reference:
```
df.withColumn("id", $"id" * 2).unpivot(Array($"id"), "col", "val").show()
+---+---+---+
| id|col|val|
+---+---+---+
|  0|val|  0|
|  2|val| 10|
|  4|val| 20|
|  6|val| 30|
|  8|val| 40|
+---+---+---+
```
Happy to remove that complexity.
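The rule being discussed above reduces to a small set computation: default value columns are all child output columns minus every column *referenced* by an id expression. A pure-Python sketch (the function name and the name-based representation are invented for illustration; Spark works on resolved expressions, not strings) of that logic:

```python
# Sketch of "values = child.output diff references(ids)": columns referenced by
# any id expression (e.g. `id` inside `id * 2`) are treated as id columns and
# excluded from the default value columns.
def infer_value_columns(child_output, id_expr_references):
    referenced = set()
    for refs in id_expr_references:  # referenced columns of each id expression
        referenced.update(refs)
    return [c for c in child_output if c not in referenced]
```

So with child output `["id", "val"]` and a single id expression `id * 2` referencing `id`, the inferred value columns are just `["val"]`, matching the `$"id" * 2` example above.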
[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r972860753 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala: ## @@ -1374,32 +1374,104 @@ case class Pivot( override protected def withNewChildInternal(newChild: LogicalPlan): Pivot = copy(child = newChild) } +/** + * Expression for [[Unpivot]] for one unpivot value column (one or more expressions) + * and an optional alias. This node itself is not evaluable and resolvable. + * Only its children are to be resolved. + * + * @param exprs expressions to unpivot + * @param alias optional alias + */ +case class UnpivotExpr(exprs: Seq[NamedExpression], alias: Option[String]) extends Unevaluable { + override val children: Seq[NamedExpression] = exprs + override def dataType: DataType = throw new UnresolvedException("dataType") + override def nullable: Boolean = throw new UnresolvedException("nullable") + // override lazy val resolved = false + + override protected def withNewChildrenInternal( + newChildren: IndexedSeq[Expression]): Expression = { +// turn expressions into named expressions +copy(exprs = newChildren.map { + case ne: NamedExpression => ne + case e: Expression => UnresolvedAlias(e) +}) + } +} + /** * A constructor for creating an Unpivot, which will later be converted to an [[Expand]] * during the query analysis. * - * An empty values array will be replaced during analysis with all resolved outputs of child except + * Either ids or values array must be set. The ids array can be empty, + * the values array must not be empty if not None. + * + * A None ids array will be replaced during analysis with all resolved outputs of child except + * the values. This expansion allows to easily select all non-value columns as id columns. + * + * A None values array will be replaced during analysis with all resolved outputs of child except * the ids. This expansion allows to easily unpivot all non-id columns. 
* * @see `org.apache.spark.sql.catalyst.analysis.Analyzer.ResolveUnpivot` * - * The type of the value column is derived from all value columns during analysis once all values - * are resolved. All values' types have to be compatible, otherwise the result value column cannot - * be assigned the individual values and an AnalysisException is thrown. + * Multiple columns can be unpivoted in one row by providing multiple value column names + * and the same number of unpivot value expressions: + * {{{ + * // one-dimensional value columns + * Unpivot( + * Some(Seq("id")), + * Some(Seq( + * (Seq("val1"), None), + * (Seq("val2"), None) + * )), + * "var", + * Seq("val") + * ) + * + * // two-dimensional value columns + * Unpivot( + * Some(Seq("id")), + * Some(Seq( + * (Seq("val1.1", "val1.2"), None), + * (Seq("val2.1", "val2.2"), None) + * )), + * "var", + * Seq("val1", "val2") + * ) + * }}} + * + * The variable column will contain the name of the unpivot value while the value columns contain + * the unpivot values. Multi-dimensional unpivot values can be given `aliases`: + * }}} + * // two-dimensional value columns with aliases + * Unpivot( + * Some(Seq("id")), + * Some(Seq( + * (Seq("val1.1", "val1.2"), Some("val1")), + * (Seq("val2.1", "val2.2"), Some("val2")) + * )), + * "var", + * Seq("val1", "val2") + * ) + * }}} + * + * All "value" columns must share a least common data type. Unless they are the same data type, + * all "value" columns are cast to the nearest common data type. For instance, + * types `IntegerType` and `LongType` are cast to `LongType`, while `IntegerType` and `StringType` + * do not have a common data type and `unpivot` fails with an `AnalysisException`. 
* * @see `org.apache.spark.sql.catalyst.analysis.TypeCoercionBase.UnpivotCoercion` * * @param ids Id columns - * @param values Value columns to unpivot + * @param values Value column sets to unpivot with optional aliases * @param variableColumnName Name of the variable column - * @param valueColumnName Name of the value column + * @param valueColumnNames Names of the value columns * @param child Child operator */ case class Unpivot( -ids: Seq[NamedExpression], -values: Seq[NamedExpression], +ids: Option[Seq[NamedExpression]], +values: Option[Seq[UnpivotExpr]], variableColumnName: String, -valueColumnName: String, +valueColumnNames: Seq[String], child: LogicalPlan) extends UnaryNode { override lazy val resolved = false // Unpivot will be replaced after being resolved. Review Comment: let's add `assert(ids.isDefined || values.isDefined)`
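The multi-dimensional case in the scaladoc above can be made concrete with a pure-Python sketch (assumed semantics, not Spark's implementation; names and the dict-based record representation are invented): each value-column *set* contributes one output row carrying `len(valueColumnNames)` values, and the variable column holds the set's alias when one is given.

```python
# Sketch of two-dimensional unpivot: value_sets is a list of (columns, alias)
# pairs; every set must have as many columns as there are value column names.
def unpivot(record, ids, value_sets, variable_column, value_columns):
    header = [*ids, variable_column, *value_columns]
    rows = []
    for cols, alias in value_sets:
        assert len(cols) == len(value_columns)
        # the variable column shows the alias, falling back to the joined names
        var = alias if alias is not None else "_".join(cols)
        rows.append(tuple(record[c] for c in ids)
                    + (var,)
                    + tuple(record[c] for c in cols))
    return header, rows
```

With one id column, two value sets aliased `val1`/`val2`, and value column names `["val1", "val2"]`, this produces two output rows per input record, mirroring the "two-dimensional value columns with aliases" example in the doc comment.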
[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax
cloud-fan commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r972862941 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -869,26 +873,55 @@ class Analyzer(override val catalogManager: CatalogManager) def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsWithPruning( _.containsPattern(UNPIVOT), ruleId) { - // once children and ids are resolved, we can determine values, if non were given - case up: Unpivot if up.childrenResolved && up.ids.forall(_.resolved) && up.values.isEmpty => -up.copy(values = up.child.output.diff(up.ids)) + // once children are resolved, we can determine values from ids and vice versa + // if only either is given + case up: Unpivot if up.childrenResolved && +up.ids.exists(_.forall(_.resolved)) && up.values.isEmpty => +up.copy(values = + Some( +up.child.output.diff(up.ids.get.flatMap(_.references)) Review Comment: For DataFrame API, we'd better follow pandas even if it's annoying... SQL API is a different story. Do other databases have the same behavior as this PR does when id columns are not specified?
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`
HeartSaVioR commented on code in PR #37905: URL: https://github.com/apache/spark/pull/37905#discussion_r972711729 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala: ## @@ -590,7 +591,7 @@ class MicroBatchExecution( val newBatchesPlan = logicalPlan transform { // For v1 sources. case StreamingExecutionRelation(source, output, catalogTable) => -newData.get(source).map { dataPlan => +mutableNewData.get(source).map { dataPlan => Review Comment: While we are here, a less intrusive change would probably be to move (L594 ~ L610) to L567. After that change we wouldn't need to modify newData here.
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`
HeartSaVioR commented on code in PR #37905: URL: https://github.com/apache/spark/pull/37905#discussion_r972722394 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala: ## @@ -590,7 +591,7 @@ class MicroBatchExecution( val newBatchesPlan = logicalPlan transform { // For v1 sources. case StreamingExecutionRelation(source, output, catalogTable) => -newData.get(source).map { dataPlan => +mutableNewData.get(source).map { dataPlan => val hasFileMetadata = output.exists { Review Comment: It would require the Source to indicate the request for the metadata column and produce the logical plan accordingly when getBatch is called. My understanding is that a DSv1 source does not have an interface to learn which columns will be referenced in the actual query.
[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax
EnricoMi commented on code in PR #37407: URL: https://github.com/apache/spark/pull/37407#discussion_r972790708 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala: ## @@ -208,6 +208,7 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] } def recursiveTransform(arg: Any): AnyRef = arg match { + case ue: UnpivotExpr => ue.withNewChildren(ue.exprs.map(transformExpression)) Review Comment: This is very special treatment in a very generic method. Not nice, but it is the best way I could find to make `mapExpressions` transform `UnpivotExpr.exprs` as a top-level expression. See the comment on top-level expressions in the PR discussion.