[GitHub] [spark] LuciferYang commented on a diff in pull request #37721: [SPARK-40272][CORE]Support service port custom with range

2022-09-16 Thread GitBox


LuciferYang commented on code in PR #37721:
URL: https://github.com/apache/spark/pull/37721#discussion_r973041128


##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -2429,4 +2429,18 @@ package object config {
   .version("3.4.0")
   .timeConf(TimeUnit.MILLISECONDS)
   .createWithDefaultString("5s")
+
+  private[spark] val CUSTOM_SERVICE_PORT_ORIGIN =

Review Comment:
   > If it needs to be limited to (4, 5], the range can be controlled by the 
conf of spark.port.maxRetries.
   
   I'm not sure whether using `spark.port.maxRetries` to control the port range 
is a good idea; it looks more like a trick.
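   
   For reference, a rough sketch of how `spark.port.maxRetries` bounds the 
effective range today (simplified from the retry loop in 
`org.apache.spark.util.Utils.startServiceOnPort`; the real code also handles 
port 0 and wrap-around):
   
   ```scala
   def candidatePorts(startPort: Int, maxRetries: Int): Seq[Int] =
     (0 to maxRetries).map(offset => startPort + offset).filter(_ <= 65535)

   candidatePorts(38000, 16)  // 38000, 38001, ..., 38016
   ```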
   






[GitHub] [spark] AmplabJenkins commented on pull request #37909: [SPARK-40468][SQL] Fix column pruning in CSV when _corrupt_record is selected

2022-09-16 Thread GitBox


AmplabJenkins commented on PR #37909:
URL: https://github.com/apache/spark/pull/37909#issuecomment-1249391483

   Can one of the admins verify this patch?





[GitHub] [spark] MaxGekk commented on pull request #37909: [SPARK-40468][SQL] Fix column pruning in CSV when _corrupt_record is selected

2022-09-16 Thread GitBox


MaxGekk commented on PR #37909:
URL: https://github.com/apache/spark/pull/37909#issuecomment-1249401934

   @sadikovi Thanks for the ping. I will look at it soon.





[GitHub] [spark] LuciferYang commented on pull request #37862: [MINOR][SQL] Remove an unnecessary parameter of the PartitionedFileUtil.splitFiles

2022-09-16 Thread GitBox


LuciferYang commented on PR #37862:
URL: https://github.com/apache/spark/pull/37862#issuecomment-1249419779

   > Seems OK. There's no reason to expect external code would call this 
method, right?
   
   Although this is not a public API, it is still used by third-party projects 
based on Spark, for example:
   
   - [NVIDIA/spark-rapids](https://github.com/NVIDIA/spark-rapids)
   
   
https://github.com/NVIDIA/spark-rapids/blob/fb86a1a8042f241b31d29f2e48ef73820be734d7/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileSourceScanExec.scala#L479-L485
   
   - [gluten](https://github.com/oap-project/gluten)
   
   
https://github.com/oap-project/gluten/blob/2e0f16bbdbba4edd70846123147e24d0b90ce833/jvm/src/main/scala/io/glutenproject/utils/InputPartitionsUtil.scala#L43-L51
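   
   For context, a self-contained sketch of why removing a parameter breaks such 
callers (`PartitionedFileUtilLike` is a stand-in for the Spark-internal helper; 
the real signature differs):
   
   ```scala
   object PartitionedFileUtilLike {
     // current shape: `filePath` is derivable from `file`, hence "unnecessary"
     def splitFiles(file: String, filePath: String, maxSplitBytes: Long): Seq[String] =
       Seq(s"$filePath:0+$maxSplitBytes")
   }

   // A third-party caller compiled against this shape fails to compile the
   // moment the `filePath` parameter is removed, so even non-public helpers
   // carry de-facto compatibility weight.
   val splits = PartitionedFileUtilLike.splitFiles("f.parquet", "/data/f.parquet", 128L << 20)
   ```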





[GitHub] [spark] cloud-fan commented on pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database

2022-09-16 Thread GitBox


cloud-fan commented on PR #37679:
URL: https://github.com/apache/spark/pull/37679#issuecomment-1249421287

   > override val defaultNamespace: Array[String] = 
Array(SQLConf.get.defaultDatabase)
   
   Yes





[GitHub] [spark] cloud-fan commented on a diff in pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database

2022-09-16 Thread GitBox


cloud-fan commented on code in PR #37679:
URL: https://github.com/apache/spark/pull/37679#discussion_r973074675


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:
##
@@ -286,7 +284,7 @@ class SessionCatalog(
 
   def dropDatabase(db: String, ignoreIfNotExists: Boolean, cascade: Boolean): 
Unit = {
 val dbName = format(db)
-if (dbName == DEFAULT_DATABASE) {
+if (dbName == defaultDatabase) {

Review Comment:
   This is different. Let's say the current database is `abc` and the default 
database is `xyz`. Can we drop `xyz`?
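   
   A concrete sketch of the question, runnable in a Spark shell (the database 
names are illustrative):
   
   ```scala
   // default database (per this PR's new conf) is `xyz`; current database is `abc`
   spark.sql("USE abc")
   spark.sql("DROP DATABASE xyz")  // hits the `dbName == defaultDatabase` check
                                   // above even though xyz is not current
   ```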






[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-16 Thread GitBox


cloud-fan commented on code in PR #37407:
URL: https://github.com/apache/spark/pull/37407#discussion_r973076852


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -869,26 +873,55 @@ class Analyzer(override val catalogManager: 
CatalogManager)
 def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsWithPruning(
   _.containsPattern(UNPIVOT), ruleId) {
 
-  // once children and ids are resolved, we can determine values, if non 
were given
-  case up: Unpivot if up.childrenResolved && up.ids.forall(_.resolved) && 
up.values.isEmpty =>
-up.copy(values = up.child.output.diff(up.ids))
+  // once children are resolved, we can determine values from ids and vice 
versa
+  // if only either is given
+  case up: Unpivot if up.childrenResolved &&
+up.ids.exists(_.forall(_.resolved)) && up.values.isEmpty =>
+up.copy(values =
+  Some(
+up.child.output.diff(up.ids.get.flatMap(_.references))

Review Comment:
   I mean, how do other databases infer the id columns? It's `allOutput -- 
valueColumns` or `allOutput -- valueColumns.flatMap(_.references)`?
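   
   To make the two candidates concrete, a toy sketch with plain strings standing 
in for resolved expressions (illustrative only):
   
   ```scala
   // schema: id, a, b; a single UNPIVOT value expression `a + b`
   val allOutput    = Seq("id", "a", "b")
   val valueColumns = Seq("a + b")   // the expression itself
   val references   = Seq("a", "b")  // the columns it references

   allOutput.diff(valueColumns)  // Seq("id", "a", "b") -- the expression is not a column
   allOutput.diff(references)    // Seq("id")
   ```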






[GitHub] [spark] srowen commented on pull request #37862: [MINOR][SQL] Remove an unnecessary parameter of the PartitionedFileUtil.splitFiles

2022-09-16 Thread GitBox


srowen commented on PR #37862:
URL: https://github.com/apache/spark/pull/37862#issuecomment-1249454630

   OK, let's leave it if there's any doubt - just not worth messing with 
libraries that use even non-public APIs.





[GitHub] [spark] viirya commented on pull request #37879: [SPARK-40425][SQL] DROP TABLE does not need to do table lookup

2022-09-16 Thread GitBox


viirya commented on PR #37879:
URL: https://github.com/apache/spark/pull/37879#issuecomment-1249565075

   One PySpark error, although it looks like a real failure, seems unrelated?
   
   ```
Traceback (most recent call last):
 File 
"/__w/spark/spark/python/pyspark/pandas/tests/test_spark_functions.py", line 
28, in test_repeat
   self.assertTrue(spark_column_equals(SF.repeat(F.lit(1), 2), 
F.repeat(F.lit(1), 2)))
   AssertionError: False is not true
   
   ```





[GitHub] [spark] sunchao commented on pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema

2022-09-16 Thread GitBox


sunchao commented on PR #37881:
URL: https://github.com/apache/spark/pull/37881#issuecomment-1249640137

   Thanks! merged to master/branch-3.3 (test failure unrelated).





[GitHub] [spark] sunchao closed pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema

2022-09-16 Thread GitBox


sunchao closed pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet 
filters with no reference to data schema
URL: https://github.com/apache/spark/pull/37881





[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`

2022-09-16 Thread GitBox


HeartSaVioR commented on code in PR #37905:
URL: https://github.com/apache/spark/pull/37905#discussion_r973345153


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:
##
@@ -590,7 +591,7 @@ class MicroBatchExecution(
 val newBatchesPlan = logicalPlan transform {
   // For v1 sources.
   case StreamingExecutionRelation(source, output, catalogTable) =>
-newData.get(source).map { dataPlan =>
+mutableNewData.get(source).map { dataPlan =>

Review Comment:
   Maybe we want to check the case of self-union / self-join to verify we 
really didn't break things. This works only when the condition `leaf : 
source = 1 : 1` holds (otherwise we are overwriting the value in the map), 
while the code comment of ProgressReporter tells us there are counter cases.






[GitHub] [spark] AmplabJenkins commented on pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`

2022-09-16 Thread GitBox


AmplabJenkins commented on PR #37905:
URL: https://github.com/apache/spark/pull/37905#issuecomment-1249515682

   Can one of the admins verify this patch?





[GitHub] [spark] huanliwang-db opened a new pull request, #37917: [WIP][SPARK-40466][SS] Improve the error message when DSv2 is disabled whi…

2022-09-16 Thread GitBox


huanliwang-db opened a new pull request, #37917:
URL: https://github.com/apache/spark/pull/37917

   …le DSv1 is not available.
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] MaxGekk commented on a diff in pull request #37916: [SPARK-40473][SQL] Migrate parsing errors onto error classes

2022-09-16 Thread GitBox


MaxGekk commented on code in PR #37916:
URL: https://github.com/apache/spark/pull/37916#discussion_r973207994


##
sql/core/src/test/resources/sql-tests/results/comments.sql.out:
##
@@ -132,20 +132,9 @@ select 1 as a
 struct<>
 -- !query output
 org.apache.spark.sql.catalyst.parser.ParseException
-
-Unclosed bracketed comment(line 3, pos 0)
-
-== SQL ==
-/*abc*/
-select 1 as a
-/*
-^^^
-
-2 as b
-/*abc*/
-, 3 as c
-
-/**/
+{
+  "errorClass" : "_LEGACY_ERROR_TEMP_055"

Review Comment:
   Just in case, users will see the same error message by default (not JSON).






[GitHub] [spark] MaxGekk commented on a diff in pull request #37916: [SPARK-40473][SQL] Migrate parsing errors onto error classes

2022-09-16 Thread GitBox


MaxGekk commented on code in PR #37916:
URL: https://github.com/apache/spark/pull/37916#discussion_r973207413


##
sql/core/src/test/resources/sql-tests/results/comments.sql.out:
##
@@ -132,20 +132,9 @@ select 1 as a
 struct<>
 -- !query output
 org.apache.spark.sql.catalyst.parser.ParseException
-
-Unclosed bracketed comment(line 3, pos 0)
-
-== SQL ==
-/*abc*/
-select 1 as a
-/*
-^^^
-
-2 as b
-/*abc*/
-, 3 as c
-
-/**/
+{
+  "errorClass" : "_LEGACY_ERROR_TEMP_055"

Review Comment:
   Yes. We don't provide context in many places. This is one of them.






[GitHub] [spark] dongjoon-hyun commented on pull request #37881: [SPARK-40169][SQL] Don't pushdown Parquet filters with no reference to data schema

2022-09-16 Thread GitBox


dongjoon-hyun commented on PR #37881:
URL: https://github.com/apache/spark/pull/37881#issuecomment-1249698632

   Thank you, @sunchao and all!





[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`

2022-09-16 Thread GitBox


HeartSaVioR commented on code in PR #37905:
URL: https://github.com/apache/spark/pull/37905#discussion_r973335538


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:
##
@@ -590,7 +591,7 @@ class MicroBatchExecution(
 val newBatchesPlan = logicalPlan transform {
   // For v1 sources.
   case StreamingExecutionRelation(source, output, catalogTable) =>
-newData.get(source).map { dataPlan =>
+mutableNewData.get(source).map { dataPlan =>

Review Comment:
   Yeah, you're right. I missed that.
   
   Btw, it looks like my change (tagging catalogTable into LogicalRelation) will 
also run into this bug. Thanks for fixing this.






[GitHub] [spark] Yaohua628 commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`

2022-09-16 Thread GitBox


Yaohua628 commented on code in PR #37905:
URL: https://github.com/apache/spark/pull/37905#discussion_r973236942


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:
##
@@ -590,7 +591,7 @@ class MicroBatchExecution(
 val newBatchesPlan = logicalPlan transform {
   // For v1 sources.
   case StreamingExecutionRelation(source, output, catalogTable) =>
-newData.get(source).map { dataPlan =>
+mutableNewData.get(source).map { dataPlan =>

Review Comment:
   Thanks, I initially thought about that, but we need to know the `output` 
from `StreamingExecutionRelation(source, output, catalogTable)` to resolve 
`_metadata`, right (L591 ~ L593)?






[GitHub] [spark] huanliwang-db commented on pull request #37917: [SPARK-40466][SS] Improve the error message when DSv2 is disabled whi…

2022-09-16 Thread GitBox


huanliwang-db commented on PR #37917:
URL: https://github.com/apache/spark/pull/37917#issuecomment-1249708310

   @HeartSaVioR Please review this change





[GitHub] [spark] MaxGekk commented on a diff in pull request #37916: [SPARK-40473][SQL] Migrate parsing errors onto error classes

2022-09-16 Thread GitBox


MaxGekk commented on code in PR #37916:
URL: https://github.com/apache/spark/pull/37916#discussion_r973207413


##
sql/core/src/test/resources/sql-tests/results/comments.sql.out:
##
@@ -132,20 +132,9 @@ select 1 as a
 struct<>
 -- !query output
 org.apache.spark.sql.catalyst.parser.ParseException
-
-Unclosed bracketed comment(line 3, pos 0)
-
-== SQL ==
-/*abc*/
-select 1 as a
-/*
-^^^
-
-2 as b
-/*abc*/
-, 3 as c
-
-/**/
+{
+  "errorClass" : "_LEGACY_ERROR_TEMP_055"

Review Comment:
   Yes. We don't provide context in many places (I mean QueryContext, not 
line/pos). This is one of them.






[GitHub] [spark] dtenedor commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework

2022-09-16 Thread GitBox


dtenedor commented on code in PR #37840:
URL: https://github.com/apache/spark/pull/37840#discussion_r973305583


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/package.scala:
##
@@ -45,18 +51,101 @@ package object analysis {
   throw new AnalysisException(msg, t.origin.line, t.origin.startPosition)
 }
 
-/** Fails the analysis at the point where a specific tree node was parsed. 
*/
+/** Fails the analysis at the point where a specific tree node was parsed 
with a given cause. */
 def failAnalysis(msg: String, cause: Throwable): Nothing = {
   throw new AnalysisException(msg, t.origin.line, t.origin.startPosition, 
cause = Some(cause))
 }
 
+/**
+ * Fails the analysis at the point where a specific tree node was parsed 
using a provided
+ * error class and message parameters.
+ */
 def failAnalysis(errorClass: String, messageParameters: Map[String, 
String]): Nothing = {
   throw new AnalysisException(
 errorClass = errorClass,
 messageParameters = messageParameters,
 origin = t.origin)
 }
 
+/**
+ * Fails the analysis at the point where a specific tree node was parsed 
using a provided
+ * error class and subclass and message parameters.
+ */
+def failAnalysis(
+errorClass: String,
+errorSubClass: String,
+messageParameters: Map[String, String] = Map.empty[String, String]): 
Nothing = {
+  throw new AnalysisException(
+errorClass = errorClass,
+errorSubClass = errorSubClass,
+messageParameters = messageParameters,
+origin = t.origin)
+}
+
+/**
+ * Fails the analysis at the point where a specific tree node was parsed 
using a provided
+ * error class and subclass and one message parameter comprising a plan 
string. The plan string
+ * will be printed in the error message if and only if the corresponding 
Spark configuration is
+ * enabled.
+ */
+def failAnalysis(
+errorClass: String,
+errorSubClass: String,
+treeNodes: Seq[TreeNode[_]]): Nothing = {
+  // Normalize expression IDs in the query plan to keep tests 
deterministic.

Review Comment:
   Hi @gengliangwang, I looked at this some more, and I was able to reuse the 
existing plan `canonicalize` method after all, by using 
`AnalysisHelper.allowInvokingTransformsInAnalyzer` in one specific place 
instead. Now there is no need to introduce new code for this purpose. Please 
take another look.






[GitHub] [spark] AmplabJenkins commented on pull request #37899: [SPARK-40455][CORE]Abort result stage directly when it failed caused by FetchFailedException

2022-09-16 Thread GitBox


AmplabJenkins commented on PR #37899:
URL: https://github.com/apache/spark/pull/37899#issuecomment-1249717441

   Can one of the admins verify this patch?





[GitHub] [spark] MaxGekk commented on pull request #37916: [SPARK-40473][SQL] Migrate parsing errors onto error classes

2022-09-16 Thread GitBox


MaxGekk commented on PR #37916:
URL: https://github.com/apache/spark/pull/37916#issuecomment-1249473715

   cc @srielau @anchovYu Could you take a look at the PR, please?





[GitHub] [spark] srielau commented on a diff in pull request #37916: [SPARK-40473][SQL] Migrate parsing errors onto error classes

2022-09-16 Thread GitBox


srielau commented on code in PR #37916:
URL: https://github.com/apache/spark/pull/37916#discussion_r973178090


##
sql/core/src/test/resources/sql-tests/results/comments.sql.out:
##
@@ -132,20 +132,9 @@ select 1 as a
 struct<>
 -- !query output
 org.apache.spark.sql.catalyst.parser.ParseException
-
-Unclosed bracketed comment(line 3, pos 0)
-
-== SQL ==
-/*abc*/
-select 1 as a
-/*
-^^^
-
-2 as b
-/*abc*/
-, 3 as c
-
-/**/
+{
+  "errorClass" : "_LEGACY_ERROR_TEMP_055"

Review Comment:
   Did we lose the context?






[GitHub] [spark] gengliangwang commented on a diff in pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework

2022-09-16 Thread GitBox


gengliangwang commented on code in PR #37840:
URL: https://github.com/apache/spark/pull/37840#discussion_r973308641


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##
@@ -730,6 +729,13 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog {
 }
   }
 
+  private def canonicalizeForError(expr: LogicalPlan): String =
+if (SQLConf.get.getConf(SQLConf.CANONICALIZE_PLANS_IN_ERRORS)) {

Review Comment:
   Actually we don't need such a new config. We can just check 
   ```
   Utils.isTesting
   ```
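   
   A self-contained analog of the suggestion: gate the test-only normalization 
on a testing flag rather than a dedicated config (in Spark the flag would be 
`Utils.isTesting`, driven by the `spark.testing` property; the regex below is 
an illustrative stand-in for the PR's normalization):
   
   ```scala
   def canonicalizeForError(planString: String, isTesting: Boolean): String =
     // normalize expression IDs like `a#12` only when running tests, so golden
     // files stay deterministic while production errors keep the raw plan
     if (isTesting) planString.replaceAll("#\\d+", "#x") else planString

   canonicalizeForError("Project [a#12, b#34]", isTesting = true)  // Project [a#x, b#x]
   ```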






[GitHub] [spark] gengliangwang commented on pull request #37840: [SPARK-40416][SQL] Move subquery expression CheckAnalysis error messages to use the new error framework

2022-09-16 Thread GitBox


gengliangwang commented on PR #37840:
URL: https://github.com/apache/spark/pull/37840#issuecomment-1249703749

   LGTM except one comment 





[GitHub] [spark] Yaohua628 commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`

2022-09-16 Thread GitBox


Yaohua628 commented on code in PR #37905:
URL: https://github.com/apache/spark/pull/37905#discussion_r973339562


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:
##
@@ -590,7 +591,7 @@ class MicroBatchExecution(
 val newBatchesPlan = logicalPlan transform {
   // For v1 sources.
   case StreamingExecutionRelation(source, output, catalogTable) =>
-newData.get(source).map { dataPlan =>
+mutableNewData.get(source).map { dataPlan =>

Review Comment:
   Np - an unintentional fix :-)
   Thanks for helping!






[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-16 Thread GitBox


EnricoMi commented on code in PR #37407:
URL: https://github.com/apache/spark/pull/37407#discussion_r973273719


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -869,26 +873,55 @@ class Analyzer(override val catalogManager: 
CatalogManager)
 def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsWithPruning(
   _.containsPattern(UNPIVOT), ruleId) {
 
-  // once children and ids are resolved, we can determine values, if non 
were given
-  case up: Unpivot if up.childrenResolved && up.ids.forall(_.resolved) && 
up.values.isEmpty =>
-up.copy(values = up.child.output.diff(up.ids))
+  // once children are resolved, we can determine values from ids and vice 
versa
+  // if only either is given
+  case up: Unpivot if up.childrenResolved &&
+up.ids.exists(_.forall(_.resolved)) && up.values.isEmpty =>
+up.copy(values =
+  Some(
+up.child.output.diff(up.ids.get.flatMap(_.references))

Review Comment:
   The former.






[GitHub] [spark] Yaohua628 commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`

2022-09-16 Thread GitBox


Yaohua628 commented on code in PR #37905:
URL: https://github.com/apache/spark/pull/37905#discussion_r973485968


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:
##
@@ -590,7 +591,7 @@ class MicroBatchExecution(
 val newBatchesPlan = logicalPlan transform {
   // For v1 sources.
   case StreamingExecutionRelation(source, output, catalogTable) =>
-newData.get(source).map { dataPlan =>
+mutableNewData.get(source).map { dataPlan =>

Review Comment:
   Got it - could you share an example? In this case, does that mean `leaf : 
source = 1 : N`?






[GitHub] [spark] parthchandra commented on pull request #37558: [SPARK-38954][CORE] Implement sharing of cloud credentials among driver and executors

2022-09-16 Thread GitBox


parthchandra commented on PR #37558:
URL: https://github.com/apache/spark/pull/37558#issuecomment-1249925342

   I like the idea of having an authentication-agnostic credentials manager. I 
would have done it exactly as you are suggesting, except that my knowledge of 
Kerberos is not very deep, and I have limited ability to test all the 
implementations that currently depend on HadoopDelegationTokenManager, so I 
decided not to touch the existing implementation (well, not too much anyway).
   Also, I am new to the SPIP process, so some guidance would be very welcome. 
I can start a document, but it will need some homework on my end before it can 
be discussed publicly.
   Meanwhile, what would be your recommendation regarding this PR?





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37348: [SPARK-39854][SQL] replaceWithAliases should keep the original children for Generate

2022-09-16 Thread GitBox


dongjoon-hyun commented on code in PR #37348:
URL: https://github.com/apache/spark/pull/37348#discussion_r973512461


##
sql/core/src/test/scala/org/apache/spark/sql/execution/SparkPlanSuite.scala:
##
@@ -143,6 +143,48 @@ class SparkPlanSuite extends QueryTest with 
SharedSparkSession {
   }
 }
   }
+
+  test("SPARK-39854: replaceWithAliases should keep the order of Generate 
children") {

Review Comment:
   Thank you for providing the end-to-end test. Can we have a test case in 
`NestedColumnAliasingSuite.scala` because we are touching 
`NestedColumnAliasing`?






[GitHub] [spark] zhengruifeng opened a new pull request, #37918: [SPARK-40476][ML][SQL] Reduce the shuffle size of ALS

2022-09-16 Thread GitBox


zhengruifeng opened a new pull request, #37918:
URL: https://github.com/apache/spark/pull/37918

   ### What changes were proposed in this pull request?
   Implement a new expression, `CollectTopK`, which uses `Array` instead of 
`BoundedPriorityQueue` in ser/deser.
   
   
   ### Why are the changes needed?
   Reduce the shuffle size of ALS
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Existing test suites.
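   
   For reviewers, a minimal sketch of the ser/deser trick described above (not 
the actual `CollectTopK` code): aggregate with a bounded heap, but ship only a 
flat array between stages.
   
   ```scala
   import scala.collection.mutable

   class TopKBuffer(k: Int) {
     // min-heap of size <= k: the head is the smallest of the current top-k
     private val heap = mutable.PriorityQueue.empty[Int](Ordering.Int.reverse)
     def add(x: Int): Unit = { heap.enqueue(x); if (heap.size > k) heap.dequeue() }
     // shuffle payload: a plain Array, no queue internals or ordering object
     def serialize: Array[Int] = heap.toArray
     def merge(other: Array[Int]): Unit = other.foreach(add)
   }
   ```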
   





[GitHub] [spark] viirya commented on pull request #37348: [SPARK-39854][SQL] replaceWithAliases should keep the original children for Generate

2022-09-16 Thread GitBox


viirya commented on PR #37348:
URL: https://github.com/apache/spark/pull/37348#issuecomment-1249961949

   I'll take a look today or tomorrow. Thanks @dongjoon-hyun 





[GitHub] [spark] zhengruifeng commented on pull request #37897: [SPARK-40445][PS] Refactor `Resampler` for consistency and simplicity

2022-09-16 Thread GitBox


zhengruifeng commented on PR #37897:
URL: https://github.com/apache/spark/pull/37897#issuecomment-1249934673

   Merged into master, thank you @itholic @HyukjinKwon 





[GitHub] [spark] zhengruifeng closed pull request #37897: [SPARK-40445][PS] Refactor `Resampler` for consistency and simplicity

2022-09-16 Thread GitBox


zhengruifeng closed pull request #37897: [SPARK-40445][PS] Refactor `Resampler` 
for consistency and simplicity
URL: https://github.com/apache/spark/pull/37897





[GitHub] [spark] dongjoon-hyun commented on pull request #37348: [SPARK-39854][SQL] replaceWithAliases should keep the original children for Generate

2022-09-16 Thread GitBox


dongjoon-hyun commented on PR #37348:
URL: https://github.com/apache/spark/pull/37348#issuecomment-1249962295

   Thank you!





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37879: [SPARK-40425][SQL] DROP TABLE does not need to do table lookup

2022-09-16 Thread GitBox


dongjoon-hyun commented on code in PR #37879:
URL: https://github.com/apache/spark/pull/37879#discussion_r973544556


##
core/src/test/scala/org/apache/spark/SparkFunSuite.scala:
##
@@ -299,10 +299,15 @@ abstract class SparkFunSuite
   parameters: Map[String, String] = Map.empty,
   matchPVals: Boolean = false,
   queryContext: Array[QueryContext] = Array.empty): Unit = {
-assert(exception.getErrorClass === errorClass)
+val mainErrorClass :: tail = errorClass.split("\\.").toList
+assert(tail.isEmpty || tail.length == 1)
+// TODO: remove the `errorSubClass` parameter.

Review Comment:
   Just a nit: if we use an IDed TODO with the JIRA id, a contributor can pick 
up the item more easily.
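   
   For example (the JIRA id below is a placeholder):
   
   ```scala
   // TODO(SPARK-XXXXX): remove the `errorSubClass` parameter.
   ```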






[GitHub] [spark] cloud-fan commented on pull request #37896: Revert [SPARK-24544][SQL] Print actual failure cause when look up function failed

2022-09-16 Thread GitBox


cloud-fan commented on PR #37896:
URL: https://github.com/apache/spark/pull/37896#issuecomment-1249973445

   also cc @viirya 





[GitHub] [spark] viirya commented on pull request #37896: Revert [SPARK-24544][SQL] Print actual failure cause when look up function failed

2022-09-16 Thread GitBox


viirya commented on PR #37896:
URL: https://github.com/apache/spark/pull/37896#issuecomment-1250005249

   Oh, that's right. SPARK-24544 was a long time ago; it is better to have a 
new JIRA.





[GitHub] [spark] dongjoon-hyun closed pull request #37892: [SPARK-40436][BUILD] Upgrade Scala to 2.12.17

2022-09-16 Thread GitBox


dongjoon-hyun closed pull request #37892: [SPARK-40436][BUILD] Upgrade Scala to 
2.12.17
URL: https://github.com/apache/spark/pull/37892





[GitHub] [spark] mridulm commented on pull request #37899: [SPARK-40455][CORE]Abort result stage directly when it failed caused by FetchFailedException

2022-09-16 Thread GitBox


mridulm commented on PR #37899:
URL: https://github.com/apache/spark/pull/37899#issuecomment-1249909995

   I will take a look at this PR hopefully next week.
   +CC @Ngone51 





[GitHub] [spark] zhengruifeng closed pull request #37913: [SPARK-40447][PS] Implement `kendall` correlation in `DataFrame.corr`

2022-09-16 Thread GitBox


zhengruifeng closed pull request #37913: [SPARK-40447][PS] Implement `kendall` 
correlation in `DataFrame.corr`
URL: https://github.com/apache/spark/pull/37913





[GitHub] [spark] zhengruifeng commented on pull request #37913: [SPARK-40447][PS] Implement `kendall` correlation in `DataFrame.corr`

2022-09-16 Thread GitBox


zhengruifeng commented on PR #37913:
URL: https://github.com/apache/spark/pull/37913#issuecomment-1249935182

   Merged into master, thank you @HyukjinKwon 





[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`

2022-09-16 Thread GitBox


HeartSaVioR commented on code in PR #37905:
URL: https://github.com/apache/spark/pull/37905#discussion_r973510971


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:
##
@@ -590,7 +591,7 @@ class MicroBatchExecution(
 val newBatchesPlan = logicalPlan transform {
   // For v1 sources.
   case StreamingExecutionRelation(source, output, catalogTable) =>
-newData.get(source).map { dataPlan =>
+mutableNewData.get(source).map { dataPlan =>

Review Comment:
   The code comment actually doesn't say much and I'm speculating. Let's just 
try a best effort: self-union and self-join. `df = spark.readStream... -> 
df.union(df)` / `df = spark.readStream... -> df.join(df)`
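   
   Spelled out as a hedged repro sketch (source choice is arbitrary; a real 
stream-stream join would additionally need watermarks to run):
   
   ```scala
   val df = spark.readStream.format("rate").load()
   // self-union: two StreamingExecutionRelation leaves backed by one Source
   val unioned = df.union(df)
   // self-join: likewise one Source under two leaf nodes
   val joined = df.as("l").join(df.as("r"), "value")
   ```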






[GitHub] [spark] alex-balikov commented on a diff in pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-16 Thread GitBox


alex-balikov commented on code in PR #37893:
URL: https://github.com/apache/spark/pull/37893#discussion_r973478091


##
python/pyspark/sql/pandas/group_ops.py:
##
@@ -216,6 +218,105 @@ def applyInPandas(
 jdf = self._jgd.flatMapGroupsInPandas(udf_column._jc.expr())
 return DataFrame(jdf, self.session)
 
+def applyInPandasWithState(
+self,
+func: "PandasGroupedMapFunctionWithState",
+outputStructType: Union[StructType, str],
+stateStructType: Union[StructType, str],
+outputMode: str,
+timeoutConf: str,
+) -> DataFrame:
+"""
+Applies the given function to each group of data, while maintaining a 
user-defined
+per-group state. The result Dataset will represent the flattened 
record returned by the
+function.
+
+For a streaming Dataset, the function will be invoked for each group 
repeatedly in every
+trigger, and updates to each group's state will be saved across 
invocations. The function
+will also be invoked for each timed-out state repeatedly. The sequence 
of the invocation
+will be input data -> state timeout. When the function is invoked for 
state timeout, there
+will be no data being presented.
+
+The function should takes parameters (key, 
Iterator[`pandas.DataFrame`], state) and
+returns another Iterator[`pandas.DataFrame`]. The grouping key(s) will 
be passed as a tuple
+of numpy data types, e.g., `numpy.int32` and `numpy.float64`. The 
state will be passed as
+:class:`pyspark.sql.streaming.state.GroupStateImpl`.
+
+For each group, all columns are passed together as `pandas.DataFrame` 
to the user-function,
+and the returned `pandas.DataFrame` across all invocations are 
combined as a
+:class:`DataFrame`. Note that the user function should loop through 
and process all
+elements in the iterator. The user function should not make a guess of 
the number of
+elements in the iterator.
+
+The `outputStructType` should be a :class:`StructType` describing the 
schema of all
+elements in returned value, `pandas.DataFrame`. The column labels of 
all elements in
+returned value, `pandas.DataFrame` must either match the field names 
in the defined
+schema if specified as strings, or match the field data types by 
position if not strings,
+e.g. integer indices.
+
+The `stateStructType` should be :class:`StructType` describing the 
schema of user-defined
+state. The value of state will be presented as a tuple, as well as the 
update should be
+performed with the tuple. User defined types e.g. native Python class 
types are not
+supported. Alternatively, you can pickle the data and produce the data 
as BinaryType, but

Review Comment:
   'Alternatively, you can pickle the data ...' - instead say
   
   'For such cases, the user should pickle the data into BinaryType. Note that 
this approach may be sensitive to backwards and forward compatibility issues of 
Python pickle, and Spark cannot guarantee compatibility.'
   
   though I think you could drop the note as that is orthogonal to Spark.



##
python/pyspark/sql/pandas/group_ops.py:
##
@@ -216,6 +218,105 @@ def applyInPandas(
 jdf = self._jgd.flatMapGroupsInPandas(udf_column._jc.expr())
 return DataFrame(jdf, self.session)
 
+def applyInPandasWithState(
+self,
+func: "PandasGroupedMapFunctionWithState",
+outputStructType: Union[StructType, str],
+stateStructType: Union[StructType, str],
+outputMode: str,
+timeoutConf: str,
+) -> DataFrame:
+"""
+Applies the given function to each group of data, while maintaining a 
user-defined
+per-group state. The result Dataset will represent the flattened 
record returned by the
+function.
+
+For a streaming Dataset, the function will be invoked for each group 
repeatedly in every
+trigger, and updates to each group's state will be saved across 
invocations. The function
+will also be invoked for each timed-out state repeatedly. The sequence 
of the invocation
+will be input data -> state timeout. When the function is invoked for 
state timeout, there
+will be no data being presented.
+
+The function should takes parameters (key, 
Iterator[`pandas.DataFrame`], state) and
+returns another Iterator[`pandas.DataFrame`]. The grouping key(s) will 
be passed as a tuple
+of numpy data types, e.g., `numpy.int32` and `numpy.float64`. The 
state will be passed as
+:class:`pyspark.sql.streaming.state.GroupStateImpl`.
+
+For each group, all columns are passed together as `pandas.DataFrame` 
to the user-function,
+and the returned `pandas.DataFrame` across all invocations are 
combined as a
+:class:`DataFrame`. Note that the user function should loop through and 
process all elements in the iterator. The user function should not make a 
guess of the number of elements in the iterator.

[GitHub] [spark] viirya commented on a diff in pull request #37896: Revert [SPARK-24544][SQL] Print actual failure cause when look up function failed

2022-09-16 Thread GitBox


viirya commented on code in PR #37896:
URL: https://github.com/apache/spark/pull/37896#discussion_r973528716


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:
##
@@ -1588,10 +1587,9 @@ class SessionCatalog(
   TableFunctionRegistry.builtin.functionExists(name)
   }
 
-  protected[sql] def failFunctionLookup(

Review Comment:
   Hmm, where do we set `cause`? As far as I can see, we always call 
`failFunctionLookup` without `cause` (except for the test case).



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala:
##
@@ -74,19 +74,17 @@ case class NoSuchPartitionException(
 case class NoSuchPermanentFunctionException(db: String, func: String)
   extends AnalysisException(s"Function '$func' not found in database '$db'")
 
-case class NoSuchFunctionException(
-override val message: String,
-override val cause: Option[Throwable])
-  extends AnalysisException(message, cause = cause) {
+case class NoSuchFunctionException(override val message: String)
+  extends AnalysisException(message) {
 
-  def this(db: String, func: String, cause: Option[Throwable] = None) = {
+  def this(db: String, func: String) = {
 this(s"Undefined function: '$func'. " +
 s"This function is neither a registered temporary function nor " +

Review Comment:
   nit:
   ```suggestion
   "This function is neither a registered temporary function nor " +
   ```






[GitHub] [spark] AmplabJenkins commented on pull request #37887: [SPARK-40360] [WIP] ALREADY_EXISTS and NOT_FOUND exceptions

2022-09-16 Thread GitBox


AmplabJenkins commented on PR #37887:
URL: https://github.com/apache/spark/pull/37887#issuecomment-1249988028

   Can one of the admins verify this patch?





[GitHub] [spark] mridulm commented on a diff in pull request #37533: [SPARK-40096]Fix finalize shuffle stage slow due to connection creation slow

2022-09-16 Thread GitBox


mridulm commented on code in PR #37533:
URL: https://github.com/apache/spark/pull/37533#discussion_r973488467


##
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##
@@ -4443,36 +4443,115 @@ class DAGSchedulerSuite extends SparkFunSuite with 
TempLocalSparkContext with Ti
 assert(mapStatuses.count(s => s != null && s.location.executorId == 
"hostB-exec") === 1)
   }
 
-  test("SPARK-40096: Send finalize events even if shuffle merger blocks 
indefinitely") {
+  test("SPARK-40096: Send finalize events even if shuffle merger blocks 
indefinitely " +
+"with registerMergeResults is true") {
 initPushBasedShuffleConfs(conf)
 
+sc.conf.set("spark.shuffle.push.results.timeout", "1s")
+val myScheduler = new MyDAGScheduler(
+  sc,
+  taskScheduler,
+  sc.listenerBus,
+  mapOutputTracker,
+  blockManagerMaster,
+  sc.env,
+  shuffleMergeFinalize = false)
+
+val mergerLocs = Seq(makeBlockManagerId("hostA"), 
makeBlockManagerId("hostB"))
+val timeoutSecs = 1
+val sendRequestsLatch = new CountDownLatch(mergerLocs.size)
+val completeLatch = new CountDownLatch(mergerLocs.size)
+val canSendRequestLatch = new CountDownLatch(1)
+
 val blockStoreClient = mock(classOf[ExternalBlockStoreClient])
 val blockStoreClientField = 
classOf[BlockManager].getDeclaredField("blockStoreClient")
 blockStoreClientField.setAccessible(true)
 blockStoreClientField.set(sc.env.blockManager, blockStoreClient)
+
 val sentHosts = ArrayBuffer[String]()
+var hostAInterrupted = false
 doAnswer { (invoke: InvocationOnMock) =>
   val host = invoke.getArgument[String](0)
-  sentHosts += host
-  // Block FinalizeShuffleMerge rpc for 2 seconds
-  if (invoke.getArgument[String](0) == "hostA") {
-Thread.sleep(2000)
+  sendRequestsLatch.countDown()
+  try {
+if (host == "hostA") {
+  canSendRequestLatch.await(timeoutSecs * 2, TimeUnit.SECONDS)
+}
+sentHosts += host
+  } catch {
+case _: InterruptedException => hostAInterrupted = true
+  } finally {
+completeLatch.countDown()
   }
 }.when(blockStoreClient).finalizeShuffleMerge(any(), any(), any(), any(), 
any())
 
 val shuffleMapRdd = new MyRDD(sc, 1, Nil)
 val shuffleDep = new ShuffleDependency(shuffleMapRdd, new 
HashPartitioner(2))
-shuffleDep.setMergerLocs(Seq(makeBlockManagerId("hostA"), 
makeBlockManagerId("hostB")))
-val shuffleStage = scheduler.createShuffleMapStage(shuffleDep, 0)
-
-Seq(true, false).foreach { registerMergeResults =>
-  sentHosts.clear()
-  scheduler.finalizeShuffleMerge(shuffleStage, registerMergeResults)
-  verify(blockStoreClient, times(2))
-.finalizeShuffleMerge(any(), any(), any(), any(), any())
-  assert((sentHosts diff Seq("hostA", "hostB")).isEmpty)
-  reset(blockStoreClient)
-}
+shuffleDep.setMergerLocs(mergerLocs)
+val shuffleStage = myScheduler.createShuffleMapStage(shuffleDep, 0)
+
+myScheduler.finalizeShuffleMerge(shuffleStage, true)
+sendRequestsLatch.await()
+verify(blockStoreClient, times(2))
+  .finalizeShuffleMerge(any(), any(), any(), any(), any())
+assert(sentHosts === Seq("hostB"))
+completeLatch.await()
+assert(hostAInterrupted)
+  }
+
+  test("SPARK-40096: Send finalize events even if shuffle merger blocks 
indefinitely " +
+"with registerMergeResults is false") {

Review Comment:
   Can we merge this test with the previous one ?
   Essentially, something like:
   
   ```
   
   Seq(true, false).foreach { registerMergeResults =>
 test("SPARK-40096: Send finalize events even if shuffle merger blocks 
indefinitely " +
 s"with registerMergeResults is $registerMergeResults") {
  
   
  myScheduler.finalizeShuffleMerge(shuffleStage, registerMergeResults)
   
 ...
   }
   ```
   
   






[GitHub] [spark] mridulm commented on pull request #37533: [SPARK-40096]Fix finalize shuffle stage slow due to connection creation slow

2022-09-16 Thread GitBox


mridulm commented on PR #37533:
URL: https://github.com/apache/spark/pull/37533#issuecomment-1249916683

   +CC @otterc, @Ngone51 PTAL





[GitHub] [spark] zhengruifeng commented on pull request #37918: [SPARK-40476][ML][SQL] Reduce the shuffle size of ALS

2022-09-16 Thread GitBox


zhengruifeng commented on PR #37918:
URL: https://github.com/apache/spark/pull/37918#issuecomment-1249957261

   Take the 
[`ALSExample`](https://github.com/apache/spark/blob/e1ea806b3075d279b5f08a29fe4c1ad6d3c4191a/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala)
 as an example:
   
   ```
   import org.apache.spark.ml.recommendation._
   
   case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
   
   def parseRating(str: String): Rating = {
   val fields = str.split("::")
   assert(fields.size == 4)
   Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, 
fields(3).toLong)
   }
   
   val ratings = 
spark.read.textFile("data/mllib/als/sample_movielens_ratings.txt").map(parseRating).toDF()
   
   val als = new 
ALS().setMaxIter(1).setRegParam(0.01).setUserCol("userId").setItemCol("movieId").setRatingCol("rating")
   
   val model = als.fit(ratings)
   
   model.recommendForAllItems(10).collect()
   ```
   
   before:
   https://user-images.githubusercontent.com/7322292/190832964-3f31bcd4-6bfb-445a-a339-b415f82719e3.png
   
   
   
   after:
   https://user-images.githubusercontent.com/7322292/190832980-d0ccc2d3-f4d9-4801-9046-9d56f2fbab3c.png
   
   
   The shuffle size in this case was reduced from `298.4 KiB` to `130.3 KiB`.





[GitHub] [spark] dongjoon-hyun closed pull request #37914: [SPARK-40471][BUILD] Upgrade RoaringBitmap to 0.9.32

2022-09-16 Thread GitBox


dongjoon-hyun closed pull request #37914: [SPARK-40471][BUILD] Upgrade 
RoaringBitmap to 0.9.32
URL: https://github.com/apache/spark/pull/37914





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37885: [SPARK-40428][CORE][WIP] Fix shutdown hook in the CoarseGrainedSchedulerBackend

2022-09-16 Thread GitBox


dongjoon-hyun commented on code in PR #37885:
URL: https://github.com/apache/spark/pull/37885#discussion_r973545013


##
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala:
##
@@ -971,18 +971,30 @@ private[spark] class TaskSchedulerImpl(
   }
 
   override def stop(): Unit = {
-speculationScheduler.shutdown()
+Utils.tryLogNonFatalError {
+  speculationScheduler.shutdown()
+}
 if (backend != null) {
-  backend.stop()
+  Utils.tryLogNonFatalError {
+backend.stop()
+  }
 }
 if (taskResultGetter != null) {
-  taskResultGetter.stop()
+  Utils.tryLogNonFatalError {
+taskResultGetter.stop()
+  }
 }
 if (barrierCoordinator != null) {
-  barrierCoordinator.stop()
+  Utils.tryLogNonFatalError {
+barrierCoordinator.stop()
+  }
+}
+Utils.tryLogNonFatalError {
+  starvationTimer.cancel()
+}
+Utils.tryLogNonFatalError {
+  abortTimer.cancel()

Review Comment:
   Does `Timer.cancel()` also require `Utils.tryLogNonFatalError`?






[GitHub] [spark] MaxGekk commented on pull request #37916: [SPARK-40473][SQL] Migrate parsing errors onto error classes

2022-09-16 Thread GitBox


MaxGekk commented on PR #37916:
URL: https://github.com/apache/spark/pull/37916#issuecomment-1249570897

   > You seem to assume < 1000 of these. But just this one PR consumes close to 
a hundred slots
   
   Some time ago, I counted the total number of exceptions to be ported onto 
error classes at around 800. I don't see any problem with starting to use 4 
digits after 999.
   
   > How do we keep the numbering straight for the next n PRs?
   
   We have 3 types of errors, actually: parsing, compilation, and execution. We 
could migrate them one by one using sequential numbers. Otherwise the PRs will 
conflict a lot.
   
   >  I assume you used some tooling to whip this up so fast.
   
   I think it will be faster to manually port the exceptions without any 
tooling.





[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`

2022-09-16 Thread GitBox


HeartSaVioR commented on code in PR #37905:
URL: https://github.com/apache/spark/pull/37905#discussion_r972646144


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala:
##
@@ -524,10 +525,11 @@ class FileMetadataStructSuite extends QueryTest with 
SharedSparkSession {
 .select("*", "_metadata")
 .writeStream.format("json")
 .option("checkpointLocation", dir.getCanonicalPath + 
"/target/checkpoint")
+.trigger(Trigger.Once())

Review Comment:
   nit: Please change to `Trigger.AvailableNow()` and see whether this breaks 
the test or not. We are going to produce a deprecation warning for 
`Trigger.Once()` from Spark 3.4.0.
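   
   For reference, a minimal sketch of the suggested change (assuming the same 
writer chain as in the diff above, with `df` standing for the streaming 
DataFrame being written):
   
   ```scala
   // Hedged sketch: Trigger.AvailableNow() processes all data available at
   // start and then stops, like Trigger.Once() but possibly in multiple
   // micro-batches.
   df.select("*", "_metadata")
     .writeStream.format("json")
     .option("checkpointLocation", dir.getCanonicalPath + "/target/checkpoint")
     .trigger(Trigger.AvailableNow())
   ```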






[GitHub] [spark] cloud-fan commented on a diff in pull request #37625: [SPARK-40177][SQL] Simplify condition of form (a==b) || (a==null&&b==null) to a<=>b

2022-09-16 Thread GitBox


cloud-fan commented on code in PR #37625:
URL: https://github.com/apache/spark/pull/37625#discussion_r972659580


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala:
##
@@ -412,6 +412,16 @@ object BooleanSimplification extends Rule[LogicalPlan] 
with PredicateHelper {
   }
 }
 
+  case Or(EqualTo(l, r), And(IsNull(c1), IsNull(c2)))

Review Comment:
   Assume that we have a chain of predicates combined by OR `cond1 OR cond2 OR 
cond3 OR ... condN`. I think we can merge `condX` and `condY` if they are 
`EqualTo(l, r)` and `And(IsNull(l), isNull(r))`. This is more general than the 
current approach.
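   
   A minimal sketch of the two-disjunct shape under discussion (assuming 
Catalyst's expression classes; the generalization would apply the same pairing 
across every disjunct of the OR chain):
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions._
   
   // Hedged sketch, not the actual rule body: fuse the pattern
   // (l = r) OR (l IS NULL AND r IS NULL) into the null-safe equality l <=> r.
   def fuseNullSafeEq(e: Expression): Expression = e match {
     case Or(EqualTo(l, r), And(IsNull(c1), IsNull(c2)))
         if l.semanticEquals(c1) && r.semanticEquals(c2) =>
       EqualNullSafe(l, r)
     case other => other
   }
   ```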






[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver

2022-09-16 Thread GitBox


dongjoon-hyun commented on code in PR #37906:
URL: https://github.com/apache/spark/pull/37906#discussion_r972668048


##
dev/create-release/spark-rm/Dockerfile:
##
@@ -53,7 +53,7 @@ ARG GEM_PKGS="bundler:2.2.9"
 # the most current package versions (instead of potentially using old versions 
cached by docker).
 RUN apt-get clean && apt-get update && $APT_INSTALL gnupg ca-certificates && \
   echo 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/' >> 
/etc/apt/sources.list && \
-  gpg --keyserver keyserver.ubuntu.com --recv-key 
E298A3A825C0D65DFD57CBB651716619E084DAB9 && \
+  gpg --keyserver hkps://keyserver.ubuntu.com --recv-key 
E298A3A825C0D65DFD57CBB651716619E084DAB9 && \

Review Comment:
   We have the same code in another place.
   
https://github.com/apache/spark/blob/32fbd7e83252f96df9c78f2f15f3917167821a12/dev/infra/Dockerfile#L52






[GitHub] [spark] wangyum commented on a diff in pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver

2022-09-16 Thread GitBox


wangyum commented on code in PR #37906:
URL: https://github.com/apache/spark/pull/37906#discussion_r972670730


##
dev/create-release/spark-rm/Dockerfile:
##
@@ -53,7 +53,7 @@ ARG GEM_PKGS="bundler:2.2.9"
 # the most current package versions (instead of potentially using old versions 
cached by docker).
 RUN apt-get clean && apt-get update && $APT_INSTALL gnupg ca-certificates && \
   echo 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/' >> 
/etc/apt/sources.list && \
-  gpg --keyserver keyserver.ubuntu.com --recv-key 
E298A3A825C0D65DFD57CBB651716619E084DAB9 && \
+  gpg --keyserver hkps://keyserver.ubuntu.com --recv-key 
E298A3A825C0D65DFD57CBB651716619E084DAB9 && \

Review Comment:
   OK.






[GitHub] [spark] sadikovi opened a new pull request, #37911: [SPARK-40470] Handle GetArrayStructFields and GetMapValue in "arrays_zip" function

2022-09-16 Thread GitBox


sadikovi opened a new pull request, #37911:
URL: https://github.com/apache/spark/pull/37911

   
   
   ### What changes were proposed in this pull request?
   
   
   This is a follow-up for https://github.com/apache/spark/pull/37833.
   
   The PR fixes column names in `arrays_zip` function for the cases when 
`GetArrayStructFields` and `GetMapValue` expressions are used (see unit tests 
for more details).
   
   Before the patch, the column names would be indexes, or an AnalysisException 
would be thrown in the `GetArrayStructFields` case.
   
   ### Why are the changes needed?
   
   
   Fixes an inconsistency issue in Spark 3.2 and onwards where the fields would 
be labeled as indexes instead of column names.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No.
   
   
   ### How was this patch tested?
   
   
   I added unit tests that reproduce the issue and confirmed that the patch 
fixes them.
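   
   For illustration, a hedged sketch of the two expression shapes (hypothetical 
schema and names, not taken from this PR):
   
   ```scala
   import org.apache.spark.sql.functions._
   
   // arr is an array of structs, m maps a key to an array.
   val df = spark.sql(
     "SELECT array(named_struct('i', 1)) AS arr, map('k', array(1)) AS m")
   
   // col("arr.i") resolves to GetArrayStructFields and col("m").getItem("k")
   // to GetMapValue. With this patch the zipped struct fields should be named
   // after the extracted field/key (i, k) instead of positional indexes.
   df.select(arrays_zip(col("arr.i"), col("m").getItem("k"))).printSchema()
   ```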





[GitHub] [spark] Yikun commented on pull request #37888: [SPARK-40196][PYTHON][PS] Consolidate `lit` function with NumPy scalar in sql and pandas module

2022-09-16 Thread GitBox


Yikun commented on PR #37888:
URL: https://github.com/apache/spark/pull/37888#issuecomment-1248987419

   ```
 test_repeat 
(pyspark.pandas.tests.test_spark_functions.SparkFunctionsTests) ... FAIL 
(0.052s)
   
   ==
   FAIL [0.052s]: test_repeat 
(pyspark.pandas.tests.test_spark_functions.SparkFunctionsTests)
   --
   Traceback (most recent call last):
 File 
"/__w/spark/spark/python/pyspark/pandas/tests/test_spark_functions.py", line 
28, in test_repeat
   self.assertTrue(spark_column_equals(SF.repeat(F.lit(1), 2), 
F.repeat(F.lit(1), 2)))
   AssertionError: False is not true
   
   --
   Ran 1 test in 8.471s
   ```
   
   Still failing due to `test_repeat`. Initial investigation:
   ```
   F.repeat(F.lit(1), 2).__dict__
   Out[3]: {'_jc': JavaObject id=o50}
   SF.repeat(F.lit(1), 2).__dict__
   Out[4]: {'_jc': JavaObject id=o54}
   ```
   
   According to 
https://github.com/apache/spark/pull/37888#discussion_r971408190 , looks like 
we need to remove this test?
   
   ```python
   class SparkFunctionsTests(PandasOnSparkTestCase):
   def test_repeat(self):
   # TODO: Placeholder
   pass
   ```





[GitHub] [spark] zhengruifeng commented on pull request #37888: [SPARK-40196][PYTHON][PS] Consolidate `lit` function with NumPy scalar in sql and pandas module

2022-09-16 Thread GitBox


zhengruifeng commented on PR #37888:
URL: https://github.com/apache/spark/pull/37888#issuecomment-1248990425

   @Yikun Let's comment out this test for now to unblock other PRs
   





[GitHub] [spark] wangyum commented on pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver

2022-09-16 Thread GitBox


wangyum commented on PR #37906:
URL: https://github.com/apache/spark/pull/37906#issuecomment-1248993445

   > Oh, @wangyum . It seems that you made an accidental commit on the `master` 
branch.
   > 
   > * 
[694cac6](https://github.com/apache/spark/commit/694cac63da3bfa651132eca9fee3278544616dc3)
   
   Sorry. I didn't switch to my branch.





[GitHub] [spark] Yikun opened a new pull request, #37912: [SPARK-40196][PYTHON][PS][FOLLOWUP] SparkFunctionsTests.test_repeat

2022-09-16 Thread GitBox


Yikun opened a new pull request, #37912:
URL: https://github.com/apache/spark/pull/37912

   ### What changes were proposed in this pull request?
   Mark `SparkFunctionsTests.test_repeat` as a placeholder.
   
   
   ### Why are the changes needed?
   ```
 test_repeat 
(pyspark.pandas.tests.test_spark_functions.SparkFunctionsTests) ... FAIL 
(0.052s)
   
   ==
   FAIL [0.052s]: test_repeat 
(pyspark.pandas.tests.test_spark_functions.SparkFunctionsTests)
   --
   Traceback (most recent call last):
 File 
"/__w/spark/spark/python/pyspark/pandas/tests/test_spark_functions.py", line 
28, in test_repeat
   self.assertTrue(spark_column_equals(SF.repeat(F.lit(1), 2), 
F.repeat(F.lit(1), 2)))
   AssertionError: False is not true
   
   --
   Ran 1 test in 8.471s
   ```
   
   According to 
https://github.com/apache/spark/pull/37888#discussion_r971408190 , we'd better 
skip it for now.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   CI passed
   





[GitHub] [spark] dongjoon-hyun commented on pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver

2022-09-16 Thread GitBox


dongjoon-hyun commented on PR #37906:
URL: https://github.com/apache/spark/pull/37906#issuecomment-1248995872

   Ya, accidents happen sometimes. No worries. If CI succeeds, nobody gets 
hurt.





[GitHub] [spark] zhengruifeng closed pull request #37912: [SPARK-40196][PYTHON][PS][FOLLOWUP] Skip SparkFunctionsTests.test_repeat

2022-09-16 Thread GitBox


zhengruifeng closed pull request #37912: [SPARK-40196][PYTHON][PS][FOLLOWUP] 
Skip SparkFunctionsTests.test_repeat
URL: https://github.com/apache/spark/pull/37912





[GitHub] [spark] wangyum commented on pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver

2022-09-16 Thread GitBox


wangyum commented on PR #37906:
URL: https://github.com/apache/spark/pull/37906#issuecomment-1249000489

   Could we force push to overwrite that commit?





[GitHub] [spark] HyukjinKwon closed pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver

2022-09-16 Thread GitBox


HyukjinKwon closed pull request #37906: [SPARK-40463][INFRA] Update gpg's 
keyserver
URL: https://github.com/apache/spark/pull/37906





[GitHub] [spark] HyukjinKwon commented on pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver

2022-09-16 Thread GitBox


HyukjinKwon commented on PR #37906:
URL: https://github.com/apache/spark/pull/37906#issuecomment-1249003622

   Merged to master.





[GitHub] [spark] dongjoon-hyun commented on pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver

2022-09-16 Thread GitBox


dongjoon-hyun commented on PR #37906:
URL: https://github.com/apache/spark/pull/37906#issuecomment-1249003476

   Yep, reverting is possible. I'll leave it to you, @HyukjinKwon .





[GitHub] [spark] zhengruifeng opened a new pull request, #37913: [SPARK-40447][PS] Implement `kendall` correlation in `DataFrame.corr`

2022-09-16 Thread GitBox


zhengruifeng opened a new pull request, #37913:
URL: https://github.com/apache/spark/pull/37913

   ### What changes were proposed in this pull request?
   Implement `kendall` correlation in `DataFrame.corr`
   
   
   ### Why are the changes needed?
   for API coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes, new correlation option:
   ```
   In [1]: import pyspark.pandas as ps
   
   In [2]: df = ps.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)], 
columns=['dogs', 'cats'])
   
   In [3]: df.corr('kendall')
   Out[3]: 
 dogs  cats
   dogs  1.00 -0.912871
   cats -0.912871  1.00
   
   In [4]: df.to_pandas().corr('kendall')
   /Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: 
PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's 
memory. It should only be used if the resulting pandas DataFrame is expected to 
be small.
 warnings.warn(message, PandasAPIOnSparkAdviceWarning)
   Out[4]: 
 dogs  cats
   dogs  1.00 -0.912871
   cats -0.912871  1.00
   ```
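   
   For reference, a hedged note: this assumes the tau-b variant of Kendall's 
correlation (the pandas default), which adjusts for ties:
   
   ```
   \tau_b = (P - Q) / \sqrt{(P + Q + T)(P + Q + U)}
   ```
   
   where P and Q count concordant and discordant pairs, and T and U count pairs 
tied only in the first and only in the second variable, respectively.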
   
   
   ### How was this patch tested?
   added UT





[GitHub] [spark] Yikun commented on a diff in pull request #37843: [SPARK-40398][CORE][SQL] Use Loop instead of Arrays.stream api

2022-09-16 Thread GitBox


Yikun commented on code in PR #37843:
URL: https://github.com/apache/spark/pull/37843#discussion_r972651836


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Expression.java:
##
@@ -44,7 +46,12 @@ public interface Expression {
* List of fields or columns that are referenced by this expression.
*/
   default NamedReference[] references() {
-return Arrays.stream(children()).map(e -> e.references())
-  .flatMap(Arrays::stream).distinct().toArray(NamedReference[]::new);
+// SPARK-40398: Replace `Arrays.stream()...distinct()`
+// to this for perf gain, the result order is not important.
+Set<NamedReference> set = new HashSet<>();
+for (Expression e : children()) {
+  Collections.addAll(set, e.references());
+}
+return set.toArray(new NamedReference[0]);

Review Comment:
   
https://github.com/apache/spark/commit/254bd80278843b3bc13584ca2f04391a770a78c7






[GitHub] [spark] wangyum opened a new pull request, #37910: [SPARK-40469][CORE] Avoid creating directory failures

2022-09-16 Thread GitBox


wangyum opened a new pull request, #37910:
URL: https://github.com/apache/spark/pull/37910

   ### What changes were proposed in this pull request?
   
   This PR replaces `Files.createDirectory` with `Files.createDirectories`.
   
   ### Why are the changes needed?
   
   To avoid directory creation failures if the parent directory was removed by 
YARN:
   ```
   java.nio.file.NoSuchFileException: /hadoop/3/yarn/local/usercache//appcache/application_1654776504115_37917/blockmgr-e18b484f-8c49-4c7d-b649-710439b0e4c3/3c
at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at 
sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
at java.nio.file.Files.createDirectory(Files.java:674)
at 
org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:123)
at 
org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:146)
at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:147)
at 
org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:853)
at 
org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:253)
at scala.Option.getOrElse(Option.scala:189)
at 
org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:250)
at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:245)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1383)
at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:245)
at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:109)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:132)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:487)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1417)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:490)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
   ```
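   
   For illustration, a minimal sketch of the difference (hypothetical path, not 
taken from the trace above):
   
   ```scala
   import java.nio.file.{Files, Paths}
   
   // A block-manager-style subdirectory whose parent YARN may have removed.
   val sub = Paths.get("/tmp/blockmgr-demo/3c")
   
   // Files.createDirectory(sub) throws NoSuchFileException when
   // /tmp/blockmgr-demo is gone, since it only creates the last path element;
   // Files.createDirectories(sub) recreates the missing parents as well and
   // does not fail if the directory already exists.
   Files.createDirectories(sub)
   ```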
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Manually test.





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver

2022-09-16 Thread GitBox


dongjoon-hyun commented on code in PR #37906:
URL: https://github.com/apache/spark/pull/37906#discussion_r972669095


##
dev/create-release/spark-rm/Dockerfile:
##
@@ -53,7 +53,7 @@ ARG GEM_PKGS="bundler:2.2.9"
 # the most current package versions (instead of potentially using old versions 
cached by docker).
 RUN apt-get clean && apt-get update && $APT_INSTALL gnupg ca-certificates && \
   echo 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/' >> 
/etc/apt/sources.list && \
-  gpg --keyserver keyserver.ubuntu.com --recv-key 
E298A3A825C0D65DFD57CBB651716619E084DAB9 && \
+  gpg --keyserver hkps://keyserver.ubuntu.com --recv-key 
E298A3A825C0D65DFD57CBB651716619E084DAB9 && \

Review Comment:
   Both images are based on `Ubuntu 20.04` (Focal Fossa). If we need to change 
it, we had better be consistent.
   
   






[GitHub] [spark] dongjoon-hyun commented on pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver

2022-09-16 Thread GitBox


dongjoon-hyun commented on PR #37906:
URL: https://github.com/apache/spark/pull/37906#issuecomment-1248991651

   Oh, @wangyum . It seems that you made an accidental commit on the `master` 
branch.
   - 
https://github.com/apache/spark/commit/694cac63da3bfa651132eca9fee3278544616dc3





[GitHub] [spark] dongjoon-hyun commented on pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver

2022-09-16 Thread GitBox


dongjoon-hyun commented on PR #37906:
URL: https://github.com/apache/spark/pull/37906#issuecomment-1249001715

   No force-push, @wangyum . We already have another commit on top of yours.





[GitHub] [spark] dongjoon-hyun commented on pull request #37912: [SPARK-40196][PYTHON][PS][FOLLOWUP] Skip SparkFunctionsTests.test_repeat

2022-09-16 Thread GitBox


dongjoon-hyun commented on PR #37912:
URL: https://github.com/apache/spark/pull/37912#issuecomment-1249001096

   +1 for the swift decision.





[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`

2022-09-16 Thread GitBox


HeartSaVioR commented on code in PR #37905:
URL: https://github.com/apache/spark/pull/37905#discussion_r972722394


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:
##
@@ -590,7 +591,7 @@ class MicroBatchExecution(
 val newBatchesPlan = logicalPlan transform {
   // For v1 sources.
   case StreamingExecutionRelation(source, output, catalogTable) =>
-newData.get(source).map { dataPlan =>
+mutableNewData.get(source).map { dataPlan =>
   val hasFileMetadata = output.exists {

Review Comment:
   This requires the Source to indicate that a metadata column was requested 
and to produce the logical plan accordingly when getBatch is called. My 
understanding is that a DSv1 source does not have an interface to receive 
information about which columns will be referenced in the actual query.






[GitHub] [spark] LuciferYang commented on a diff in pull request #37843: [SPARK-40398][CORE][SQL] Use Loop instead of Arrays.stream api

2022-09-16 Thread GitBox


LuciferYang commented on code in PR #37843:
URL: https://github.com/apache/spark/pull/37843#discussion_r972648962


##
sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Expression.java:
##
@@ -44,7 +46,12 @@ public interface Expression {
* List of fields or columns that are referenced by this expression.
*/
   default NamedReference[] references() {
-return Arrays.stream(children()).map(e -> e.references())
-  .flatMap(Arrays::stream).distinct().toArray(NamedReference[]::new);
+// SPARK-40398: Replace `Arrays.stream()...distinct()`
+// to this for perf gain, the result order is not important.
+Set<NamedReference> set = new HashSet<>();
+for (Expression e : children()) {
+  Collections.addAll(set, e.references());
+}
+return set.toArray(new NamedReference[0]);

Review Comment:
   After this change, the Python linter check failed...
   https://github.com/LuciferYang/spark/actions/runs/3065275580/jobs/4949221030
   
   
   ```
   starting mypy annotations test...
   annotations failed mypy checks:
   python/pyspark/pandas/window.py:112: error: Module has no attribute "lit"  
[attr-defined]
   Found 1 error in 1 file (checked 340 source files)
   1
   ```






[GitHub] [spark] cloud-fan commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`

2022-09-16 Thread GitBox


cloud-fan commented on code in PR #37905:
URL: https://github.com/apache/spark/pull/37905#discussion_r972655128


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:
##
@@ -590,7 +591,7 @@ class MicroBatchExecution(
 val newBatchesPlan = logicalPlan transform {
   // For v1 sources.
   case StreamingExecutionRelation(source, output, catalogTable) =>
-newData.get(source).map { dataPlan =>
+mutableNewData.get(source).map { dataPlan =>
   val hasFileMetadata = output.exists {

Review Comment:
   Looking at the code, it seems the problem is that we resolve the metadata 
columns in every micro-batch. Shouldn't we only resolve them once?






[GitHub] [spark] wangyum commented on pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver

2022-09-16 Thread GitBox


wangyum commented on PR #37906:
URL: https://github.com/apache/spark/pull/37906#issuecomment-1249002542

   OK





[GitHub] [spark] HyukjinKwon commented on pull request #37906: [SPARK-40463][INFRA] Update gpg's keyserver

2022-09-16 Thread GitBox


HyukjinKwon commented on PR #37906:
URL: https://github.com/apache/spark/pull/37906#issuecomment-1249002932

   Let me just revert and merge this PR in (just for the sake of trackability).





[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`

2022-09-16 Thread GitBox


HeartSaVioR commented on code in PR #37905:
URL: https://github.com/apache/spark/pull/37905#discussion_r972711729


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:
##
@@ -590,7 +591,7 @@ class MicroBatchExecution(
 val newBatchesPlan = logicalPlan transform {
   // For v1 sources.
   case StreamingExecutionRelation(source, output, catalogTable) =>
-newData.get(source).map { dataPlan =>
+mutableNewData.get(source).map { dataPlan =>

Review Comment:
   While we are here, a probably less intrusive change would be moving (L594 ~ 
L610) to L567.






[GitHub] [spark] HyukjinKwon commented on pull request #37893: [SPARK-40434][SS][PYTHON] Implement applyInPandasWithState in PySpark

2022-09-16 Thread GitBox


HyukjinKwon commented on PR #37893:
URL: https://github.com/apache/spark/pull/37893#issuecomment-1249226867

   Will take a close look next Monday in KST.





[GitHub] [spark] HyukjinKwon commented on pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-16 Thread GitBox


HyukjinKwon commented on PR #37710:
URL: https://github.com/apache/spark/pull/37710#issuecomment-1249269885

   I am thinking about merging this PR without making major changes if no major 
issues are found, and I will myself take a look at the important/urgent items 
very soon, according to the plan described above.
   
   I would like to be transparent here. My frank reasons for thinking this are 
as follows:
   
   - Multiple people will work intensively together on this component, but each 
individual works in a different time zone, which makes it difficult to work 
within @martin-g's branch.
   - It is difficult to manage attribution. The overall body of work will be 
very large, and I would like to avoid lumping all the coauthors under shared 
credit. A different person will sign off on and author each individual change.
   - I would like to speed things up by fully leveraging each individual fork's 
GitHub Actions resources. Currently, @martin-g's GitHub Actions resources are a 
bottleneck.
   
   Hope this plan and thinking make sense to other committers too. Are you guys 
okay with this? @dongjoon-hyun @viirya @mridulm @srowen @wangyum @sunchao 
@huaxingao (derived from SPIP voting)





[GitHub] [spark] cloud-fan commented on pull request #37743: [SPARK-40294][SQL] Fix repeat calls to `PartitionReader.hasNext` timing out

2022-09-16 Thread GitBox


cloud-fan commented on PR #37743:
URL: https://github.com/apache/spark/pull/37743#issuecomment-1249181442

   ping @richardc-db





[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-16 Thread GitBox


EnricoMi commented on code in PR #37407:
URL: https://github.com/apache/spark/pull/37407#discussion_r972869981


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -869,26 +873,55 @@ class Analyzer(override val catalogManager: 
CatalogManager)
 def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsWithPruning(
   _.containsPattern(UNPIVOT), ruleId) {
 
-  // once children and ids are resolved, we can determine values, if non 
were given
-  case up: Unpivot if up.childrenResolved && up.ids.forall(_.resolved) && 
up.values.isEmpty =>
-up.copy(values = up.child.output.diff(up.ids))
+  // once children are resolved, we can determine values from ids and vice 
versa
+  // if only either is given
+  case up: Unpivot if up.childrenResolved &&
+up.ids.exists(_.forall(_.resolved)) && up.values.isEmpty =>
+up.copy(values =
+  Some(
+up.child.output.diff(up.ids.get.flatMap(_.references))

Review Comment:
   The Scala and Python APIs both require ids. The SQL API does not allow 
specifying ids, which matches other SQL databases.






[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-16 Thread GitBox


EnricoMi commented on code in PR #37407:
URL: https://github.com/apache/spark/pull/37407#discussion_r972871859


##
python/pyspark/sql/dataframe.py:
##
@@ -3064,7 +3064,7 @@ def cube(self, *cols: "ColumnOrName") -> "GroupedData":  
# type: ignore[misc]
 
 def unpivot(
 self,
-ids: Optional[Union["ColumnOrName", List["ColumnOrName"], 
Tuple["ColumnOrName", ...]]],
+ids: Union["ColumnOrName", List["ColumnOrName"], Tuple["ColumnOrName", 
...]],

Review Comment:
   However, `pyspark.pandas.frame.melt` allows `None` for `ids`, with the 
meaning of `[]`, while `values` being `None` means magically taking all non-id 
columns:
   
   ```python
   def melt(
   self,
   id_vars: Optional[Union[Name, List[Name]]] = None,
   value_vars: Optional[Union[Name, List[Name]]] = None,
   var_name: Optional[Union[str, List[str]]] = None,
   value_name: str = "value",
   ) -> "DataFrame":
   """
   ...
   Parameters
   --
   frame : DataFrame
   id_vars : tuple, list, or ndarray, optional
   Column(s) to use as identifier variables.
   value_vars : tuple, list, or ndarray, optional
   Column(s) to unpivot. If not specified, uses all columns that
   are not set as `id_vars`.
   ```
   
   Should Python API be consistent with Scala API or PySpark Pandas API?






[GitHub] [spark] eejbyfeldt commented on a diff in pull request #37837: [SPARK-40385][SQL] Fix interpreted path for companion object constructor

2022-09-16 Thread GitBox


eejbyfeldt commented on code in PR #37837:
URL: https://github.com/apache/spark/pull/37837#discussion_r971579395


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ObjectExpressionsSuite.scala:
##
@@ -423,7 +423,7 @@ class ObjectExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
   inputTypes = Nil,
   propagateNull = false,
   dataType = ObjectType(classOf[ScroogeLikeExample]),
-  outerPointer = Some(() => outerObj))
+  outerPointer = None)

Review Comment:
   So normal inner classes (that use the outer pointer) are tested here in the 
ExpressionEncoderSuite: 
https://github.com/apache/spark/blob/ae08787f5c50e485ef4432a0c2da8b3b7290d725/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala#L229-L233
   and here in the ObjectExpressionsSuite: 
https://github.com/apache/spark/blob/ae08787f5c50e485ef4432a0c2da8b3b7290d725/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ObjectExpressionsSuite.scala#L408-L417
   so such cases are covered.
   
   Are you saying that we should add a test case for a class that only has a 
companion apply constructor? Trying to do something like that will not work in 
either this branch or master. This is because companion objects of inner 
classes are not singletons, and the codegen will fail with `"MODULE$" is 
neither a method, a field` because of this. Such a class would also behave 
slightly differently, as the apply method constructor would not take an 
outerPointer. This is because the companion object already has an outer pointer 
and that will be used when creating the inner class object. Maybe it would be 
possible to add support for such cases, but it would require more changes and 
is probably out of scope for this PR.
   But just to be clear, *both* the test and the code were wrong before this 
PR, and they were wrong in such a way that they cancelled out. And the new spec 
in ExpressionEncoderSuite in this PR, which tests at a "higher level", also 
shows that the previous code was wrong, as that test case will fail on master.
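   
   To illustrate the point about companion objects of inner classes (a hedged 
sketch, not code from the PR):
   
   ```scala
   class Outer {
     case class Inner(i: Int)
   }
   
   val o1 = new Outer
   val o2 = new Outer
   // Each Outer instance carries its own Inner companion module, so the
   // companion is not a singleton and there is no static Inner$.MODULE$
   // field for generated code to reference.
   assert(!(o1.Inner eq o2.Inner))
   ```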
   






[GitHub] [spark] HyukjinKwon commented on pull request #37710: [SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies

2022-09-16 Thread GitBox


HyukjinKwon commented on PR #37710:
URL: https://github.com/apache/spark/pull/37710#issuecomment-1249261232

   This is ready for a look now.
   
   Since the whole feature and code base would be very large, we (specifically 
I, @martin-g, @amaliujia, and @cloud-fan) discussed offline and decided to 
propose splitting this. This PR is basically the minimal working version; note 
that most of the code lines here were generated from the protobuf.
   
   SPARK-39375 is the parent JIRA, and we have described the current action 
items there.
   More JIRAs will be filed according to the plan below:
   
   ### High-level plan and design:
   
   - [High-Level Design Doc for Spark 
Connect](https://docs.google.com/document/d/17X6-P5H2522SnE-gF1BVwyildp_PDX8oXD-4l9vqQmA/edit?usp=sharing)
   - [Spark Connect API Testing 
Plan](https://docs.google.com/document/d/1n6EgS5vcmbwJUs5KGX4PzjKZVcSKd0qf0gLNZ6NFvOE/edit?usp=sharing)
   
   ### Low-level plan:
   
   **Short-term**
   - Extend test coverage for SparkConnectPlanner (right now at 76% line 
coverage)
   - Extend test coverage for Spark Connect Python client
   - Type annotations for Spark Connect Python client to re-enable mypy
   - Clean-up documentation in PySpark code for Spark Connect
   - Documentation for PySpark in README and doctests
   - Proto validation in server and/or client
   - Validation: 
 - Syntactic -> Parsing
 - Semantic -> Analysis 
   - Alternatively only return error class to clients upon failures.
   - Initial DSL framework for protobuf testing
   - Restructure the build structure to match with other components
 - Maven
 - SBT 
   
   **Long-term**
   - Testing with custom DSL 
   - `LocalRelation`
   - Better error handling for semantic failures
   - Spark and Session configurations
   - Scala Client
   - SBT incremental build and testing environment
   - DataSources
   - UDFs
   - Packaging / Releasing
   
   
   
   





[GitHub] [spark] LuciferYang opened a new pull request, #37914: [SPARK-40471][BUILD] Upgrade RoaringBitmap to 0.9.32

2022-09-16 Thread GitBox


LuciferYang opened a new pull request, #37914:
URL: https://github.com/apache/spark/pull/37914

   ### What changes were proposed in this pull request?
   This PR aims to upgrade RoaringBitmap to 0.9.32.
   
   
   
   
   ### Why are the changes needed?
   This is a bug fix version:
   
   - https://github.com/RoaringBitmap/RoaringBitmap/issues/575
   - https://github.com/RoaringBitmap/RoaringBitmap/pull/578
   
   Other changes are as follows:
   
   - https://github.com/RoaringBitmap/RoaringBitmap/compare/0.9.31...0.9.32
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Pass GitHub Actions


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #37914: [SPARK-40471][BUILD] Upgrade RoaringBitmap to 0.9.32

2022-09-16 Thread GitBox


LuciferYang commented on PR #37914:
URL: https://github.com/apache/spark/pull/37914#issuecomment-1249044491

   Will check the MapStatusesConvertBenchmark result later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-16 Thread GitBox


cloud-fan commented on code in PR #37407:
URL: https://github.com/apache/spark/pull/37407#discussion_r972808194


##
python/pyspark/sql/dataframe.py:
##
@@ -3091,12 +3098,12 @@ def unpivot(
 
 Parameters
 ----------
-ids : str, Column, tuple, list, optional
+ids : str, Column, tuple, list
 Column(s) to use as identifiers. Can be a single column or column 
name,
 or a list or tuple for multiple columns.
 values : str, Column, tuple, list, optional
 Column(s) to unpivot. Can be a single column or column name, or a 
list or tuple
-for multiple columns. If not specified or empty, uses all columns 
that
+for multiple columns. Must not be empty. If None, uses all columns 
that

Review Comment:
   ```suggestion
   for multiple columns. If specified, must not be empty. If not 
specified, uses all columns that
   ```
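
   To illustrate the two cases described in the doc (a minimal Scala sketch
using the Dataset `unpivot` API discussed here; the column names are made up,
and `spark.implicits._` is assumed in scope):
   ```scala
   val df = Seq((1, 11, 12L)).toDF("id", "m1", "m2")

   // values specified explicitly: the array must not be empty
   df.unpivot(Array($"id"), Array($"m1", $"m2"), "var", "val")

   // values not specified: all non-id columns ("m1", "m2") are unpivoted
   df.unpivot(Array($"id"), "var", "val")
   ```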



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-16 Thread GitBox


cloud-fan commented on code in PR #37407:
URL: https://github.com/apache/spark/pull/37407#discussion_r972833735


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -869,26 +873,55 @@ class Analyzer(override val catalogManager: 
CatalogManager)
 def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsWithPruning(
   _.containsPattern(UNPIVOT), ruleId) {
 
-  // once children and ids are resolved, we can determine values, if non 
were given
-  case up: Unpivot if up.childrenResolved && up.ids.forall(_.resolved) && 
up.values.isEmpty =>
-up.copy(values = up.child.output.diff(up.ids))
+  // once children are resolved, we can determine values from ids and vice 
versa
+  // if only either is given
+  case up: Unpivot if up.childrenResolved &&
+up.ids.exists(_.forall(_.resolved)) && up.values.isEmpty =>
+up.copy(values =
+  Some(
+up.child.output.diff(up.ids.get.flatMap(_.references))

Review Comment:
   Oh I see, it's to match the behavior when id columns are not specified. It's
better to make them consistent, but it's also important to match both the
Pandas behavior and the SQL behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR closed pull request #37907: [SPARK-40467][SS] Split FlatMapGroupsWithState down to multiple test suites

2022-09-16 Thread GitBox


HeartSaVioR closed pull request #37907: [SPARK-40467][SS] Split 
FlatMapGroupsWithState down to multiple test suites
URL: https://github.com/apache/spark/pull/37907


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-16 Thread GitBox


EnricoMi commented on code in PR #37407:
URL: https://github.com/apache/spark/pull/37407#discussion_r972844049


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -869,26 +873,55 @@ class Analyzer(override val catalogManager: 
CatalogManager)
 def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsWithPruning(
   _.containsPattern(UNPIVOT), ruleId) {
 
-  // once children and ids are resolved, we can determine values, if non 
were given
-  case up: Unpivot if up.childrenResolved && up.ids.forall(_.resolved) && 
up.values.isEmpty =>
-up.copy(values = up.child.output.diff(up.ids))
+  // once children are resolved, we can determine values from ids and vice 
versa
+  // if only either is given
+  case up: Unpivot if up.childrenResolved &&
+up.ids.exists(_.forall(_.resolved)) && up.values.isEmpty =>
+up.copy(values =
+  Some(
+up.child.output.diff(up.ids.get.flatMap(_.references))

Review Comment:
   Yes, this implements `.unpivot(Array, String, String)`, where no values are
given, similar to when no ids are given in SQL.
   
   This is the implementation of
   ```
  * Note: A column that is referenced by an id column expression is 
considered an id column itself.
  * For instance `$"id" * 2` references column `id`, so `id` is considered 
an id column and not a
  * value column
   ```
   
   I thought the following would be annoying:
   ```
   scala> val df = spark.range(5).select($"id", ($"id"*10).as("val"))
   scala> df.show()
   +---+---+
   | id|val|
   +---+---+
   |  0|  0|
   |  1| 10|
   |  2| 20|
   |  3| 30|
   |  4| 40|
   +---+---+
   
   df.unpivot(Array($"id" * 2), "col", "val").show()
   +--------+---+---+
   |(id * 2)|col|val|
   +--------+---+---+
   |       0| id|  0|
   |       0|val|  0|
   |       2| id|  1|
   |       2|val| 10|
   |       4| id|  2|
   |       4|val| 20|
   |       6| id|  3|
   |       6|val| 30|
   |       8| id|  4|
   |       8|val| 40|
   +--------+---+---+
   ```
   
   Of course, that id manipulation can be done before `unpivot` to 
"materialize" it as a reference:
   ```
   df.withColumn("id", $"id" * 2).unpivot(Array($"id"), "col", "val").show()
   +---+---+---+
   | id|col|val|
   +---+---+---+
   |  0|val|  0|
   |  2|val| 10|
   |  4|val| 20|
   |  6|val| 30|
   |  8|val| 40|
   +---+---+---+
   ```
   
   Happy to remove that complexity.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-16 Thread GitBox


cloud-fan commented on code in PR #37407:
URL: https://github.com/apache/spark/pull/37407#discussion_r972860753


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
##
@@ -1374,32 +1374,104 @@ case class Pivot(
   override protected def withNewChildInternal(newChild: LogicalPlan): Pivot = 
copy(child = newChild)
 }
 
+/**
+ * Expression for [[Unpivot]] for one unpivot value column (one or more 
expressions)
+ * and an optional alias. This node itself is not evaluable and resolvable.
+ * Only its children are to be resolved.
+ *
+ * @param exprs expressions to unpivot
+ * @param alias optional alias
+ */
+case class UnpivotExpr(exprs: Seq[NamedExpression], alias: Option[String]) 
extends Unevaluable {
+  override val children: Seq[NamedExpression] = exprs
+  override def dataType: DataType = throw new UnresolvedException("dataType")
+  override def nullable: Boolean = throw new UnresolvedException("nullable")
+  // override lazy val resolved = false
+
+  override protected def withNewChildrenInternal(
+  newChildren: IndexedSeq[Expression]): Expression = {
+// turn expressions into named expressions
+copy(exprs = newChildren.map {
+  case ne: NamedExpression => ne
+  case e: Expression => UnresolvedAlias(e)
+})
+  }
+}
+
 /**
  * A constructor for creating an Unpivot, which will later be converted to an 
[[Expand]]
  * during the query analysis.
  *
- * An empty values array will be replaced during analysis with all resolved 
outputs of child except
+ * Either ids or values array must be set. The ids array can be empty,
+ * the values array must not be empty if not None.
+ *
+ * A None ids array will be replaced during analysis with all resolved outputs 
of child except
+ * the values. This expansion allows to easily select all non-value columns as 
id columns.
+ *
+ * A None values array will be replaced during analysis with all resolved 
outputs of child except
  * the ids. This expansion allows to easily unpivot all non-id columns.
  *
  * @see `org.apache.spark.sql.catalyst.analysis.Analyzer.ResolveUnpivot`
  *
- * The type of the value column is derived from all value columns during 
analysis once all values
- * are resolved. All values' types have to be compatible, otherwise the result 
value column cannot
- * be assigned the individual values and an AnalysisException is thrown.
+ * Multiple columns can be unpivoted in one row by providing multiple value 
column names
+ * and the same number of unpivot value expressions:
+ * {{{
+ *   // one-dimensional value columns
+ *   Unpivot(
+ * Some(Seq("id")),
+ * Some(Seq(
+ *   (Seq("val1"), None),
+ *   (Seq("val2"), None)
+ * )),
+ * "var",
+ * Seq("val")
+ *   )
+ *
+ *   // two-dimensional value columns
+ *   Unpivot(
+ * Some(Seq("id")),
+ * Some(Seq(
+ *   (Seq("val1.1", "val1.2"), None),
+ *   (Seq("val2.1", "val2.2"), None)
+ * )),
+ * "var",
+ * Seq("val1", "val2")
+ *   )
+ * }}}
+ *
+ * The variable column will contain the name of the unpivot value while the 
value columns contain
+ * the unpivot values. Multi-dimensional unpivot values can be given `aliases`:
+ * {{{
+ *   // two-dimensional value columns with aliases
+ *   Unpivot(
+ * Some(Seq("id")),
+ * Some(Seq(
+ *   (Seq("val1.1", "val1.2"), Some("val1")),
+ *   (Seq("val2.1", "val2.2"), Some("val2"))
+ * )),
+ * "var",
+ * Seq("val1", "val2")
+ *   )
+ * }}}
+ *
+ * All "value" columns must share a least common data type. Unless they are 
the same data type,
+ * all "value" columns are cast to the nearest common data type. For instance,
+ * types `IntegerType` and `LongType` are cast to `LongType`, while 
`IntegerType` and `StringType`
+ * do not have a common data type and `unpivot` fails with an 
`AnalysisException`.
  *
  * @see 
`org.apache.spark.sql.catalyst.analysis.TypeCoercionBase.UnpivotCoercion`
  *
  * @param ids                Id columns
- * @param values             Value columns to unpivot
+ * @param values             Value column sets to unpivot with optional aliases
  * @param variableColumnName Name of the variable column
- * @param valueColumnName    Name of the value column
+ * @param valueColumnNames   Names of the value columns
  * @param child              Child operator
  */
 case class Unpivot(
-ids: Seq[NamedExpression],
-values: Seq[NamedExpression],
+ids: Option[Seq[NamedExpression]],
+values: Option[Seq[UnpivotExpr]],
 variableColumnName: String,
-valueColumnName: String,
+valueColumnNames: Seq[String],
 child: LogicalPlan) extends UnaryNode {
   override lazy val resolved = false  // Unpivot will be replaced after being 
resolved.

Review Comment:
   let's add `assert(ids.isDefined || values.isDefined)`
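
   A minimal sketch of the placement (a fragment of the class shown above, not
standalone code; the exact message is hypothetical):
   ```scala
   case class Unpivot(
       ids: Option[Seq[NamedExpression]],
       values: Option[Seq[UnpivotExpr]],
       variableColumnName: String,
       valueColumnNames: Seq[String],
       child: LogicalPlan) extends UnaryNode {
     // enforce the documented contract: either ids or values must be given
     assert(ids.isDefined || values.isDefined,
       "Unpivot requires at least one of ids and values to be set")

     override lazy val resolved = false  // Unpivot will be replaced after being resolved.
   }
   ```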



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-16 Thread GitBox


cloud-fan commented on code in PR #37407:
URL: https://github.com/apache/spark/pull/37407#discussion_r972862941


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -869,26 +873,55 @@ class Analyzer(override val catalogManager: 
CatalogManager)
 def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsWithPruning(
   _.containsPattern(UNPIVOT), ruleId) {
 
-  // once children and ids are resolved, we can determine values, if non 
were given
-  case up: Unpivot if up.childrenResolved && up.ids.forall(_.resolved) && 
up.values.isEmpty =>
-up.copy(values = up.child.output.diff(up.ids))
+  // once children are resolved, we can determine values from ids and vice 
versa
+  // if only either is given
+  case up: Unpivot if up.childrenResolved &&
+up.ids.exists(_.forall(_.resolved)) && up.values.isEmpty =>
+up.copy(values =
+  Some(
+up.child.output.diff(up.ids.get.flatMap(_.references))

Review Comment:
   For the DataFrame API, we'd better follow pandas even if it's annoying... The
SQL API is a different story. Do other databases behave the same way as this PR
when id columns are not specified?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`

2022-09-16 Thread GitBox


HeartSaVioR commented on code in PR #37905:
URL: https://github.com/apache/spark/pull/37905#discussion_r972711729


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:
##
@@ -590,7 +591,7 @@ class MicroBatchExecution(
 val newBatchesPlan = logicalPlan transform {
   // For v1 sources.
   case StreamingExecutionRelation(source, output, catalogTable) =>
-newData.get(source).map { dataPlan =>
+mutableNewData.get(source).map { dataPlan =>

Review Comment:
   While we are here, probably a less intrusive change would be moving (L594 ~
L610) to L567. After that change we wouldn't need to touch newData here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37905: [SPARK-40460][SS] Fix streaming metrics when selecting `_metadata`

2022-09-16 Thread GitBox


HeartSaVioR commented on code in PR #37905:
URL: https://github.com/apache/spark/pull/37905#discussion_r972722394


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:
##
@@ -590,7 +591,7 @@ class MicroBatchExecution(
 val newBatchesPlan = logicalPlan transform {
   // For v1 sources.
   case StreamingExecutionRelation(source, output, catalogTable) =>
-newData.get(source).map { dataPlan =>
+mutableNewData.get(source).map { dataPlan =>
   val hasFileMetadata = output.exists {

Review Comment:
   It would require the Source to indicate that the metadata column was
requested and to produce the logical plan accordingly when getBatch is called.
My understanding is that a DSv1 source has no interface for receiving
information about which columns will be referenced in the actual query.
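
   For context, a simplified sketch of the DSv1 interface in question
(`org.apache.spark.sql.execution.streaming.Source`, details elided):
   ```scala
   trait Source {
     def schema: StructType
     def getOffset: Option[Offset]
     // getBatch only receives an offset range; there is no parameter through
     // which the query could tell the source which columns (e.g. _metadata)
     // it will actually read
     def getBatch(start: Option[Offset], end: Offset): DataFrame
     def commit(end: Offset): Unit
     def stop(): Unit
   }
   ```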



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] EnricoMi commented on a diff in pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

2022-09-16 Thread GitBox


EnricoMi commented on code in PR #37407:
URL: https://github.com/apache/spark/pull/37407#discussion_r972790708


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala:
##
@@ -208,6 +208,7 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]]
 }
 
 def recursiveTransform(arg: Any): AnyRef = arg match {
+  case ue: UnpivotExpr => 
ue.withNewChildren(ue.exprs.map(transformExpression))

Review Comment:
   This is a very special treatment in a very generic method. Not nice, but it
is the best way I could find to make `mapExpressions` transform
`UnpivotExpr.exprs` as top-level expressions. See the comment on top-level
expressions in the PR discussion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   >