[jira] [Resolved] (SPARK-42908) Raise RuntimeError if SparkContext is not initialized when parsing DDL-formatted type strings

2023-03-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42908.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40534
[https://github.com/apache/spark/pull/40534]

> Raise RuntimeError if SparkContext is not initialized when parsing 
> DDL-formatted type strings
> -
>
> Key: SPARK-42908
> URL: https://issues.apache.org/jira/browse/SPARK-42908
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.4.0
>
>
> Raise RuntimeError if SparkContext is not initialized when parsing 
> DDL-formatted type strings.
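A minimal sketch of the behavior this change targets, assuming the internal PySpark helper _parse_datatype_string; the error message below is illustrative, not the exact text:

{code:python}
from pyspark.sql.types import _parse_datatype_string

# No SparkSession/SparkContext has been started in this process, so the DDL
# string cannot be handed to the JVM parser.
try:
    _parse_datatype_string("a INT, b STRING")
except RuntimeError as e:
    # With this change, a clear RuntimeError tells the user to initialize a
    # SparkContext first, instead of an obscure internal failure.
    print(e)
{code}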






[jira] [Assigned] (SPARK-42908) Raise RuntimeError if SparkContext is not initialized when parsing DDL-formatted type strings

2023-03-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42908:


Assignee: Xinrong Meng

> Raise RuntimeError if SparkContext is not initialized when parsing 
> DDL-formatted type strings
> -
>
> Key: SPARK-42908
> URL: https://issues.apache.org/jira/browse/SPARK-42908
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> Raise RuntimeError if SparkContext is not initialized when parsing 
> DDL-formatted type strings.






[jira] [Assigned] (SPARK-37677) spark on k8s, when the user want to push python3.6.6.zip to the pod , but no permission to execute

2023-03-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37677:


Assignee: jingxiong zhong

> spark on k8s, when the user want to push python3.6.6.zip to the pod , but no 
> permission to execute
> --
>
> Key: SPARK-37677
> URL: https://issues.apache.org/jira/browse/SPARK-37677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Assignee: jingxiong zhong
>Priority: Major
>
> In cluster mode, I have another question: when I unzip python3.6.6.zip in the 
> pod, the extracted files have no permission to execute. My submit command is as follows:
> {code:sh}
> spark-submit \
> --archives ./python3.6.6.zip#python3.6.6 \
> --conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
> --conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> ./examples/src/main/python/pi.py 100
> {code}






[jira] [Resolved] (SPARK-37677) spark on k8s, when the user want to push python3.6.6.zip to the pod , but no permission to execute

2023-03-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37677.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40572
[https://github.com/apache/spark/pull/40572]

> spark on k8s, when the user want to push python3.6.6.zip to the pod , but no 
> permission to execute
> --
>
> Key: SPARK-37677
> URL: https://issues.apache.org/jira/browse/SPARK-37677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Assignee: jingxiong zhong
>Priority: Major
> Fix For: 3.5.0
>
>
> In cluster mode, I have another question: when I unzip python3.6.6.zip in the 
> pod, the extracted files have no permission to execute. My submit command is as follows:
> {code:sh}
> spark-submit \
> --archives ./python3.6.6.zip#python3.6.6 \
> --conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
> --conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> ./examples/src/main/python/pi.py 100
> {code}






[jira] [Created] (SPARK-42944) Support foreachBatch() in streaming spark connect

2023-03-27 Thread Raghu Angadi (Jira)
Raghu Angadi created SPARK-42944:


 Summary: Support foreachBatch() in streaming spark connect
 Key: SPARK-42944
 URL: https://issues.apache.org/jira/browse/SPARK-42944
 Project: Spark
  Issue Type: Task
  Components: Connect, Structured Streaming
Affects Versions: 3.5.0
Reporter: Raghu Angadi


Add support for foreachBatch() in streaming Spark Connect. This might need a deep 
dive into the various complexities of running arbitrary Spark code inside a 
foreachBatch block. 
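For context, a hedged sketch of the user-facing API in question; the rate source and the sink table name are placeholders, and an existing `spark` session is assumed:

{code:python}
# foreachBatch hands every micro-batch to arbitrary user code on the driver,
# which is what makes it non-trivial to support over Spark Connect.
def write_batch(batch_df, batch_id):
    batch_df.write.mode("append").saveAsTable("events_sink")  # placeholder sink

query = (spark.readStream.format("rate").load()
         .writeStream
         .foreachBatch(write_batch)
         .start())
{code}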






[jira] [Commented] (SPARK-42943) Use LONGTEXT instead of TEXT for StringType

2023-03-27 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705791#comment-17705791
 ] 

Snoot.io commented on SPARK-42943:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/40573

> Use LONGTEXT instead of TEXT for StringType
> ---
>
> Key: SPARK-42943
> URL: https://issues.apache.org/jira/browse/SPARK-42943
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Priority: Major
>
> MysqlDataTruncation will be thrown if the string length exceeds 65535
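A hedged reproduction sketch of that failure, assuming a reachable MySQL instance; the JDBC URL, table, and credentials are placeholders:

{code:python}
# StringType currently maps to TEXT in the MySQL dialect, which caps values at
# 65535 bytes, so a longer value fails with MysqlDataTruncation on write.
df = spark.createDataFrame([("x" * 70000,)], "payload STRING")
(df.write.format("jdbc")
   .option("url", "jdbc:mysql://localhost:3306/testdb")  # placeholder
   .option("dbtable", "payload_table")                   # placeholder
   .option("user", "app").option("password", "secret")   # placeholders
   .mode("append")
   .save())
{code}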






[jira] [Commented] (SPARK-42937) Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled

2023-03-27 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705790#comment-17705790
 ] 

Snoot.io commented on SPARK-42937:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/40569

> Join with subquery in condition can fail with wholestage codegen and adaptive 
> execution disabled
> 
>
> Key: SPARK-42937
> URL: https://issues.apache.org/jira/browse/SPARK-42937
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The below left outer join gets an error:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
> (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
> (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
> as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, 
> value9, value10);
> create or replace temp view v2 as
> select * from values
> (1, 2),
> (3, 8),
> (7, 9)
> as v2(a, b);
> create or replace temp view v3 as
> select * from values
> (3),
> (8)
> as v3(col1);
> set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100
> set spark.sql.adaptive.enabled=false;
> select *
> from v1
> left outer join v2
> on key = a
> and key in (select col1 from v3);
> {noformat}
> The join fails during predicate codegen:
> {noformat}
> 23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to 
> interpreter mode
> java.lang.IllegalArgumentException: requirement failed: input[0, int, false] 
> IN subquery#34 has not finished
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278)
>   at scala.collection.immutable.List.map(List.scala:293)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40)
> {noformat}
> It fails again after fallback to interpreter mode:
> {noformat}
> 23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
> java.lang.IllegalArgumentException: requirement failed: input[0, int, false] 
> IN subquery#34 has not finished
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205)
> {noformat}
> Both the predicate codegen and the evaluation fail for the same reason: 
> {{PlanSubqueries}} creates {{InSubqueryExec}} with {{shouldBroadcast=false}}. 
> The driver waits for the subquery to finish, but it's the executor that uses 
> the 

[jira] [Commented] (SPARK-41233) High-order function: array_prepend

2023-03-27 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705789#comment-17705789
 ] 

Snoot.io commented on SPARK-41233:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/40563

> High-order function: array_prepend
> --
>
> Key: SPARK-41233
> URL: https://issues.apache.org/jira/browse/SPARK-41233
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.5.0
>
>
> Refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html
> 1. About the data type validation:
> In Snowflake’s array_append, array_prepend and array_insert functions, the 
> element data type does not need to match the data type of the existing 
> elements in the array.
> In Spark, however, we want to apply the same data type validation as 
> array_remove.
> 2. About the NULL handling:
> Currently, SparkSQL, SnowSQL and PostgreSQL deal with NULL values in 
> different ways.
> The existing functions array_contains, array_position and array_remove in 
> SparkSQL handle NULL in this way: if the input array and/or element is NULL, 
> they return NULL. array_prepend, however, should break from this behavior.
> We should implement the NULL handling in array_prepend in this way:
> 2.1. If the array is NULL, return NULL;
> 2.2. If the array is not NULL and the element is NULL, prepend the NULL value to 
> the array.
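A short sketch of the proposed semantics; the results in the comments restate rules 2.1 and 2.2 above and are not verified output:

{code:python}
# Rule 2.1: NULL array -> NULL result.
spark.sql("SELECT array_prepend(CAST(NULL AS ARRAY<INT>), 1)").show()

# Rule 2.2: non-NULL array, NULL element -> the NULL element is prepended,
# e.g. [NULL, 2, 3].
spark.sql("SELECT array_prepend(array(2, 3), CAST(NULL AS INT))").show()
{code}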






[jira] [Resolved] (SPARK-42922) Use SecureRandom, instead of Random in security sensitive contexts

2023-03-27 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42922.
--
Fix Version/s: 3.3.3
   3.4.1
   3.5.0
 Assignee: Mridul Muralidharan
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/40568

> Use SecureRandom, instead of Random in security sensitive contexts
> --
>
> Key: SPARK-42922
> URL: https://issues.apache.org/jira/browse/SPARK-42922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.3, 3.3.2, 3.4.0, 3.5.0
>Reporter: Mridul Muralidharan
>Assignee: Mridul Muralidharan
>Priority: Minor
> Fix For: 3.3.3, 3.4.1, 3.5.0
>
>
> Most uses of Random in Spark are either in test cases or in places where we 
> need a repeatable pseudo-random number.
> The following are usages where moving from Random to SecureRandom would be 
> useful:
> a) HttpAuthUtils.createCookieToken
> b) ThriftHttpServlet.RAN






[jira] [Updated] (SPARK-42922) Use SecureRandom, instead of Random in security sensitive contexts

2023-03-27 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-42922:
-
Priority: Minor  (was: Major)

> Use SecureRandom, instead of Random in security sensitive contexts
> --
>
> Key: SPARK-42922
> URL: https://issues.apache.org/jira/browse/SPARK-42922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.3, 3.3.2, 3.4.0, 3.5.0
>Reporter: Mridul Muralidharan
>Priority: Minor
>
> Most uses of Random in Spark are either in test cases or in places where we 
> need a repeatable pseudo-random number.
> The following are usages where moving from Random to SecureRandom would be 
> useful:
> a) HttpAuthUtils.createCookieToken
> b) ThriftHttpServlet.RAN






[jira] [Assigned] (SPARK-41876) Implement DataFrame `toLocalIterator`

2023-03-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41876:


Assignee: Takuya Ueshin

> Implement DataFrame `toLocalIterator`
> -
>
> Key: SPARK-41876
> URL: https://issues.apache.org/jira/browse/SPARK-41876
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Takuya Ueshin
>Priority: Major
>
> {code:java}
> schema = StructType(
> [StructField("i", StringType(), True), StructField("j", IntegerType(), 
> True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", 
> StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
> AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by <lambda> {code}






[jira] [Resolved] (SPARK-41876) Implement DataFrame `toLocalIterator`

2023-03-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41876.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40570
[https://github.com/apache/spark/pull/40570]

> Implement DataFrame `toLocalIterator`
> -
>
> Key: SPARK-41876
> URL: https://issues.apache.org/jira/browse/SPARK-41876
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.4.0
>
>
> {code:java}
> schema = StructType(
> [StructField("i", StringType(), True), StructField("j", IntegerType(), 
> True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", 
> StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
> AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py",
>  line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by <lambda> {code}






[jira] [Commented] (SPARK-42895) ValueError when invoking any session operations on a stopped Spark session

2023-03-27 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705782#comment-17705782
 ] 

Snoot.io commented on SPARK-42895:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/40536

> ValueError when invoking any session operations on a stopped Spark session
> --
>
> Key: SPARK-42895
> URL: https://issues.apache.org/jira/browse/SPARK-42895
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Priority: Major
>
> If a remote Spark session is stopped, trying to invoke any session operations 
> will result in a ValueError. For example:
>  
> {code:java}
> spark.stop()
> spark.sql("select 1")
> ValueError: Cannot invoke RPC: Channel closed!
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   ...
>     return e.code() == grpc.StatusCode.UNAVAILABLE
> AttributeError: 'ValueError' object has no attribute 'code'{code}
>  






[jira] [Created] (SPARK-42943) Use LONGTEXT instead of TEXT for StringType

2023-03-27 Thread Kent Yao (Jira)
Kent Yao created SPARK-42943:


 Summary: Use LONGTEXT instead of TEXT for StringType
 Key: SPARK-42943
 URL: https://issues.apache.org/jira/browse/SPARK-42943
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Kent Yao


MysqlDataTruncation will be thrown if the string length exceeds 65535






[jira] [Commented] (SPARK-37677) spark on k8s, when the user want to push python3.6.6.zip to the pod , but no permission to execute

2023-03-27 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705781#comment-17705781
 ] 

Snoot.io commented on SPARK-37677:
--

User 'smallzhongfeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40572

> spark on k8s, when the user want to push python3.6.6.zip to the pod , but no 
> permission to execute
> --
>
> Key: SPARK-37677
> URL: https://issues.apache.org/jira/browse/SPARK-37677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: jingxiong zhong
>Priority: Major
>
> In cluster mode, I have another question: when I unzip python3.6.6.zip in the 
> pod, the extracted files have no permission to execute. My submit command is as follows:
> {code:sh}
> spark-submit \
> --archives ./python3.6.6.zip#python3.6.6 \
> --conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
> --conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> ./examples/src/main/python/pi.py 100
> {code}






[jira] [Updated] (SPARK-38200) [SQL] Spark JDBC Savemode Supports Upsert

2023-03-27 Thread melin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

melin updated SPARK-38200:
--
Description: 
Upsert SQL differs across databases; most databases support MERGE SQL:

sqlserver merge into sql : 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/sqlserver/SqlServerDialect.java]

mysql: 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/mysql/MysqlDialect.java]

oracle merge into sql : 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/oracle/OracleDialect.java]

postgres: 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/psql/PostgresDialect.java]

postgres merge into sql : 
[https://www.postgresql.org/docs/current/sql-merge.html]

db2 merge into sql : 
[https://www.ibm.com/docs/en/db2-for-zos/12?topic=statements-merge]

derby merge into sql: 
[https://db.apache.org/derby/docs/10.14/ref/rrefsqljmerge.html]

h2 merge into sql : 
[https://www.tutorialspoint.com/h2_database/h2_database_merge.htm]

[~beliefer] [~cloud_fan] 

  was:
When writing data into a relational database, data duplication needs to be 
considered. Both mysql and postgres support upsert syntax.

mysql:
{code:java}
replace into t(id, update_time) values(1, now()); {code}
pg:
{code:java}
INSERT INTO %s (id,name,data_time,remark) VALUES ( ?,?,?,? ) ON CONFLICT 
(id,name) DO UPDATE SET 
id=excluded.id,name=excluded.name,data_time=excluded.data_time,remark=excluded.remark
   {code}


> [SQL] Spark JDBC Savemode Supports Upsert
> -
>
> Key: SPARK-38200
> URL: https://issues.apache.org/jira/browse/SPARK-38200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: melin
>Priority: Major
>
> Upsert SQL differs across databases; most databases support MERGE SQL:
> sqlserver merge into sql : 
> [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/sqlserver/SqlServerDialect.java]
> mysql: 
> [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/mysql/MysqlDialect.java]
> oracle merge into sql : 
> [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/oracle/OracleDialect.java]
> postgres: 
> [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/psql/PostgresDialect.java]
> postgres merge into sql : 
> [https://www.postgresql.org/docs/current/sql-merge.html]
> db2 merge into sql : 
> [https://www.ibm.com/docs/en/db2-for-zos/12?topic=statements-merge]
> derby merge into sql: 
> [https://db.apache.org/derby/docs/10.14/ref/rrefsqljmerge.html]
> h2 merge into sql : 
> [https://www.tutorialspoint.com/h2_database/h2_database_merge.htm]
> [~beliefer] [~cloud_fan] 
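For reference, a hedged sketch of today's JDBC write path, which offers only the existing SaveModes (append, overwrite, error/errorifexists, ignore); the connection details are placeholders and no upsert mode exists yet:

{code:python}
df = spark.createDataFrame([(1, "a")], "id INT, name STRING")

(df.write.format("jdbc")
   .option("url", "jdbc:postgresql://localhost:5432/testdb")  # placeholder
   .option("dbtable", "target_table")                         # placeholder
   .mode("append")  # re-running this duplicates rows; the ticket asks for an upsert mode
   .save())
{code}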






[jira] (SPARK-38200) [SQL] Spark JDBC Savemode Supports Upsert

2023-03-27 Thread melin (Jira)


[ https://issues.apache.org/jira/browse/SPARK-38200 ]


melin deleted comment on SPARK-38200:
---

was (Author: melin):
upsert sql for different databases, Most databases support merge sql:

sqlserver merge into sql : 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/sqlserver/SqlServerDialect.java]

mysql: 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/mysql/MysqlDialect.java]

oracle merge into sql : 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/oracle/OracleDialect.java]

postgres: 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/psql/PostgresDialect.java]

postgres merg into sql : 
[https://www.postgresql.org/docs/current/sql-merge.html]

db2 merge into sql : 
[https://www.ibm.com/docs/en/db2-for-zos/12?topic=statements-merge]

derby merge into sql: 
[https://db.apache.org/derby/docs/10.14/ref/rrefsqljmerge.html]

he merg into sql : 
[https://www.tutorialspoint.com/h2_database/h2_database_merge.htm]

[~beliefer] [~cloud_fan] 

> [SQL] Spark JDBC Savemode Supports Upsert
> -
>
> Key: SPARK-38200
> URL: https://issues.apache.org/jira/browse/SPARK-38200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: melin
>Priority: Major
>
> When writing data into a relational database, data duplication needs to be 
> considered. Both mysql and postgres support upsert syntax.
> mysql:
> {code:java}
> replace into t(id, update_time) values(1, now()); {code}
> pg:
> {code:java}
> INSERT INTO %s (id,name,data_time,remark) VALUES ( ?,?,?,? ) ON CONFLICT 
> (id,name) DO UPDATE SET 
> id=excluded.id,name=excluded.name,data_time=excluded.data_time,remark=excluded.remark
>    {code}






[jira] [Updated] (SPARK-38200) [SQL] Spark JDBC Savemode Supports Upsert

2023-03-27 Thread melin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

melin updated SPARK-38200:
--
Summary: [SQL] Spark JDBC Savemode Supports Upsert  (was: [SQL] Spark JDBC 
Savemode Supports replace)

> [SQL] Spark JDBC Savemode Supports Upsert
> -
>
> Key: SPARK-38200
> URL: https://issues.apache.org/jira/browse/SPARK-38200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: melin
>Priority: Major
>
> When writing data into a relational database, data duplication needs to be 
> considered. Both mysql and postgres support upsert syntax.
> mysql:
> {code:java}
> replace into t(id, update_time) values(1, now()); {code}
> pg:
> {code:java}
> INSERT INTO %s (id,name,data_time,remark) VALUES ( ?,?,?,? ) ON CONFLICT 
> (id,name) DO UPDATE SET 
> id=excluded.id,name=excluded.name,data_time=excluded.data_time,remark=excluded.remark
>    {code}






[jira] [Resolved] (SPARK-41630) Support lateral column alias in Project code path

2023-03-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-41630.
-
Fix Version/s: 3.4.0
 Assignee: Xinyi Yu
   Resolution: Fixed

> Support lateral column alias in Project code path
> -
>
> Key: SPARK-41630
> URL: https://issues.apache.org/jira/browse/SPARK-41630
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xinyi Yu
>Assignee: Xinyi Yu
>Priority: Major
> Fix For: 3.4.0
>
>
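For readers unfamiliar with the feature, a brief illustrative query (the employees table and its columns are made up): a lateral column alias lets a later item in the same SELECT list reference an alias defined earlier in that list.

{code:python}
# 'total' is defined in the SELECT list and reused in the same list, which the
# Project code path resolves after this change.
spark.sql("""
    SELECT salary + bonus AS total,
           total * 0.1    AS tax
    FROM employees
""").show()
{code}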







[jira] [Comment Edited] (SPARK-38200) [SQL] Spark JDBC Savemode Supports replace

2023-03-27 Thread melin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705765#comment-17705765
 ] 

melin edited comment on SPARK-38200 at 3/28/23 3:00 AM:


Upsert SQL differs across databases; most databases support MERGE SQL:

sqlserver merge into sql : 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/sqlserver/SqlServerDialect.java]

mysql: 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/mysql/MysqlDialect.java]

oracle merge into sql : 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/oracle/OracleDialect.java]

postgres: 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/psql/PostgresDialect.java]

postgres merge into sql : 
[https://www.postgresql.org/docs/current/sql-merge.html]

db2 merge into sql : 
[https://www.ibm.com/docs/en/db2-for-zos/12?topic=statements-merge]

derby merge into sql: 
[https://db.apache.org/derby/docs/10.14/ref/rrefsqljmerge.html]

h2 merge into sql : 
[https://www.tutorialspoint.com/h2_database/h2_database_merge.htm]

[~beliefer] [~cloud_fan] 


was (Author: melin):
upsert sql for different databases, Most databases support merge sql:

sqlserver merge into sql : 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/sqlserver/SqlServerDialect.java]

mysql: 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/mysql/MysqlDialect.java]

oracle merge into sql : 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/oracle/OracleDialect.java]

postgres: 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/psql/PostgresDialect.java]

db2 merge into sql : 
[https://www.ibm.com/docs/en/db2-for-zos/12?topic=statements-merge]

derby merge into sql: 
[https://db.apache.org/derby/docs/10.14/ref/rrefsqljmerge.html]

he merg into sql : 
[https://www.tutorialspoint.com/h2_database/h2_database_merge.htm]

[~beliefer] [~cloud_fan] 

> [SQL] Spark JDBC Savemode Supports replace
> --
>
> Key: SPARK-38200
> URL: https://issues.apache.org/jira/browse/SPARK-38200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: melin
>Priority: Major
>
> When writing data into a relational database, data duplication needs to be 
> considered. Both mysql and postgres support upsert syntax.
> mysql:
> {code:java}
> replace into t(id, update_time) values(1, now()); {code}
> pg:
> {code:java}
> INSERT INTO %s (id,name,data_time,remark) VALUES ( ?,?,?,? ) ON CONFLICT 
> (id,name) DO UPDATE SET 
> id=excluded.id,name=excluded.name,data_time=excluded.data_time,remark=excluded.remark
>    {code}






[jira] [Commented] (SPARK-38200) [SQL] Spark JDBC Savemode Supports replace

2023-03-27 Thread melin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705765#comment-17705765
 ] 

melin commented on SPARK-38200:
---

Upsert SQL differs across databases; most databases support MERGE SQL:

sqlserver merge into sql : 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/sqlserver/SqlServerDialect.java]

mysql: 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/mysql/MysqlDialect.java]

oracle merge into sql : 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/oracle/OracleDialect.java]

postgres: 
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/psql/PostgresDialect.java]

db2 merge into sql : 
[https://www.ibm.com/docs/en/db2-for-zos/12?topic=statements-merge]

derby merge into sql: 
[https://db.apache.org/derby/docs/10.14/ref/rrefsqljmerge.html]

h2 merge into sql : 
[https://www.tutorialspoint.com/h2_database/h2_database_merge.htm]

[~beliefer] [~cloud_fan] 

> [SQL] Spark JDBC Savemode Supports replace
> --
>
> Key: SPARK-38200
> URL: https://issues.apache.org/jira/browse/SPARK-38200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: melin
>Priority: Major
>
> When writing data into a relational database, data duplication needs to be 
> considered. Both mysql and postgres support upsert syntax.
> mysql:
> {code:java}
> replace into t(id, update_time) values(1, now()); {code}
> pg:
> {code:java}
> INSERT INTO %s (id,name,data_time,remark) VALUES ( ?,?,?,? ) ON CONFLICT 
> (id,name) DO UPDATE SET 
> id=excluded.id,name=excluded.name,data_time=excluded.data_time,remark=excluded.remark
>    {code}






[jira] [Updated] (SPARK-42935) Optimize shuffle for union spark plan

2023-03-27 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-42935:

Fix Version/s: (was: 3.5.0)

> Optimize shuffle for union spark plan
> 
>
> Key: SPARK-42935
> URL: https://issues.apache.org/jira/browse/SPARK-42935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jeff Min
>Priority: Major
>  Labels: pull-request-available
>
> Union plan does not take full advantage of its children's output partitionings 
> when the output partitioning can't match the parent plan's required distribution. 
> For example, table1 and table2 are both bucketed tables with bucket column id and 
> bucket number 100. We run a row_number window function after unioning the two 
> tables.
> {code:sql}
> create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 
> 100 BUCKETS;
> insert into table1 values(1, "s1");
> insert into table1 values(2, "s2");
> ​
> create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 
> 100 BUCKETS;
> insert into table2 values(1, "s3");
> ​
> set spark.sql.shuffle.partitions=100;
> explain select *, row_number() over(partition by id order by name desc) 
> id_row_number from (select * from table1 union all select * from 
> table2);{code}
> The physical plan is 
> {code:bash}
> AdaptiveSparkPlan isFinalPlan=false
> +- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
> id_row_number#28, id#35, name#36 DESC NULLS LAST
>   +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0
>      +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, 
> [plan_id=88]
>         +- Union
>            :- FileScan csv spark_catalog.default.table1id#35,name#36
>            +- FileScan csv spark_catalog.default.table2id#37,name#38 {code}
>  
> Although the two tables are bucketed by the id column, there is still an exchange 
> plan after the union. The reason is that the union plan's output partitioning is null.
> We can introduce a new idea to optimize the exchange plan:
>  # First, introduce a new RDD that consists of parent RDDs with the same 
> partition count. The ith partition corresponds to the ith partition of each 
> parent RDD.
>  # Then, push the required distribution down to the union plan's children. If a 
> child's output partitioning matches the required distribution, we can remove that 
> child's shuffle.
> After doing this, the physical plan no longer contains an exchange (shuffle) plan:
> {code:bash}
> AdaptiveSparkPlan isFinalPlan=false
> +- Window row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
> id_row_number#0, id#7, name#8 DESC NULLS LAST
>   +- Sort id#7 ASC NULLS FIRST, name#8 DESC NULLS LAST, false, 0
>      +- UnionZip ClusteredDistribution(ArrayBuffer(id#7),false,None), 
> ClusteredDistribution(ArrayBuffer(id#9),false,None), hashpartitioning(id#7, 
> 200)
>         :- FileScan csv spark_catalog.default.table1id#7,name#8
>         +- FileScan csv spark_catalog.default.table2id#9,name#10 {code}
>  
>  






[jira] [Updated] (SPARK-42935) Optimize shuffle for union spark plan

2023-03-27 Thread Jeff Min (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Min updated SPARK-42935:
-
Labels: pull-request-available  (was: )

> Optimize shuffle for union spark plan
> 
>
> Key: SPARK-42935
> URL: https://issues.apache.org/jira/browse/SPARK-42935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jeff Min
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Union plan does not take full advantage of its children's output partitionings 
> when the output partitioning can't match the parent plan's required distribution. 
> For example, table1 and table2 are both bucketed tables with bucket column id and 
> bucket number 100. We run a row_number window function after unioning the two 
> tables.
> {code:sql}
> create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 
> 100 BUCKETS;
> insert into table1 values(1, "s1");
> insert into table1 values(2, "s2");
> ​
> create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 
> 100 BUCKETS;
> insert into table2 values(1, "s3");
> ​
> set spark.sql.shuffle.partitions=100;
> explain select *, row_number() over(partition by id order by name desc) 
> id_row_number from (select * from table1 union all select * from 
> table2);{code}
> The physical plan is 
> {code:bash}
> AdaptiveSparkPlan isFinalPlan=false
> +- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
> id_row_number#28, id#35, name#36 DESC NULLS LAST
>   +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0
>      +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, 
> [plan_id=88]
>         +- Union
>            :- FileScan csv spark_catalog.default.table1id#35,name#36
>            +- FileScan csv spark_catalog.default.table2id#37,name#38 {code}
>  
> Although the two tables are bucketed by the id column, there is still an exchange 
> plan after the union. The reason is that the union plan's output partitioning is null.
> We can introduce a new idea to optimize the exchange plan:
>  # First, introduce a new RDD that consists of parent RDDs with the same 
> partition count. The ith partition corresponds to the ith partition of each 
> parent RDD.
>  # Then, push the required distribution down to the union plan's children. If a 
> child's output partitioning matches the required distribution, we can remove that 
> child's shuffle.
> After doing this, the physical plan no longer contains an exchange (shuffle) plan:
> {code:bash}
> AdaptiveSparkPlan isFinalPlan=false
> +- Window row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, 
> specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
> id_row_number#0, id#7, name#8 DESC NULLS LAST
>   +- Sort id#7 ASC NULLS FIRST, name#8 DESC NULLS LAST, false, 0
>      +- UnionZip ClusteredDistribution(ArrayBuffer(id#7),false,None), 
> ClusteredDistribution(ArrayBuffer(id#9),false,None), hashpartitioning(id#7, 
> 200)
>         :- FileScan csv spark_catalog.default.table1id#7,name#8
>         +- FileScan csv spark_catalog.default.table2id#9,name#10 {code}
>  
>  






[jira] [Updated] (SPARK-42935) Optimize shuffle for union spark plan

2023-03-27 Thread Jeff Min (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Min updated SPARK-42935:
-
Description: 
Union plan does not take full advantage of its children's output partitionings 
when the output partitioning can't match the parent plan's required distribution. 
For example, table1 and table2 are both bucketed tables with bucket column id and 
bucket number 100. We run a row_number window function after unioning the two 
tables.
{code:sql}
create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table1 values(1, "s1");
insert into table1 values(2, "s2");
​
create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table2 values(1, "s3");
​
set spark.sql.shuffle.partitions=100;
explain select *, row_number() over(partition by id order by name desc) 
id_row_number from (select * from table1 union all select * from table2);{code}
The physical plan is 
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#28, id#35, name#36 DESC NULLS LAST
  +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0
     +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88]
        +- Union
           :- FileScan csv spark_catalog.default.table1id#35,name#36
           +- FileScan csv spark_catalog.default.table2id#37,name#38 {code}
 

Although the two tables are bucketed by the id column, there is still an exchange 
plan after the union. The reason is that the union plan's output partitioning is null.

We can introduce a new idea to optimize the exchange plan:
 # First, introduce a new RDD that consists of parent RDDs with the same 
partition count. The ith partition corresponds to the ith partition of each 
parent RDD.
 # Then, push the required distribution down to the union plan's children. If a 
child's output partitioning matches the required distribution, we can remove that 
child's shuffle.

After doing this, the physical plan no longer contains an exchange (shuffle) plan:
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#0, id#7, name#8 DESC NULLS LAST
  +- Sort id#7 ASC NULLS FIRST, name#8 DESC NULLS LAST, false, 0
     +- UnionZip ClusteredDistribution(ArrayBuffer(id#7),false,None), 
ClusteredDistribution(ArrayBuffer(id#9),false,None), hashpartitioning(id#7, 200)
        :- FileScan csv spark_catalog.default.table1id#7,name#8
        +- FileScan csv spark_catalog.default.table2id#9,name#10 {code}
 

 

  was:
Union plan does not take full advantage of children plan output partitionings 
when output partitoning can't match parent plan's required distribution. For 
example, Table1 and table2 are all bucketed table with bucket column id and 
bucket number 100. We will do row_number window function after union the two 
tables.
{code:sql}
create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table1 values(1, "s1");
insert into table1 values(2, "s2");
​
create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table2 values(1, "s3");
​
set spark.sql.shuffle.partitions=100;
explain select *, row_number() over(partition by id order by name desc) 
id_row_number from (select * from table1 union all select * from table2);{code}
The physical plan is 
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#28, id#35, name#36 DESC NULLS LAST
  +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0
     +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88]
        +- Union
           :- FileScan csv spark_catalog.default.table1id#35,name#36
           +- FileScan csv spark_catalog.default.table2id#37,name#38 {code}
 

Although the two tables are bucketed by id column, there's still a exchange 
plan after union.The reason is that union plan's output partitioning is null.

We can indroduce a new idea to optimize exchange plan:
 # First introduce a new RDD, it consists of parent rdds that has the same 
partition size. The ith parttition corresponds to ith partition of each parent 
rdd.

 # Then push the required distribution to union plan's children. If any child 
output partitioning matches the required distribution , we can reduce this 
child shuffle operation.

After doing these, the physical plan is
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, 

[jira] [Created] (SPARK-42942) Support coalesce table cache stage partitions

2023-03-27 Thread XiDuo You (Jira)
XiDuo You created SPARK-42942:
-

 Summary: Support coalesce table cache stage partitions
 Key: SPARK-42942
 URL: https://issues.apache.org/jira/browse/SPARK-42942
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: XiDuo You


If people cache a plan that holds some small partitions, we can coalesce those 
partitions, just as we already do for shuffle partitions.
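A hedged illustration of the situation described above; the numbers are arbitrary:

{code:python}
# A cached plan with many tiny partitions: all 200 partitions are kept in the
# cache as-is today; the proposal is to coalesce them, similar to what AQE
# already does for small shuffle partitions.
df = spark.range(1000).repartition(200)   # ~5 rows per partition
df.cache()
df.count()                                # materializes 200 small cached partitions
print(df.rdd.getNumPartitions())
{code}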






[jira] [Created] (SPARK-42941) Add support for streaming listener.

2023-03-27 Thread Raghu Angadi (Jira)
Raghu Angadi created SPARK-42941:


 Summary: Add support for streaming listener.
 Key: SPARK-42941
 URL: https://issues.apache.org/jira/browse/SPARK-42941
 Project: Spark
  Issue Type: Task
  Components: Connect, Structured Streaming
Affects Versions: 3.5.0
Reporter: Raghu Angadi


Add support for the streaming listener in Python. 

This likely requires a design doc to hash out the details. 
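For reference, a hedged sketch of the existing (non-Connect) Python listener API that this task would expose over Spark Connect; the method bodies are illustrative only:

{code:python}
from pyspark.sql.streaming import StreamingQueryListener

class MyListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        print("query started:", event.id)

    def onQueryProgress(self, event):
        print("rows/sec:", event.progress.processedRowsPerSecond)

    def onQueryTerminated(self, event):
        print("query terminated:", event.id)

spark.streams.addListener(MyListener())
{code}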






[jira] [Created] (SPARK-42940) Session management support streaming connect

2023-03-27 Thread Raghu Angadi (Jira)
Raghu Angadi created SPARK-42940:


 Summary: Session management support streaming connect
 Key: SPARK-42940
 URL: https://issues.apache.org/jira/browse/SPARK-42940
 Project: Spark
  Issue Type: Task
  Components: Connect, Structured Streaming
Affects Versions: 3.5.0
Reporter: Raghu Angadi


Add session support for streaming jobs. 

E.g., a session should stay alive while a streaming query is alive. 

It might defer more complex scenarios, like what happens when the client loses 
track of the session. Such semantics would be handled as part of the session 
semantics across Spark Connect (including streaming). 






[jira] [Created] (SPARK-42939) Streaming connect core API

2023-03-27 Thread Raghu Angadi (Jira)
Raghu Angadi created SPARK-42939:


 Summary: Streaming connect core API
 Key: SPARK-42939
 URL: https://issues.apache.org/jira/browse/SPARK-42939
 Project: Spark
  Issue Type: Task
  Components: Connect, Structured Streaming
Affects Versions: 3.5.0
Reporter: Raghu Angadi


This adds support for the majority of the streaming API, including query 
interactions. 

This does not include advanced features like custom streaming listeners and 
`foreachBatch()`. 
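A hedged sketch of the kind of query interaction this covers, assuming a `spark` session created against a Connect endpoint; the rate source and memory sink are placeholders:

{code:python}
q = (spark.readStream.format("rate").load()
     .writeStream.format("memory").queryName("rates")
     .start())

print(q.status)        # query interactions: status, lastProgress, stop, ...
print(q.lastProgress)
q.stop()
{code}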

 

 






[jira] [Commented] (SPARK-42937) Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled

2023-03-27 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705702#comment-17705702
 ] 

Bruce Robbins commented on SPARK-42937:
---

PR at https://github.com/apache/spark/pull/40569

> Join with subquery in condition can fail with wholestage codegen and adaptive 
> execution disabled
> 
>
> Key: SPARK-42937
> URL: https://issues.apache.org/jira/browse/SPARK-42937
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The below left outer join gets an error:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
> (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
> (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
> as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, 
> value9, value10);
> create or replace temp view v2 as
> select * from values
> (1, 2),
> (3, 8),
> (7, 9)
> as v2(a, b);
> create or replace temp view v3 as
> select * from values
> (3),
> (8)
> as v3(col1);
> set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100
> set spark.sql.adaptive.enabled=false;
> select *
> from v1
> left outer join v2
> on key = a
> and key in (select col1 from v3);
> {noformat}
> The join fails during predicate codegen:
> {noformat}
> 23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to 
> interpreter mode
> java.lang.IllegalArgumentException: requirement failed: input[0, int, false] 
> IN subquery#34 has not finished
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278)
>   at scala.collection.immutable.List.map(List.scala:293)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40)
> {noformat}
> It fails again after fallback to interpreter mode:
> {noformat}
> 23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
> java.lang.IllegalArgumentException: requirement failed: input[0, int, false] 
> IN subquery#34 has not finished
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205)
> {noformat}
> Both the predicate codegen and the evaluation fail for the same reason: 
> {{PlanSubqueries}} creates {{InSubqueryExec}} with {{shouldBroadcast=false}}. 
> The driver waits for the subquery to finish, but it's the executor that uses 
> the results of the subquery (for predicate codegen 

[jira] [Resolved] (SPARK-42382) Upgrade `cyclonedx-maven-plugin` to 2.7.5

2023-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42382.
---
Resolution: Won't Fix

We will revisit this when we can find a valid combination of Maven and 
cyclonedx-maven-plugin versions in the future.

> Upgrade `cyclonedx-maven-plugin` to 2.7.5
> -
>
> Key: SPARK-42382
> URL: https://issues.apache.org/jira/browse/SPARK-42382
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> [https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.4]
> [https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.5]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42380) Upgrade maven to 3.9.0

2023-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42380.
---
Resolution: Won't Fix

We will revisit this when we are able to find a valid combination of Maven and 
cyclonedx-maven-plugin again in the future.

> Upgrade maven to 3.9.0
> --
>
> Key: SPARK-42380
> URL: https://issues.apache.org/jira/browse/SPARK-42380
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> {code:java}
> [ERROR] An error occurred attempting to read POM
> org.codehaus.plexus.util.xml.pull.XmlPullParserException: UTF-8 BOM plus xml 
> decl of ISO-8859-1 is incompatible (position: START_DOCUMENT seen  version="1.0" encoding="ISO-8859-1"... @1:42) 
> at org.codehaus.plexus.util.xml.pull.MXParser.parseXmlDeclWithVersion 
> (MXParser.java:3423)
> at org.codehaus.plexus.util.xml.pull.MXParser.parseXmlDecl 
> (MXParser.java:3345)
> at org.codehaus.plexus.util.xml.pull.MXParser.parsePI (MXParser.java:3197)
> at org.codehaus.plexus.util.xml.pull.MXParser.parseProlog 
> (MXParser.java:1828)
> at org.codehaus.plexus.util.xml.pull.MXParser.nextImpl 
> (MXParser.java:1757)
> at org.codehaus.plexus.util.xml.pull.MXParser.next (MXParser.java:1375)
> at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read 
> (MavenXpp3Reader.java:3940)
> at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read 
> (MavenXpp3Reader.java:612)
> at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read 
> (MavenXpp3Reader.java:627)
> at org.cyclonedx.maven.BaseCycloneDxMojo.readPom 
> (BaseCycloneDxMojo.java:759)
> at org.cyclonedx.maven.BaseCycloneDxMojo.readPom 
> (BaseCycloneDxMojo.java:746)
> at org.cyclonedx.maven.BaseCycloneDxMojo.retrieveParentProject 
> (BaseCycloneDxMojo.java:694)
> at org.cyclonedx.maven.BaseCycloneDxMojo.getClosestMetadata 
> (BaseCycloneDxMojo.java:524)
> at org.cyclonedx.maven.BaseCycloneDxMojo.convert 
> (BaseCycloneDxMojo.java:481)
> at org.cyclonedx.maven.CycloneDxMojo.execute (CycloneDxMojo.java:70)
> at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo 
> (DefaultBuildPluginManager.java:126)
> at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 
> (MojoExecutor.java:342)
> at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute 
> (MojoExecutor.java:330)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:213)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:175)
> at org.apache.maven.lifecycle.internal.MojoExecutor.access$000 
> (MojoExecutor.java:76)
> at org.apache.maven.lifecycle.internal.MojoExecutor$1.run 
> (MojoExecutor.java:163)
> at org.apache.maven.plugin.DefaultMojosExecutionStrategy.execute 
> (DefaultMojosExecutionStrategy.java:39)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:160)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
> (LifecycleModuleBuilder.java:105)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
> (LifecycleModuleBuilder.java:73)
> at 
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build
>  (SingleThreadedBuilder.java:53)
> at org.apache.maven.lifecycle.internal.LifecycleStarter.execute 
> (LifecycleStarter.java:118)
> at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:260)
> at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:172)
> at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:100)
> at org.apache.maven.cli.MavenCli.execute (MavenCli.java:821)
> at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:270)
> at org.apache.maven.cli.MavenCli.main (MavenCli.java:192)
> at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke 
> (NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke 
> (DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke (Method.java:498)
> at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced 
> (Launcher.java:282)
> at org.codehaus.plexus.classworlds.launcher.Launcher.launch 
> (Launcher.java:225)
> at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode 
> (Launcher.java:406)
> at org.codehaus.plexus.classworlds.launcher.Launcher.main 
> (Launcher.java:347)
> {code}
> An existing problem



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional 

[jira] [Created] (SPARK-42938) Structured Streaming with Spark Connect

2023-03-27 Thread Raghu Angadi (Jira)
Raghu Angadi created SPARK-42938:


 Summary: Structured Streaming with Spark Connect
 Key: SPARK-42938
 URL: https://issues.apache.org/jira/browse/SPARK-42938
 Project: Spark
  Issue Type: Epic
  Components: Connect, Structured Streaming
Affects Versions: 3.5.0
Reporter: Raghu Angadi


Spark-connect (SPARK-39375) should support structured streaming. 

This Epic covers various tasks for adding streaming support.

The current focus is on the *Python API*; unless otherwise stated, the tasks 
below are for the Python API. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42906) Replace a starting digit with `x` in resource name prefix

2023-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42906:
--
Summary: Replace a starting digit with `x` in resource name prefix  (was: 
Replace a starting digit with x in resource name prefix)

> Replace a starting digit with `x` in resource name prefix
> -
>
> Key: SPARK-42906
> URL: https://issues.apache.org/jira/browse/SPARK-42906
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.3, 3.3.2, 3.4.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.2.4, 3.3.3, 3.4.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42906) Replace a starting digit with x in resource name prefix

2023-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42906:
--
Summary: Replace a starting digit with x in resource name prefix  (was: 
Resource name prefix should start with an alphabetic character)

> Replace a starting digit with x in resource name prefix
> ---
>
> Key: SPARK-42906
> URL: https://issues.apache.org/jira/browse/SPARK-42906
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.3, 3.3.2, 3.4.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.2.4, 3.3.3, 3.4.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42906) Resource name prefix should start with an alphabetic character

2023-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42906:
-

Assignee: Cheng Pan

> Resource name prefix should start with an alphabetic character
> --
>
> Key: SPARK-42906
> URL: https://issues.apache.org/jira/browse/SPARK-42906
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.3, 3.3.2, 3.4.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42906) Resource name prefix should start with an alphabetic character

2023-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42906.
---
Fix Version/s: 3.3.3
   3.2.4
   3.4.1
   Resolution: Fixed

Issue resolved by pull request 40533
[https://github.com/apache/spark/pull/40533]

> Resource name prefix should start with an alphabetic character
> --
>
> Key: SPARK-42906
> URL: https://issues.apache.org/jira/browse/SPARK-42906
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.3, 3.3.2, 3.4.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.3.3, 3.2.4, 3.4.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42906) Resource name prefix should start with an alphabetic character

2023-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42906:
--
Affects Version/s: 3.3.2
   3.4.0

> Resource name prefix should start with an alphabetic character
> --
>
> Key: SPARK-42906
> URL: https://issues.apache.org/jira/browse/SPARK-42906
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.3, 3.3.2, 3.4.0
>Reporter: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42937) Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled

2023-03-27 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-42937:
--
Affects Version/s: 3.4.0

> Join with subquery in condition can fail with wholestage codegen and adaptive 
> execution disabled
> 
>
> Key: SPARK-42937
> URL: https://issues.apache.org/jira/browse/SPARK-42937
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The left outer join below fails with an error:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
> (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
> (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
> as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, 
> value9, value10);
> create or replace temp view v2 as
> select * from values
> (1, 2),
> (3, 8),
> (7, 9)
> as v2(a, b);
> create or replace temp view v3 as
> select * from values
> (3),
> (8)
> as v3(col1);
> set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100
> set spark.sql.adaptive.enabled=false;
> select *
> from v1
> left outer join v2
> on key = a
> and key in (select col1 from v3);
> {noformat}
> The join fails during predicate codegen:
> {noformat}
> 23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to 
> interpreter mode
> java.lang.IllegalArgumentException: requirement failed: input[0, int, false] 
> IN subquery#34 has not finished
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278)
>   at scala.collection.immutable.List.map(List.scala:293)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40)
> {noformat}
> It fails again after fallback to interpreter mode:
> {noformat}
> 23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
> java.lang.IllegalArgumentException: requirement failed: input[0, int, false] 
> IN subquery#34 has not finished
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205)
> {noformat}
> Both the predicate codegen and the evaluation fail for the same reason: 
> {{PlanSubqueries}} creates {{InSubqueryExec}} with {{shouldBroadcast=false}}. 
> The driver waits for the subquery to finish, but it's the executor that uses 
> the results of the subquery (for predicate codegen or evaluation). Because 
> {{shouldBroadcast}} is set to 

[jira] [Updated] (SPARK-42937) Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled

2023-03-27 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-42937:
--
Affects Version/s: 3.3.2

> Join with subquery in condition can fail with wholestage codegen and adaptive 
> execution disabled
> 
>
> Key: SPARK-42937
> URL: https://issues.apache.org/jira/browse/SPARK-42937
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.5.0
>Reporter: Bruce Robbins
>Priority: Major
>
> The left outer join below fails with an error:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
> (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
> (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
> as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, 
> value9, value10);
> create or replace temp view v2 as
> select * from values
> (1, 2),
> (3, 8),
> (7, 9)
> as v2(a, b);
> create or replace temp view v3 as
> select * from values
> (3),
> (8)
> as v3(col1);
> set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100
> set spark.sql.adaptive.enabled=false;
> select *
> from v1
> left outer join v2
> on key = a
> and key in (select col1 from v3);
> {noformat}
> The join fails during predicate codegen:
> {noformat}
> 23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to 
> interpreter mode
> java.lang.IllegalArgumentException: requirement failed: input[0, int, false] 
> IN subquery#34 has not finished
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278)
>   at scala.collection.immutable.List.map(List.scala:293)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51)
>   at 
> org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40)
> {noformat}
> It fails again after fallback to interpreter mode:
> {noformat}
> 23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
> java.lang.IllegalArgumentException: requirement failed: input[0, int, false] 
> IN subquery#34 has not finished
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
>   at 
> org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205)
> {noformat}
> Both the predicate codegen and the evaluation fail for the same reason: 
> {{PlanSubqueries}} creates {{InSubqueryExec}} with {{shouldBroadcast=false}}. 
> The driver waits for the subquery to finish, but it's the executor that uses 
> the results of the subquery (for predicate codegen or evaluation). Because 
> {{shouldBroadcast}} is set to false, the 

[jira] [Created] (SPARK-42937) Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled

2023-03-27 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-42937:
-

 Summary: Join with subquery in condition can fail with wholestage 
codegen and adaptive execution disabled
 Key: SPARK-42937
 URL: https://issues.apache.org/jira/browse/SPARK-42937
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Bruce Robbins


The left outer join below fails with an error:
{noformat}
create or replace temp view v1 as
select * from values
(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
(3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, 
value9, value10);

create or replace temp view v2 as
select * from values
(1, 2),
(3, 8),
(7, 9)
as v2(a, b);

create or replace temp view v3 as
select * from values
(3),
(8)
as v3(col1);

set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100
set spark.sql.adaptive.enabled=false;

select *
from v1
left outer join v2
on key = a
and key in (select col1 from v3);
{noformat}
The join fails during predicate codegen:
{noformat}
23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to 
interpreter mode
java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN 
subquery#34 has not finished
at scala.Predef$.require(Predef.scala:281)
at 
org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
at 
org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156)
at 
org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201)
at scala.Option.getOrElse(Option.scala:189)
at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278)
at scala.collection.immutable.List.map(List.scala:293)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33)
at 
org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73)
at 
org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70)
at 
org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51)
at 
org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86)
at 
org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146)
at 
org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40)
{noformat}
It fails again after fallback to interpreter mode:
{noformat}
23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN 
subquery#34 has not finished
at scala.Predef$.require(Predef.scala:281)
at 
org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
at 
org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
at 
org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146)
at 
org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146)
at 
org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205)
{noformat}
Both the predicate codegen and the evaluation fail for the same reason: 
{{PlanSubqueries}} creates {{InSubqueryExec}} with {{shouldBroadcast=false}}. 
The driver waits for the subquery to finish, but it's the executor that uses 
the results of the subquery (for predicate codegen or evaluation). Because 
{{shouldBroadcast}} is set to false, the result is stored in a transient field 
({{InSubqueryExec#result}}), so the result of the subquery is not serialized 
when the {{InSubqueryExec}} instance is sent to the executor.
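
For illustration only, here is a minimal, self-contained sketch of that failure 
mode. The names below ({{SubqueryResultHolder}}, {{TransientFieldDemo}}) are 
hypothetical and are not Spark's classes; the sketch only shows how a 
{{@transient}} field is dropped by Java serialization, so the copy deserialized 
on the "executor" side fails the same kind of {{require(...)}} guard seen in the 
stack traces above.
{code:scala}
import java.io._

// Hypothetical stand-in (not Spark's class) for an expression that caches a
// subquery result in a @transient field, as described for InSubqueryExec above.
class SubqueryResultHolder extends Serializable {
  // Dropped by Java serialization: the deserialized copy sees null.
  @transient private var result: Array[Int] = _

  def markFinished(values: Array[Int]): Unit = { result = values }

  // Mirrors the require(...) guard shown in the stack traces.
  def prepareResult(): Array[Int] = {
    require(result != null, "subquery has not finished")
    result
  }
}

object TransientFieldDemo {
  // Round-trip through Java serialization, which is what happens when the
  // driver ships a task (and the expressions it references) to an executor.
  private def roundTrip(obj: AnyRef): AnyRef = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(obj)
    out.close()
    new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray)).readObject()
  }

  def main(args: Array[String]): Unit = {
    val onDriver = new SubqueryResultHolder
    onDriver.markFinished(Array(3, 8))   // the driver did wait for the subquery
    onDriver.prepareResult()             // succeeds on the driver
    val onExecutor = roundTrip(onDriver).asInstanceOf[SubqueryResultHolder]
    onExecutor.prepareResult()           // IllegalArgumentException: requirement failed: subquery has not finished
  }
}
{code}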

When wholestage codegen is enabled, the predicate codegen happens on the 
driver, so the subquery's result is available. When adaptive execution is 
enabled, {{PlanAdaptiveSubqueries}} always sets {{shouldBroadcast=true}}, so 
the 

[jira] [Resolved] (SPARK-36999) Document the command ALTER TABLE RECOVER PARTITIONS

2023-03-27 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-36999.
--
Resolution: Duplicate

> Document the command ALTER TABLE RECOVER PARTITIONS
> ---
>
> Key: SPARK-36999
> URL: https://issues.apache.org/jira/browse/SPARK-36999
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>  Labels: starter
>
> Update the page 
> [https://spark.apache.org/docs/3.1.2/sql-ref-syntax-ddl-alter-table.html,] 
> and document the command ALTER TABLE RECOVER PARTITIONS



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42936) Unresolved having at the end of analysis when using with LCA with the having clause that can be resolved directly by its child Aggregate

2023-03-27 Thread Xinyi Yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705481#comment-17705481
 ] 

Xinyi Yu commented on SPARK-42936:
--

Created PR fixing the issue: https://github.com/apache/spark/pull/40558

> Unresolved having at the end of analysis when using with LCA with the having 
> clause that can be resolved directly by its child Aggregate
> 
>
> Key: SPARK-42936
> URL: https://issues.apache.org/jira/browse/SPARK-42936
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Xinyi Yu
>Priority: Major
>
> {code:java}
> select sum(value1) as total_1, total_1
> from values(1, 'name', 100, 50) AS data(id, name, value1, value2)
> having total_1 > 0
> SparkException: [INTERNAL_ERROR] Found the unresolved operator: 
> 'UnresolvedHaving (total_1#353L > cast(0 as bigint)) {code}
> To trigger the issue, the having condition needs to be (i.e., resolvable by) an 
> attribute in the select list.
> Without the LCA {{total_1}}, the query works fine.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42936) Unresolved having at the end of analysis when using with LCA with the having clause that can be resolved directly by its child Aggregate

2023-03-27 Thread Xinyi Yu (Jira)
Xinyi Yu created SPARK-42936:


 Summary: Unresolved having at the end of analysis when using with 
LCA with the having clause that can be resolved directly by its child Aggregate
 Key: SPARK-42936
 URL: https://issues.apache.org/jira/browse/SPARK-42936
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Xinyi Yu


{code:java}
select sum(value1) as total_1, total_1
from values(1, 'name', 100, 50) AS data(id, name, value1, value2)
having total_1 > 0

SparkException: [INTERNAL_ERROR] Found the unresolved operator: 
'UnresolvedHaving (total_1#353L > cast(0 as bigint)) {code}
To trigger the issue, the having condition needs to be (i.e., resolvable by) an 
attribute in the select list.
Without the LCA {{total_1}}, the query works fine.
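
As a point of comparison only, here is a hedged sketch of the workaround implied 
by the last sentence (it assumes an existing SparkSession named {{spark}}): 
repeating the aggregate in HAVING instead of referencing the lateral column 
alias avoids the error above.
{code:scala}
// Hedged workaround sketch: same data and aggregate as above, but the HAVING
// condition uses sum(value1) directly rather than the alias total_1, so no
// lateral column alias resolution is involved and the query runs.
val ok = spark.sql(
  """select sum(value1) as total_1
    |from values (1, 'name', 100, 50) AS data(id, name, value1, value2)
    |having sum(value1) > 0""".stripMargin)
ok.show()  // one row: total_1 = 100
{code}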



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42934) Testing OrcEncryptionSuite using maven is always skipped

2023-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42934.
---
Fix Version/s: 3.3.3
   3.2.4
   3.4.1
   Resolution: Fixed

Issue resolved by pull request 40566
[https://github.com/apache/spark/pull/40566]

> Testing OrcEncryptionSuite using maven is always skipped
> 
>
> Key: SPARK-42934
> URL: https://issues.apache.org/jira/browse/SPARK-42934
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.2.3, 3.3.2, 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.3, 3.2.4, 3.4.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42934) Testing OrcEncryptionSuite using maven is always skipped

2023-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42934:
--
Affects Version/s: 3.3.2
   3.2.3
   3.4.0
   (was: 3.5.0)

> Testing OrcEncryptionSuite using maven is always skipped
> 
>
> Key: SPARK-42934
> URL: https://issues.apache.org/jira/browse/SPARK-42934
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.2.3, 3.3.2, 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42934) Testing OrcEncryptionSuite using maven is always skipped

2023-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42934:
-

Assignee: Yang Jie

> Testing OrcEncryptionSuite using maven is always skipped
> 
>
> Key: SPARK-42934
> URL: https://issues.apache.org/jira/browse/SPARK-42934
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.2.3, 3.3.2, 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42934) Testing OrcEncryptionSuite using maven is always skipped

2023-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42934:
--
Priority: Minor  (was: Major)

> Testing OrcEncryptionSuite using maven is always skipped
> 
>
> Key: SPARK-42934
> URL: https://issues.apache.org/jira/browse/SPARK-42934
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42930) Change the access scope of `ProtobufSerDe` related implementations to `private[spark]`

2023-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42930:
-

Assignee: Yang Jie

> Change the access scope of `ProtobufSerDe` related implementations to 
> `private[spark]`
> --
>
> Key: SPARK-42930
> URL: https://issues.apache.org/jira/browse/SPARK-42930
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42930) Change the access scope of `ProtobufSerDe` related implementations to `private[spark]`

2023-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42930.
---
Fix Version/s: 3.4.1
   Resolution: Fixed

Issue resolved by pull request 40560
[https://github.com/apache/spark/pull/40560]

> Change the access scope of `ProtobufSerDe` related implementations to 
> `private[spark]`
> --
>
> Key: SPARK-42930
> URL: https://issues.apache.org/jira/browse/SPARK-42930
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.4.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42913) Upgrade Hadoop to 3.3.5

2023-03-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42913.
---
Fix Version/s: 3.5.0
 Assignee: Yang Jie
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/39124

> Upgrade Hadoop to 3.3.5
> ---
>
> Key: SPARK-42913
> URL: https://issues.apache.org/jira/browse/SPARK-42913
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>
> * 
> [https://hadoop.apache.org/docs/r3.3.5/hadoop-project-dist/hadoop-common/release/3.3.5/CHANGELOG.3.3.5.html]
>  * 
> [https://hadoop.apache.org/docs/r3.3.5/hadoop-project-dist/hadoop-common/release/3.3.5/RELEASENOTES.3.3.5.html]
>  * https://hadoop.apache.org/docs/r3.3.5/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42935) Optimize shuffle for union spark plan

2023-03-27 Thread Jeff Min (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Min updated SPARK-42935:
-
Description: 
Union plan does not take full advantage of its children's output partitionings 
when the output partitioning can't match the parent plan's required distribution. 
For example, table1 and table2 are both bucketed tables with bucket column id and 
bucket number 100. We apply a row_number window function after unioning the two 
tables.
{code:sql}
create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table1 values(1, "s1");
insert into table1 values(2, "s2");
​
create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table2 values(1, "s3");
​
set spark.sql.shuffle.partitions=100;
explain select *, row_number() over(partition by id order by name desc) 
id_row_number from (select * from table1 union all select * from table2);{code}
The physical plan is 
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#28, id#35, name#36 DESC NULLS LAST
  +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0
     +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88]
        +- Union
           :- FileScan csv spark_catalog.default.table1id#35,name#36
           +- FileScan csv spark_catalog.default.table2id#37,name#38 {code}
 

Although the two tables are bucketed by the id column, there is still an exchange 
after the union. The reason is that the union plan's output partitioning is null.

We can introduce a new idea to optimize the exchange plan:
 # First, introduce a new RDD that consists of parent RDDs with the same number 
of partitions. Its ith partition corresponds to the ith partition of each parent 
RDD (a minimal sketch of such an RDD follows the plan below).

 # Then, push the required distribution down to the union plan's children. If a 
child's output partitioning matches the required distribution, we can eliminate 
the shuffle for that child.

After doing these, the physical plan is
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#0, id#7, name#8 DESC NULLS LAST
  +- Sort id#7 ASC NULLS FIRST, name#8 DESC NULLS LAST, false, 0
     +- UnionZip ClusteredDistribution(ArrayBuffer(id#7),false,None), 
ClusteredDistribution(ArrayBuffer(id#9),false,None), hashpartitioning(id#7, 200)
        :- FileScan csv spark_catalog.default.table1id#7,name#8
        +- FileScan csv spark_catalog.default.table2id#9,name#10 {code}
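
Below is a hedged, illustrative sketch of step 1 only. The names here 
({{ZippedUnionRDD}}, {{ZippedUnionPartition}}) are made up for this sketch and 
are not part of the proposal or of Spark; the point is just that an RDD whose 
ith partition concatenates the ith partitions of same-sized parents lines the 
children up without any shuffle.
{code:scala}
import scala.reflect.ClassTag

import org.apache.spark.{OneToOneDependency, Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// The ith output partition only needs to remember its index; every parent
// contributes its own ith partition.
class ZippedUnionPartition(val index: Int) extends Partition

// Illustrative only: all parents must have the same number of partitions
// (e.g. two tables bucketed by id into 100 buckets). The ith partition of this
// RDD is the concatenation of the ith partition of every parent, so the union
// preserves the children's partitioning and needs no shuffle to line them up.
class ZippedUnionRDD[T: ClassTag](sc: SparkContext, parents: Seq[RDD[T]])
  extends RDD[T](sc, parents.map(new OneToOneDependency(_))) {

  require(parents.map(_.getNumPartitions).distinct.size == 1,
    "all parents must have the same number of partitions")

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](parents.head.getNumPartitions)(i => new ZippedUnionPartition(i))

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    parents.iterator.flatMap(p => p.iterator(p.partitions(split.index), context))
}
{code}
With such an RDD in place, step 2 amounts to pushing the window's required 
distribution (a ClusteredDistribution over id) down to the union's children; 
since both bucketed scans already satisfy it, the Exchange from the first plan 
can be dropped, which is what the UnionZip plan above shows.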
 

 

  was:
Union plan does not take full advantage of its children's output partitionings 
when the output partitioning can't match the parent plan's required distribution. 
For example, table1 and table2 are both bucketed tables with bucket column id and 
bucket number 100. We apply a row_number window function after unioning the two 
tables.
{code:sql}
create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table1 values(1, "s1");
insert into table1 values(2, "s2");
​
create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table2 values(1, "s3");
​
set spark.sql.shuffle.partitions=100;
explain select *, row_number() over(partition by id order by name desc) 
id_row_number from (select * from table1 union all select * from table2);{code}

The physical plan is 
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#28, id#35, name#36 DESC NULLS LAST
  +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0
     +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88]
        +- Union
           :- FileScan csv spark_catalog.default.table1id#35,name#36
           +- FileScan csv spark_catalog.default.table2id#37,name#38 {code}
 

Although the two tables are bucketed by the id column, there is still an exchange 
after the union. The reason is that the union plan's output partitioning is null.

We can introduce a new idea to optimize the exchange plan:
 # First, introduce a new RDD that consists of parent RDDs with the same number 
of partitions. Its ith partition corresponds to the ith partition of each parent 
RDD.

 # Then, push the required distribution down to the union plan's children. If a 
child's output partitioning matches the required distribution, we can eliminate 
the shuffle for that child.

After doing these, the physical plan is
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 

[jira] [Updated] (SPARK-42935) Optimize shuffle for union spark plan

2023-03-27 Thread Jeff Min (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Min updated SPARK-42935:
-
Description: 
Union plan does not take full advantage of its children's output partitionings 
when the output partitioning can't match the parent plan's required distribution. 
For example, table1 and table2 are both bucketed tables with bucket column id and 
bucket number 100. We apply a row_number window function after unioning the two 
tables.
{code:sql}
create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table1 values(1, "s1");
insert into table1 values(2, "s2");
​
create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table2 values(1, "s3");
​
set spark.sql.shuffle.partitions=100;
explain select *, row_number() over(partition by id order by name desc) 
id_row_number from (select * from table1 union all select * from table2);{code}

The physical plan is 
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#28, id#35, name#36 DESC NULLS LAST
  +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0
     +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88]
        +- Union
           :- FileScan csv spark_catalog.default.table1id#35,name#36
           +- FileScan csv spark_catalog.default.table2id#37,name#38 {code}
 

Although the two tables are bucketed by the id column, there is still an exchange 
after the union. The reason is that the union plan's output partitioning is null.

We can introduce a new idea to optimize the exchange plan:
 # First, introduce a new RDD that consists of parent RDDs with the same number 
of partitions. Its ith partition corresponds to the ith partition of each parent 
RDD.

 # Then, push the required distribution down to the union plan's children. If a 
child's output partitioning matches the required distribution, we can eliminate 
the shuffle for that child.

After doing these, the physical plan is
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#0, id#7, name#8 DESC NULLS LAST
  +- Sort id#7 ASC NULLS FIRST, name#8 DESC NULLS LAST, false, 0
     +- UnionZip ClusteredDistribution(ArrayBuffer(id#7),false,None), 
ClusteredDistribution(ArrayBuffer(id#9),false,None), hashpartitioning(id#7, 200)
        :- FileScan csv spark_catalog.default.table1id#7,name#8
        +- FileScan csv spark_catalog.default.table2id#9,name#10 {code}


 

 

  was:
Union plan does not take full advantage of its children's output partitionings 
when the output partitioning can't match the parent plan's required distribution. 
For example, table1 and table2 are both bucketed tables with bucket column id and 
bucket number 100. We apply a row_number window function after unioning the two 
tables.
{code:sql}
create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table1 values(1, "s1");
insert into table1 values(2, "s2");
​
create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table2 values(1, "s3");
​
set spark.sql.shuffle.partitions=100;
explain select *, row_number() over(partition by id order by name desc) 
id_row_number from (select * from table1 union all select * from table2);{code}
The physical plan is 
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#28, id#35, name#36 DESC NULLS LAST
  +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0
     +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88]
        +- Union
           :- FileScan csv spark_catalog.default.table1id#35,name#36
           +- FileScan csv spark_catalog.default.table2id#37,name#38 {code}
 

Although the two tables are bucketed by the id column, there is still an exchange 
after the union. The reason is that the union plan's output partitioning is null.

We can 

[jira] [Updated] (SPARK-42935) Optimize shuffle for union spark plan

2023-03-27 Thread Jeff Min (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Min updated SPARK-42935:
-
Description: 
Union plan does not take full advantage of its children's output partitionings 
when the output partitioning can't match the parent plan's required distribution. 
For example, table1 and table2 are both bucketed tables with bucket column id and 
bucket number 100. We apply a row_number window function after unioning the two 
tables.
{code:sql}
create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table1 values(1, "s1");
insert into table1 values(2, "s2");
​
create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table2 values(1, "s3");
​
set spark.sql.shuffle.partitions=100;
explain select *, row_number() over(partition by id order by name desc) 
id_row_number from (select * from table1 union all select * from table2);{code}

The physical plan is 
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#28, id#35, name#36 DESC NULLS LAST
  +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0
     +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88]
        +- Union
           :- FileScan csv spark_catalog.default.table1id#35,name#36
           +- FileScan csv spark_catalog.default.table2id#37,name#38 {code}
 

Although the two tables are bucketed by the id column, there is still an exchange 
after the union. The reason is that the union plan's output partitioning is null.

We can introduce a new idea to optimize the exchange plan:
 # First, introduce a new RDD that consists of parent RDDs with the same number 
of partitions. Its ith partition corresponds to the ith partition of each parent 
RDD.

 # Then, push the required distribution down to the union plan's children. If a 
child's output partitioning matches the required distribution, we can eliminate 
the shuffle for that child.

After doing these, the physical plan is
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#0, id#7, name#8 DESC NULLS LAST
  +- Sort id#7 ASC NULLS FIRST, name#8 DESC NULLS LAST, false, 0
     +- UnionZip ClusteredDistribution(ArrayBuffer(id#7),false,None), 
ClusteredDistribution(ArrayBuffer(id#9),false,None), hashpartitioning(id#7, 200)
        :- FileScan csv spark_catalog.default.table1id#7,name#8
        +- FileScan csv spark_catalog.default.table2id#9,name#10 {code}


 

 

  was:
Union plan does not take full advantage of its children's output partitionings 
when the output partitioning can't match the parent plan's required distribution. 
For example, table1 and table2 are both bucketed tables with bucket column id and 
bucket number 100. We apply a row_number window function after unioning the two 
tables.
{code:sql}
create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table1 values(1, "s1");
insert into table1 values(2, "s2");
​
create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table2 values(1, "s3");
​
set spark.sql.shuffle.partitions=100;
explain select *, row_number() over(partition by id order by name desc) 
id_row_number from (select * from table1 union all select * from table2);{code}

The physical plan is 
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#28, id#35, name#36 DESC NULLS LAST
  +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0
     +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88]
        +- Union
           :- FileScan csv spark_catalog.default.table1id#35,name#36
           +- FileScan csv spark_catalog.default.table2id#37,name#38 {code}
 

Although the two tables are bucketed by the id column, there is still an exchange 
after the union. The reason is that the union plan's output partitioning is null.

We can introduce a new idea to optimize the exchange plan:
 # First, introduce a new RDD that consists of parent RDDs with the same number 
of partitions. Its ith partition corresponds to the ith partition of each parent 
RDD.

 # Then push the required distribution to union plan's children. If any child 
output partitioning matches 

[jira] [Updated] (SPARK-42935) Optimize shuffle for union spark plan

2023-03-27 Thread Jeff Min (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Min updated SPARK-42935:
-
Description: 
Union plan does not take full advantage of its children's output partitionings 
when the output partitioning can't match the parent plan's required distribution. 
For example, table1 and table2 are both bucketed tables with bucket column id and 
bucket number 100. We apply a row_number window function after unioning the two 
tables.
{code:sql}
create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table1 values(1, "s1");
insert into table1 values(2, "s2");
​
create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table2 values(1, "s3");
​
set spark.sql.shuffle.partitions=100;
explain select *, row_number() over(partition by id order by name desc) 
id_row_number from (select * from table1 union all select * from table2);{code}
The physical plan is 
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#28, id#35, name#36 DESC NULLS LAST
  +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0
     +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88]
        +- Union
           :- FileScan csv spark_catalog.default.table1id#35,name#36
           +- FileScan csv spark_catalog.default.table2id#37,name#38 {code}
 

Although the two tables are bucketed by the id column, there is still an exchange 
after the union. The reason is that the union plan's output partitioning is null.

We can introduce a new idea to optimize the exchange plan:
 # First, introduce a new RDD that consists of parent RDDs with the same number 
of partitions. Its ith partition corresponds to the ith partition of each parent 
RDD.

 # Then, push the required distribution down to the union plan's children. If a 
child's output partitioning matches the required distribution, we can eliminate 
the shuffle for that child.

After doing these, the physical plan is
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#0, id#7, name#8 DESC NULLS LAST
  +- Sort id#7 ASC NULLS FIRST, name#8 DESC NULLS LAST, false, 0
     +- UnionZip ClusteredDistribution(ArrayBuffer(id#7),false,None), 
ClusteredDistribution(ArrayBuffer(id#9),false,None), hashpartitioning(id#7, 200)
        :- FileScan csv spark_catalog.default.table1id#7,name#8
        +- FileScan csv spark_catalog.default.table2id#9,name#10 {code}


 

 

  was:
Union plan does not take full advantage of its children's output partitionings 
when the output partitioning can't match the parent plan's required distribution. 
For example, table1 and table2 are both bucketed tables with bucket column id and 
bucket number 100. We apply a row_number window function after unioning the two 
tables.
{code:sql}
create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table1 values(1, "s1");
insert into table1 values(2, "s2");
​
create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table2 values(1, "s3");
​
set spark.sql.shuffle.partitions=100;
explain select *, row_number() over(partition by id order by name desc) 
id_row_number from (select * from table1 union all select * from table2);{code}
The physical plan is 
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#28, id#35, name#36 DESC NULLS LAST
  +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0
     +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88]
        +- Union
           :- FileScan csv spark_catalog.default.table1id#35,name#36
           +- FileScan csv spark_catalog.default.table2id#37,name#38 {code}
 

Although the two tables are bucketed by the id column, there is still an exchange 
after the union. The reason is that the union plan's output partitioning is null.

We can 

[jira] [Commented] (SPARK-36999) Document the command ALTER TABLE RECOVER PARTITIONS

2023-03-27 Thread Juanvi Soler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705379#comment-17705379
 ] 

Juanvi Soler commented on SPARK-36999:
--

https://github.com/apache/spark/pull/35239

> Document the command ALTER TABLE RECOVER PARTITIONS
> ---
>
> Key: SPARK-36999
> URL: https://issues.apache.org/jira/browse/SPARK-36999
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>  Labels: starter
>
> Update the page 
> [https://spark.apache.org/docs/3.1.2/sql-ref-syntax-ddl-alter-table.html,] 
> and document the command ALTER TABLE RECOVER PARTITIONS
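
For reference, a minimal, hypothetical usage example of the command to be documented (the table name is illustrative, and an active {{SparkSession}} named {{spark}} is assumed):
{code:scala}
// Re-discover partition directories that were added to storage outside of Spark,
// e.g. files written directly under .../my_partitioned_table/dt=2023-03-27/.
spark.sql("ALTER TABLE my_partitioned_table RECOVER PARTITIONS")
{code}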



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36999) Document the command ALTER TABLE RECOVER PARTITIONS

2023-03-27 Thread Juanvi Soler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705376#comment-17705376
 ] 

Juanvi Soler commented on SPARK-36999:
--

[~maxgekk] can this be closed?

> Document the command ALTER TABLE RECOVER PARTITIONS
> ---
>
> Key: SPARK-36999
> URL: https://issues.apache.org/jira/browse/SPARK-36999
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>  Labels: starter
>
> Update the page 
> [https://spark.apache.org/docs/3.1.2/sql-ref-syntax-ddl-alter-table.html,] 
> and document the command ALTER TABLE RECOVER PARTITIONS



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42935) Optimize shuffle for union spark plan

2023-03-27 Thread Jeff Min (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Min updated SPARK-42935:
-
Description: 
Union plan does not take full advantage of its children's output partitionings 
when the output partitioning can't match the parent plan's required distribution. For 
example, table1 and table2 are both bucketed tables with bucket column id and 
bucket number 100. We apply a row_number window function after unioning the two 
tables.
{code:sql}
create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table1 values(1, "s1");
insert into table1 values(2, "s2");
​
create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table2 values(1, "s3");
​
set spark.sql.shuffle.partitions=100;
explain select *, row_number() over(partition by id order by name desc) 
id_row_number from (select * from table1 union all select * from table2);{code}
The physical plan is 
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window [row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS id_row_number#28], [id#35], [name#36 DESC NULLS LAST]
   +- Sort [id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST], false, 0
      +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88]
         +- Union
            :- FileScan csv spark_catalog.default.table1[id#35,name#36]
            +- FileScan csv spark_catalog.default.table2[id#37,name#38]
{code}
 

Although the two tables are bucketed by the id column, there is still an exchange 
plan after the union. The reason is that the union plan's output partitioning is null.

We can introduce a new approach to optimize away the exchange:
 # First, introduce a new RDD that consists of parent RDDs with the same number of 
partitions. The i-th partition of the new RDD corresponds to the i-th partition of 
each parent RDD.

 # Then, push the required distribution down to the union plan's children. If a 
child's output partitioning already satisfies the required distribution, that 
child's shuffle can be removed.

After these changes, the physical plan becomes:
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window [row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS id_row_number#0], [id#7], [name#8 DESC NULLS LAST]
   +- Sort [id#7 ASC NULLS FIRST, name#8 DESC NULLS LAST], false, 0
      +- UnionZip [ClusteredDistribution(ArrayBuffer(id#7),false,None), ClusteredDistribution(ArrayBuffer(id#9),false,None)], hashpartitioning(id#7, 200)
         :- FileScan csv spark_catalog.default.table1[id#7,name#8]
         +- FileScan csv spark_catalog.default.table2[id#9,name#10]
{code}


 

 

  was:
Union plan does not take full advantage of its children's output partitionings 
when the output partitioning can't match the parent plan's required distribution. For 
example, table1 and table2 are both bucketed tables with bucket column id and 
bucket number 100. We apply a row_number window function after unioning the two 
tables.
create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table1 values(1, "s1");
insert into table1 values(2, "s2");
​
create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table2 values(1, "s3");
​
set spark.sql.shuffle.partitions=100;
explain select *, row_number() over(partition by id order by name desc) 
id_row_number from (select * from table1 union all select * from table2);
The physical plan is
AdaptiveSparkPlan isFinalPlan=false
+- Window [row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#28], [id#35], [name#36 DESC NULLS LAST]
   +- Sort [id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST], false, 0
      +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, 
[plan_id=88]
         +- Union
            :- FileScan csv spark_catalog.default.table1[id#35,name#36]
            +- FileScan csv spark_catalog.default.table2[id#37,name#38]
Although the two tables are bucketed by the id column, there is still an exchange 
plan after the union. The reason is that the union plan's output partitioning is null.

We can introduce a new approach to optimize away the exchange:
 # First, introduce a new RDD that consists of parent RDDs with the same number of 
partitions. The i-th partition of the new RDD corresponds to the i-th partition of 
each parent RDD.

 # Then push the required distribution to union plan's children. If any child 
output partitioning matches the required 

[jira] [Created] (SPARK-42935) Optimize shuffle for union spark plan

2023-03-27 Thread Jeff Min (Jira)
Jeff Min created SPARK-42935:


 Summary: Optimize shuffle for union spark plan
 Key: SPARK-42935
 URL: https://issues.apache.org/jira/browse/SPARK-42935
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Jeff Min
 Fix For: 3.5.0


Union plan does not take full advantage of its children's output partitionings 
when the output partitioning can't match the parent plan's required distribution. For 
example, table1 and table2 are both bucketed tables with bucket column id and 
bucket number 100. We apply a row_number window function after unioning the two 
tables.
create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table1 values(1, "s1");
insert into table1 values(2, "s2");
​
create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table2 values(1, "s3");
​
set spark.sql.shuffle.partitions=100;
explain select *, row_number() over(partition by id order by name desc) 
id_row_number from (select * from table1 union all select * from table2);
The physical plan is
AdaptiveSparkPlan isFinalPlan=false
+- Window [row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#28], [id#35], [name#36 DESC NULLS LAST]
   +- Sort [id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST], false, 0
      +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, 
[plan_id=88]
         +- Union
            :- FileScan csv spark_catalog.default.table1[id#35,name#36]
            +- FileScan csv spark_catalog.default.table2[id#37,name#38]
Although the two tables are bucketed by the id column, there is still an exchange 
plan after the union. The reason is that the union plan's output partitioning is null.

We can introduce a new approach to optimize away the exchange:
 # First, introduce a new RDD that consists of parent RDDs with the same number of 
partitions. The i-th partition of the new RDD corresponds to the i-th partition of 
each parent RDD.

 # Then, push the required distribution down to the union plan's children. If a 
child's output partitioning already satisfies the required distribution, that 
child's shuffle can be removed.

After these changes, the physical plan becomes:
AdaptiveSparkPlan isFinalPlan=false
+- Window [row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#0], [id#7], [name#8 DESC NULLS LAST]
   +- Sort [id#7 ASC NULLS FIRST, name#8 DESC NULLS LAST], false, 0
      +- UnionZip [ClusteredDistribution(ArrayBuffer(id#7),false,None), 
ClusteredDistribution(ArrayBuffer(id#9),false,None)], hashpartitioning(id#7, 
200)
         :- FileScan csv spark_catalog.default.table1[id#7,name#8]
         +- FileScan csv spark_catalog.default.table2[id#9,name#10]
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36999) Document the command ALTER TABLE RECOVER PARTITIONS

2023-03-27 Thread Juanvi Soler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705355#comment-17705355
 ] 

Juanvi Soler commented on SPARK-36999:
--

This looks already solved 
[here|https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-alter-table.md#recover-partitions]

> Document the command ALTER TABLE RECOVER PARTITIONS
> ---
>
> Key: SPARK-36999
> URL: https://issues.apache.org/jira/browse/SPARK-36999
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>  Labels: starter
>
> Update the page 
> [https://spark.apache.org/docs/3.1.2/sql-ref-syntax-ddl-alter-table.html,] 
> and document the command ALTER TABLE RECOVER PARTITIONS



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42934) Testing OrcEncryptionSuite using maven is always skipped

2023-03-27 Thread Yang Jie (Jira)
Yang Jie created SPARK-42934:


 Summary: Testing OrcEncryptionSuite using maven is always skipped
 Key: SPARK-42934
 URL: https://issues.apache.org/jira/browse/SPARK-42934
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 3.5.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42929) make mapInPandas / mapInArrow support "is_barrier"

2023-03-27 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu resolved SPARK-42929.

   Fix Version/s: 3.5.0
Target Version/s: 3.5.0
  Resolution: Fixed

> make mapInPandas / mapInArrow support "is_barrier"
> --
>
> Key: SPARK-42929
> URL: https://issues.apache.org/jira/browse/SPARK-42929
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.5.0
>
>
> make mapInPandas / mapInArrow support "is_barrier"



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14516) Clustering evaluator

2023-03-27 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069815#comment-16069815
 ] 

Marco Gaido edited comment on SPARK-14516 at 3/27/23 9:42 AM:
--

Hello everybody,

I have a proposal for a very efficient Silhouette implementation in a 
distributed environment. Here you can find the link with all the details of the 
solution. As soon as I finish the implementation and the tests, I will 
post the PR for this: 
-[https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view|https://drive.google.com/file/d/0B0Hyo__bG_3fdkNvSVNYX2E3ZU0/view].-
 (please refer to https://arxiv.org/abs/2303.14102).

Please tell me if you have any comments or doubts about it.

Thanks.


was (Author: mgaido):
Hello everybody,

I have a proposal for a very efficient Silhouette implementation in a 
distributed environment. Here you can find the link with all the details of the 
solution. As soon as I finish the implementation and the tests, I will 
post the PR for this: 
https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view.

Please tell me if you have any comments or doubts about it.

Thanks.


> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Ruifeng Zheng
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.0
>
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14516) Clustering evaluator

2023-03-27 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705265#comment-17705265
 ] 

Marco Gaido commented on SPARK-14516:
-

As there are some issues with the Google Doc sharing, I prepared a paper that 
explains the metric and its definition, please disregard the previous link and 
refer to [https://arxiv.org/abs/2303.14102] if you are interested. It would be 
also easier for me to update it in the future if needed. Thanks.

> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Ruifeng Zheng
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.0
>
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.
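
For context, a minimal usage sketch of the {{ClusteringEvaluator}} added in spark.ml (it assumes a DataFrame named {{dataset}} with a "features" vector column):
{code:scala}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Fit any clustering model, then score its predictions.
val model = new KMeans().setK(2).setSeed(1L).fit(dataset)
val predictions = model.transform(dataset)

// Computes the Silhouette score (squared Euclidean distance by default).
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")
{code}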



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39722) Make Dataset.showString() public

2023-03-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705257#comment-17705257
 ] 

ASF GitHub Bot commented on SPARK-39722:


User 'VindhyaG' has created a pull request for this issue:
https://github.com/apache/spark/pull/40553

> Make Dataset.showString() public
> 
>
> Key: SPARK-39722
> URL: https://issues.apache.org/jira/browse/SPARK-39722
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.3.0
>Reporter: Jatin Sharma
>Priority: Trivial
>
> Currently, we have {{.show}} APIs on a Dataset, but they print directly to 
> stdout.
> But there are a lot of cases where we might need to get a String 
> representation of the show output. For example
>  * We have a logging framework to which we need to push the representation of 
> a df
>  * We have to send the string over a REST call from the driver
>  * We want to send the string to stderr instead of stdout
> For such cases, currently one needs to do a hack by changing the Console.out 
> temporarily and catching the representation in a ByteArrayOutputStream or 
> similar, then extracting the string from it.
> Strictly only printing to stdout seems like a limiting choice. 
>  
> Solution:
> We expose APIs to return the String representation back. We already have the 
> .{{{}showString{}}} method internally.
>  
> We could mirror the current {{.show}} APIs with a corresponding 
> {{.showString}} (and rename the internal private function to something else 
> if required)
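
For illustration, a minimal sketch of the workaround described above, which redirects {{Console.out}} into a buffer; {{captureShow}} is a hypothetical helper name, not an existing API:
{code:scala}
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets

import org.apache.spark.sql.DataFrame

// Capture what df.show() would print to stdout and return it as a String.
def captureShow(df: DataFrame, numRows: Int = 20, truncate: Boolean = true): String = {
  val buffer = new ByteArrayOutputStream()
  Console.withOut(buffer) {
    df.show(numRows, truncate)
  }
  buffer.toString(StandardCharsets.UTF_8.name())
}
{code}
A public {{showString}} would make this kind of redirection unnecessary.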



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42931) dropDuplicates within watermark

2023-03-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705252#comment-17705252
 ] 

ASF GitHub Bot commented on SPARK-42931:


User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/40561

> dropDuplicates within watermark
> ---
>
> Key: SPARK-42931
> URL: https://issues.apache.org/jira/browse/SPARK-42931
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Jungtaek Lim
>Priority: Major
> Attachments: [External] Mini design doc_ dropDuplicates within 
> watermark.pdf
>
>
> We got many reports that dropDuplicates does not clean up the state even 
> though they have set a watermark for the query. We document the behavior 
> clearly that the event time column should be a part of the subset columns for 
> deduplication to clean up the state, but it cannot be applied to the 
> customers as timestamps are not exactly the same for duplicated events in 
> their use cases.
> We propose to introduce a new API of dropDuplicates which has the following 
> different characteristics compared to the existing dropDuplicates:
>  * Weaker constraints on the subset (key)
>  ** Does not require an event time column on the subset.
>  * Looser semantics on deduplication
>  ** Only guarantees to deduplicate events within the watermark.
> Since the new API leverages event time, it has the following new 
> requirements:
>  * The input must be streaming DataFrame.
>  * The watermark must be defined.
>  * The event time column must be defined in the input DataFrame.
> More specifically on the semantic, once the operator processes the first 
> arrived event, events arriving within the watermark for the first event will 
> be deduplicated.
> (Technically, the expiration time should be the “event time of the first 
> arrived event + watermark delay threshold”, to match up with future events.)
> Users are encouraged to set the delay threshold of watermark longer than max 
> timestamp differences among duplicated events. (If they are unsure, they can 
> alternatively set the delay threshold large enough, e.g. 48 hours.)
> Longer design doc will be attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42933) PySpark: dropDuplicates within watermark

2023-03-27 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-42933:


 Summary: PySpark: dropDuplicates within watermark
 Key: SPARK-42933
 URL: https://issues.apache.org/jira/browse/SPARK-42933
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Structured Streaming
Affects Versions: 3.5.0
Reporter: Jungtaek Lim


This is a follow-up issue for SPARK-42931, split out to ease the review of SPARK-42931.

Add the public API introduced in SPARK-42931 to PySpark as well, and add 
a PySpark code example in the guide doc.

Once we merge SPARK-42931, this ticket should ideally be marked as a blocker.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42932) Spark 3.3.2, with hadoop3, Error with java.io.IOException: Mkdirs failed to create file

2023-03-27 Thread shamim (Jira)
shamim created SPARK-42932:
--

 Summary: Spark 3.3.2, with hadoop3,  Error with 
java.io.IOException: Mkdirs failed to create file
 Key: SPARK-42932
 URL: https://issues.apache.org/jira/browse/SPARK-42932
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Spark Submit
Affects Versions: 3.3.2
Reporter: shamim


We are using Spark 3.3.2 with the Hadoop 3 build that ships with Spark.

[https://www.apache.org/dyn/closer.lua/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz]
 

[https://www.apache.org/dyn/closer.lua/spark/spark-3.3.2/spark-3.3.2-bin-hadoop2.tgz]
 

Spark in our application is used in standalone mode, and we are not using the 
HDFS file system.

Spark writes to the local file system.

The same Spark version 3.3.2 works fine with Hadoop 2, but with Hadoop 3 we 
are getting the issue below.

 

23/03/18 20:23:24 WARN TaskSetManager: Lost task 4.0 in stage 0.0 (TID 4) 
(10.64.109.72 executor 0): java.io.IOException: Mkdirs failed to create 
file:/var/backup/_temporary/0/_temporary/attempt_202301182023173234741341853025716_0005_m_04_0 
(exists=false, 
cwd=file:/opt/spark-3.3.2/work/app-20230118202317-0001/0)
        at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:515)
        at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:500)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1195)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1081)
        at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:113)
        at 
org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.initWriter(SparkHadoopWriter.scala:238)
        at 
org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:126)
        at 
org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:88)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750) 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42931) dropDuplicates within watermark

2023-03-27 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-42931:
-
Attachment: [External] Mini design doc_ dropDuplicates within watermark.pdf

> dropDuplicates within watermark
> ---
>
> Key: SPARK-42931
> URL: https://issues.apache.org/jira/browse/SPARK-42931
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Jungtaek Lim
>Priority: Major
> Attachments: [External] Mini design doc_ dropDuplicates within 
> watermark.pdf
>
>
> We got many reports that dropDuplicates does not clean up the state even 
> though they have set a watermark for the query. We document the behavior 
> clearly that the event time column should be a part of the subset columns for 
> deduplication to clean up the state, but it cannot be applied to the 
> customers as timestamps are not exactly the same for duplicated events in 
> their use cases.
> We propose to introduce a new API of dropDuplicates which has the following 
> different characteristics compared to the existing dropDuplicates:
>  * Weaker constraints on the subset (key)
>  ** Does not require an event time column on the subset.
>  * Looser semantics on deduplication
>  ** Only guarantees to deduplicate events within the watermark.
> Since the new API leverages event time, it has the following new 
> requirements:
>  * The input must be streaming DataFrame.
>  * The watermark must be defined.
>  * The event time column must be defined in the input DataFrame.
> More specifically on the semantic, once the operator processes the first 
> arrived event, events arriving within the watermark for the first event will 
> be deduplicated.
> (Technically, the expiration time should be the “event time of the first 
> arrived event + watermark delay threshold”, to match up with future events.)
> Users are encouraged to set the delay threshold of watermark longer than max 
> timestamp differences among duplicated events. (If they are unsure, they can 
> alternatively set the delay threshold large enough, e.g. 48 hours.)
> Longer design doc will be attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42930) Change the access scope of `ProtobufSerDe` related implementations to `private[spark]`

2023-03-27 Thread Yang Jie (Jira)
Yang Jie created SPARK-42930:


 Summary: Change the access scope of `ProtobufSerDe` related 
implementations to `private[spark]`
 Key: SPARK-42930
 URL: https://issues.apache.org/jira/browse/SPARK-42930
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0, 3.5.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42931) dropDuplicates within watermark

2023-03-27 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-42931:


 Summary: dropDuplicates within watermark
 Key: SPARK-42931
 URL: https://issues.apache.org/jira/browse/SPARK-42931
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 3.5.0
Reporter: Jungtaek Lim


We got many reports that dropDuplicates does not clean up the state even though 
they have set a watermark for the query. We document the behavior clearly that 
the event time column should be a part of the subset columns for deduplication 
to clean up the state, but it cannot be applied to the customers as timestamps 
are not exactly the same for duplicated events in their use cases.

We propose to introduce a new API of dropDuplicates which has the following 
different characteristics compared to the existing dropDuplicates:
 * Weaker constraints on the subset (key)
 ** Does not require an event time column on the subset.
 * Looser semantics on deduplication
 ** Only guarantees to deduplicate events within the watermark.

Since the new API leverages event time, it has the following new 
requirements:
 * The input must be streaming DataFrame.
 * The watermark must be defined.
 * The event time column must be defined in the input DataFrame.

More specifically on the semantic, once the operator processes the first 
arrived event, events arriving within the watermark for the first event will be 
deduplicated.
(Technically, the expiration time should be the “event time of the first 
arrived event + watermark delay threshold”, to match up with future events.)

Users are encouraged to set the delay threshold of watermark longer than max 
timestamp differences among duplicated events. (If they are unsure, they can 
alternatively set the delay threshold large enough, e.g. 48 hours.)

Longer design doc will be attached.
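
For context, a minimal sketch of today's behavior that this proposal relaxes (the source and column names are illustrative): state cleanup currently only happens when the event time column is part of the deduplication subset.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

// Illustrative streaming source with an event time column and a business key.
val events = spark.readStream.format("rate").load()
  .withColumnRenamed("timestamp", "eventTime")
  .withColumn("id", col("value") % 10)

// Existing API: the watermark only bounds state if "eventTime" is in the subset,
// which breaks down when duplicates carry slightly different timestamps.
val deduped = events
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("id", "eventTime")
{code}
The proposed API would instead deduplicate on "id" alone while still expiring state once events fall outside the watermark; its exact name and signature are left to the attached design doc.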



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42929) make mapInPandas / mapInArrow support "is_barrier"

2023-03-27 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-42929:
--

Assignee: Weichen Xu

> make mapInPandas / mapInArrow support "is_barrier"
> --
>
> Key: SPARK-42929
> URL: https://issues.apache.org/jira/browse/SPARK-42929
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>
> make mapInPandas / mapInArrow support "is_barrier"



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42929) make mapInPandas / mapInArrow support "is_barrier"

2023-03-27 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-42929:
--

 Summary: make mapInPandas / mapInArrow support "is_barrier"
 Key: SPARK-42929
 URL: https://issues.apache.org/jira/browse/SPARK-42929
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Weichen Xu


make mapInPandas / mapInArrow support "is_barrier"



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org