[jira] [Resolved] (SPARK-42908) Raise RuntimeError if SparkContext is not initialized when parsing DDL-formatted type strings
[ https://issues.apache.org/jira/browse/SPARK-42908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-42908.
----------------------------------
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 40534
[https://github.com/apache/spark/pull/40534]

> Raise RuntimeError if SparkContext is not initialized when parsing
> DDL-formatted type strings
> ------------------------------------------------------------------
>
> Key: SPARK-42908
> URL: https://issues.apache.org/jira/browse/SPARK-42908
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
> Fix For: 3.4.0
>
> Raise RuntimeError if SparkContext is not initialized when parsing
> DDL-formatted type strings.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42908) Raise RuntimeError if SparkContext is not initialized when parsing DDL-formatted type strings
[ https://issues.apache.org/jira/browse/SPARK-42908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-42908:
------------------------------------
Assignee: Xinrong Meng

> Raise RuntimeError if SparkContext is not initialized when parsing
> DDL-formatted type strings
> ------------------------------------------------------------------
>
> Key: SPARK-42908
> URL: https://issues.apache.org/jira/browse/SPARK-42908
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Xinrong Meng
> Assignee: Xinrong Meng
> Priority: Major
>
> Raise RuntimeError if SparkContext is not initialized when parsing
> DDL-formatted type strings.
[jira] [Assigned] (SPARK-37677) spark on k8s, when the user want to push python3.6.6.zip to the pod , but no permission to execute
[ https://issues.apache.org/jira/browse/SPARK-37677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-37677:
------------------------------------
Assignee: jingxiong zhong

> spark on k8s, when the user want to push python3.6.6.zip to the pod , but no
> permission to execute
> ----------------------------------------------------------------------------
>
> Key: SPARK-37677
> URL: https://issues.apache.org/jira/browse/SPARK-37677
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.0
> Reporter: jingxiong zhong
> Assignee: jingxiong zhong
> Priority: Major
>
> In cluster mode, I have another question: when I unzip python3.6.6.zip in the
> pod, there is no permission to execute it. My submit command is as follows:
> {code:sh}
> spark-submit \
> --archives ./python3.6.6.zip#python3.6.6 \
> --conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
> --conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> ./examples/src/main/python/pi.py 100
> {code}
[jira] [Resolved] (SPARK-37677) spark on k8s, when the user want to push python3.6.6.zip to the pod , but no permission to execute
[ https://issues.apache.org/jira/browse/SPARK-37677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-37677.
----------------------------------
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 40572
[https://github.com/apache/spark/pull/40572]

> spark on k8s, when the user want to push python3.6.6.zip to the pod , but no
> permission to execute
> ----------------------------------------------------------------------------
>
> Key: SPARK-37677
> URL: https://issues.apache.org/jira/browse/SPARK-37677
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.0
> Reporter: jingxiong zhong
> Assignee: jingxiong zhong
> Priority: Major
> Fix For: 3.5.0
>
> In cluster mode, I have another question: when I unzip python3.6.6.zip in the
> pod, there is no permission to execute it. My submit command is as follows:
> {code:sh}
> spark-submit \
> --archives ./python3.6.6.zip#python3.6.6 \
> --conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
> --conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> ./examples/src/main/python/pi.py 100
> {code}
[jira] [Created] (SPARK-42944) Support foreachBatch() in streaming spark connect
Raghu Angadi created SPARK-42944:
---------------------------------
Summary: Support foreachBatch() in streaming spark connect
Key: SPARK-42944
URL: https://issues.apache.org/jira/browse/SPARK-42944
Project: Spark
Issue Type: Task
Components: Connect, Structured Streaming
Affects Versions: 3.5.0
Reporter: Raghu Angadi

Add support for foreachBatch() in streaming Spark Connect. This may need a
deep dive into the various complexities of running arbitrary Spark code
inside the foreachBatch block.
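The contract the ticket asks Spark Connect to honor can be modeled outside Spark. A minimal plain-Python sketch (`run_foreach_batch` is a hypothetical name, not a Spark API): the engine invokes the user's function once per micro-batch with the batch and its id, and the hard part in Connect, which this sketch deliberately ignores, is shipping that arbitrary client-side function to where the batches live.

```python
def run_foreach_batch(micro_batches, func):
    # Hypothetical model of the foreachBatch contract: invoke the user's
    # func(batch, batch_id) once per micro-batch, in arrival order.
    for batch_id, batch in enumerate(micro_batches):
        func(batch, batch_id)

# Record (batch_id, batch_size) for two toy micro-batches.
seen = []
run_foreach_batch([["a"], ["b", "c"]],
                  lambda batch, bid: seen.append((bid, len(batch))))
```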
[jira] [Commented] (SPARK-42943) Use LONGTEXT instead of TEXT for StringType
[ https://issues.apache.org/jira/browse/SPARK-42943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705791#comment-17705791 ]

Snoot.io commented on SPARK-42943:
----------------------------------
User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/40573

> Use LONGTEXT instead of TEXT for StringType
> -------------------------------------------
>
> Key: SPARK-42943
> URL: https://issues.apache.org/jira/browse/SPARK-42943
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Kent Yao
> Priority: Major
>
> MysqlDataTruncation will be thrown if the string length exceeds 65535
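The dialect decision behind this change can be sketched in isolation. A hypothetical helper (the function name and signature are illustrative, not Spark's actual JdbcDialect API): MySQL's TEXT column holds at most 65,535 bytes, so without a known bound on string length, mapping StringType to LONGTEXT avoids MysqlDataTruncation.

```python
MYSQL_TEXT_MAX_BYTES = 65535  # capacity of a MySQL TEXT column

def mysql_string_column_type(known_max_len=None):
    # With no known upper bound, prefer LONGTEXT so values longer than
    # TEXT's 65,535-byte cap do not raise MysqlDataTruncation on write;
    # TEXT still suffices when a small bound is known.
    if known_max_len is not None and known_max_len <= MYSQL_TEXT_MAX_BYTES:
        return "TEXT"
    return "LONGTEXT"
```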
[jira] [Commented] (SPARK-42937) Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled
[ https://issues.apache.org/jira/browse/SPARK-42937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705790#comment-17705790 ]

Snoot.io commented on SPARK-42937:
----------------------------------
User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/40569

> Join with subquery in condition can fail with wholestage codegen and adaptive
> execution disabled
> -----------------------------------------------------------------------------
>
> Key: SPARK-42937
> URL: https://issues.apache.org/jira/browse/SPARK-42937
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.2, 3.4.0, 3.5.0
> Reporter: Bruce Robbins
> Priority: Major
>
> The below left outer join gets an error:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
> (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
> (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
> as v1(key, value1, value2, value3, value4, value5, value6, value7, value8,
> value9, value10);
> create or replace temp view v2 as
> select * from values
> (1, 2),
> (3, 8),
> (7, 9)
> as v2(a, b);
> create or replace temp view v3 as
> select * from values
> (3),
> (8)
> as v3(col1);
> set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100
> set spark.sql.adaptive.enabled=false;
> select *
> from v1
> left outer join v2
> on key = a
> and key in (select col1 from v3);
> {noformat}
> The join fails during predicate codegen:
> {noformat}
> 23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to
> interpreter mode
> java.lang.IllegalArgumentException: requirement failed: input[0, int, false]
> IN subquery#34 has not finished
> at scala.Predef$.require(Predef.scala:281)
> at org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
> at org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156)
> at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201)
> at scala.Option.getOrElse(Option.scala:189)
> at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278)
> at scala.collection.immutable.List.map(List.scala:293)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278)
> at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41)
> at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33)
> at org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73)
> at org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70)
> at org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51)
> at org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86)
> at org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146)
> at org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140)
> at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40)
> at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40)
> {noformat}
> It fails again after fallback to interpreter mode:
> {noformat}
> 23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
> java.lang.IllegalArgumentException: requirement failed: input[0, int, false]
> IN subquery#34 has not finished
> at scala.Predef$.require(Predef.scala:281)
> at org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
> at org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151)
> at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
> at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146)
> at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146)
> at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205)
> {noformat}
> Both the predicate codegen and the evaluation fail for the same reason:
> {{PlanSubqueries}} creates {{InSubqueryExec}} with {{shouldBroadcast=false}}.
> The driver waits for the subquery to finish, but it's the executor that uses
> the
[jira] [Commented] (SPARK-41233) High-order function: array_prepend
[ https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705789#comment-17705789 ]

Snoot.io commented on SPARK-41233:
----------------------------------
User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/40563

> High-order function: array_prepend
> ----------------------------------
>
> Key: SPARK-41233
> URL: https://issues.apache.org/jira/browse/SPARK-41233
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark, SQL
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Takuya Ueshin
> Priority: Major
> Fix For: 3.5.0
>
> refer to
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html
> 1. About the data type validation:
> In Snowflake's array_append, array_prepend and array_insert functions, the
> element data type does not need to match the data type of the existing
> elements in the array. In Spark, we want to apply the same data type
> validation as array_remove.
> 2. About the NULL handling:
> Currently, Spark SQL, SnowSQL and PostgreSQL deal with NULL values in
> different ways. The existing functions array_contains, array_position and
> array_remove in Spark SQL handle NULL this way: if the input array and/or
> element is NULL, return NULL. array_prepend, however, should break from this
> behavior and handle NULL as follows:
> 2.1. If the array is NULL, return NULL.
> 2.2. If the array is not NULL and the element is NULL, prepend the NULL
> value to the array.
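The NULL handling proposed above is easy to state as a pure-Python model, with `None` standing in for SQL NULL. This is a semantics sketch only, not the Catalyst implementation:

```python
def array_prepend(arr, elem):
    # Proposed semantics: a NULL array yields NULL, but a NULL element is
    # prepended into a non-NULL array instead of propagating NULL (unlike
    # array_contains / array_position / array_remove today).
    if arr is None:
        return None
    return [elem] + arr
```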
[jira] [Resolved] (SPARK-42922) Use SecureRandom, instead of Random in security sensitive contexts
[ https://issues.apache.org/jira/browse/SPARK-42922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen resolved SPARK-42922.
----------------------------------
Fix Version/s: 3.3.3
               3.4.1
               3.5.0
Assignee: Mridul Muralidharan
Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/40568

> Use SecureRandom, instead of Random in security sensitive contexts
> ------------------------------------------------------------------
>
> Key: SPARK-42922
> URL: https://issues.apache.org/jira/browse/SPARK-42922
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.3, 3.3.2, 3.4.0, 3.5.0
> Reporter: Mridul Muralidharan
> Assignee: Mridul Muralidharan
> Priority: Minor
> Fix For: 3.3.3, 3.4.1, 3.5.0
>
> Most uses of Random in Spark are either in test cases or where we need a
> repeatable pseudo-random number. The following are usages where moving from
> Random to SecureRandom would be useful:
> a) HttpAuthUtils.createCookieToken
> b) ThriftHttpServlet.RAN
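The principle behind the fix, shown here in Python rather than the issue's Java (an analogy, not Spark's code): a general-purpose PRNG like `random` (Java's `Random`) produces predictable output, so tokens built from it can be guessed, while `secrets` (Java's `SecureRandom`) draws from the OS CSPRNG, which is what a cookie or session token needs.

```python
import random
import secrets

def insecure_token(n_bytes=16):
    # What the issue warns against: output of a seeded, predictable PRNG
    # used as a security token.
    return "".join(random.choice("0123456789abcdef")
                   for _ in range(2 * n_bytes))

def secure_token(n_bytes=16):
    # SecureRandom's Python analogue: secrets draws from the OS CSPRNG.
    return secrets.token_hex(n_bytes)
```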
[jira] [Updated] (SPARK-42922) Use SecureRandom, instead of Random in security sensitive contexts
[ https://issues.apache.org/jira/browse/SPARK-42922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen updated SPARK-42922:
---------------------------------
Priority: Minor (was: Major)

> Use SecureRandom, instead of Random in security sensitive contexts
> ------------------------------------------------------------------
>
> Key: SPARK-42922
> URL: https://issues.apache.org/jira/browse/SPARK-42922
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.3, 3.3.2, 3.4.0, 3.5.0
> Reporter: Mridul Muralidharan
> Priority: Minor
>
> Most uses of Random in Spark are either in test cases or where we need a
> repeatable pseudo-random number. The following are usages where moving from
> Random to SecureRandom would be useful:
> a) HttpAuthUtils.createCookieToken
> b) ThriftHttpServlet.RAN
[jira] [Assigned] (SPARK-41876) Implement DataFrame `toLocalIterator`
[ https://issues.apache.org/jira/browse/SPARK-41876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-41876:
------------------------------------
Assignee: Takuya Ueshin

> Implement DataFrame `toLocalIterator`
> -------------------------------------
>
> Key: SPARK-41876
> URL: https://issues.apache.org/jira/browse/SPARK-41876
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Takuya Ueshin
> Priority: Major
>
> {code:java}
> schema = StructType(
>     [StructField("i", StringType(), True), StructField("j", IntegerType(), True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
>     AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by {code}
[jira] [Resolved] (SPARK-41876) Implement DataFrame `toLocalIterator`
[ https://issues.apache.org/jira/browse/SPARK-41876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-41876.
----------------------------------
Fix Version/s: 3.4.0
Resolution: Fixed

Issue resolved by pull request 40570
[https://github.com/apache/spark/pull/40570]

> Implement DataFrame `toLocalIterator`
> -------------------------------------
>
> Key: SPARK-41876
> URL: https://issues.apache.org/jira/browse/SPARK-41876
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Sandeep Singh
> Assignee: Takuya Ueshin
> Priority: Major
> Fix For: 3.4.0
>
> {code:java}
> schema = StructType(
>     [StructField("i", StringType(), True), StructField("j", IntegerType(), True)]
> )
> df = self.spark.createDataFrame([("a", 1)], schema)
> schema1 = StructType([StructField("j", StringType()), StructField("i", StringType())])
> df1 = df.to(schema1)
> self.assertEqual(schema1, df1.schema)
> self.assertEqual(df.count(), df1.count())
> schema2 = StructType([StructField("j", LongType())])
> df2 = df.to(schema2)
> self.assertEqual(schema2, df2.schema)
> self.assertEqual(df.count(), df2.count())
> schema3 = StructType([StructField("struct", schema1, False)])
> df3 = df.select(struct("i", "j").alias("struct")).to(schema3)
> self.assertEqual(schema3, df3.schema)
> self.assertEqual(df.count(), df3.count())
> # incompatible field nullability
> schema4 = StructType([StructField("j", LongType(), False)])
> self.assertRaisesRegex(
>     AnalysisException, "NULLABLE_COLUMN_OR_FIELD", lambda: df.to(schema4)
> ){code}
> {code:java}
> Traceback (most recent call last):
>   File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_dataframe.py", line 1486, in test_to
>     self.assertRaisesRegex(
> AssertionError: AnalysisException not raised by {code}
[jira] [Commented] (SPARK-42895) ValueError when invoking any session operations on a stopped Spark session
[ https://issues.apache.org/jira/browse/SPARK-42895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705782#comment-17705782 ]

Snoot.io commented on SPARK-42895:
----------------------------------
User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/40536

> ValueError when invoking any session operations on a stopped Spark session
> --------------------------------------------------------------------------
>
> Key: SPARK-42895
> URL: https://issues.apache.org/jira/browse/SPARK-42895
> Project: Spark
> Issue Type: Bug
> Components: Connect
> Affects Versions: 3.5.0
> Reporter: Allison Wang
> Priority: Major
>
> If a remote Spark session is stopped, trying to invoke any session
> operations will result in a ValueError. For example:
> {code:java}
> spark.stop()
> spark.sql("select 1")
> ValueError: Cannot invoke RPC: Channel closed!
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> ...
>     return e.code() == grpc.StatusCode.UNAVAILABLE
> AttributeError: 'ValueError' object has no attribute 'code'{code}
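The traceback above shows an error handler assuming every exception is a gRPC error with a `.code()` method. A minimal defensive sketch (`is_retryable_unavailable` is a hypothetical helper; the real client compares against `grpc.StatusCode.UNAVAILABLE`, for which a plain string stands in here): probe for a callable `.code` before calling it, so a plain ValueError from a closed channel is not turned into an AttributeError.

```python
def is_retryable_unavailable(exc):
    # A plain ValueError ("Cannot invoke RPC: Channel closed!") has no
    # .code() method; calling it blindly replaces one error with an
    # AttributeError. Guard with getattr + callable first.
    code_fn = getattr(exc, "code", None)
    return callable(code_fn) and code_fn() == "UNAVAILABLE"
```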
[jira] [Created] (SPARK-42943) Use LONGTEXT instead of TEXT for StringType
Kent Yao created SPARK-42943:
-----------------------------
Summary: Use LONGTEXT instead of TEXT for StringType
Key: SPARK-42943
URL: https://issues.apache.org/jira/browse/SPARK-42943
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.4.0
Reporter: Kent Yao

MysqlDataTruncation will be thrown if the string length exceeds 65535.
[jira] [Commented] (SPARK-37677) spark on k8s, when the user want to push python3.6.6.zip to the pod , but no permission to execute
[ https://issues.apache.org/jira/browse/SPARK-37677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705781#comment-17705781 ]

Snoot.io commented on SPARK-37677:
----------------------------------
User 'smallzhongfeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40572

> spark on k8s, when the user want to push python3.6.6.zip to the pod , but no
> permission to execute
> ----------------------------------------------------------------------------
>
> Key: SPARK-37677
> URL: https://issues.apache.org/jira/browse/SPARK-37677
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.0
> Reporter: jingxiong zhong
> Priority: Major
>
> In cluster mode, I have another question: when I unzip python3.6.6.zip in the
> pod, there is no permission to execute it. My submit command is as follows:
> {code:sh}
> spark-submit \
> --archives ./python3.6.6.zip#python3.6.6 \
> --conf "spark.pyspark.python=python3.6.6/python3.6.6/bin/python3" \
> --conf "spark.pyspark.driver.python=python3.6.6/python3.6.6/bin/python3" \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> ./examples/src/main/python/pi.py 100
> {code}
[jira] [Updated] (SPARK-38200) [SQL] Spark JDBC Savemode Supports Upsert
[ https://issues.apache.org/jira/browse/SPARK-38200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

melin updated SPARK-38200:
--------------------------
Description:
upsert sql for different databases; most databases support merge sql:

sqlserver merge into sql:
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/sqlserver/SqlServerDialect.java]
mysql:
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/mysql/MysqlDialect.java]
oracle merge into sql:
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/oracle/OracleDialect.java]
postgres:
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/psql/PostgresDialect.java]
postgres merge into sql:
[https://www.postgresql.org/docs/current/sql-merge.html]
db2 merge into sql:
[https://www.ibm.com/docs/en/db2-for-zos/12?topic=statements-merge]
derby merge into sql:
[https://db.apache.org/derby/docs/10.14/ref/rrefsqljmerge.html]
h2 merge into sql:
[https://www.tutorialspoint.com/h2_database/h2_database_merge.htm]

[~beliefer] [~cloud_fan]

was:
When writing data into a relational database, data duplication needs to be
considered. Both mysql and postgres support upsert syntax.

mysql:
{code:java}
replace into t(id, update_time) values(1, now()); {code}
pg:
{code:java}
INSERT INTO %s (id,name,data_time,remark) VALUES ( ?,?,?,? ) ON CONFLICT
(id,name) DO UPDATE SET
id=excluded.id,name=excluded.name,data_time=excluded.data_time,remark=excluded.remark
{code}

> [SQL] Spark JDBC Savemode Supports Upsert
> -----------------------------------------
>
> Key: SPARK-38200
> URL: https://issues.apache.org/jira/browse/SPARK-38200
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: melin
> Priority: Major
>
> upsert sql for different databases; most databases support merge sql:
> sqlserver merge into sql:
> [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/sqlserver/SqlServerDialect.java]
> mysql:
> [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/mysql/MysqlDialect.java]
> oracle merge into sql:
> [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/oracle/OracleDialect.java]
> postgres:
> [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/psql/PostgresDialect.java]
> postgres merge into sql:
> [https://www.postgresql.org/docs/current/sql-merge.html]
> db2 merge into sql:
> [https://www.ibm.com/docs/en/db2-for-zos/12?topic=statements-merge]
> derby merge into sql:
> [https://db.apache.org/derby/docs/10.14/ref/rrefsqljmerge.html]
> h2 merge into sql:
> [https://www.tutorialspoint.com/h2_database/h2_database_merge.htm]
> [~beliefer] [~cloud_fan]
[jira] (SPARK-38200) [SQL] Spark JDBC Savemode Supports Upsert
[ https://issues.apache.org/jira/browse/SPARK-38200 ]

melin deleted comment on SPARK-38200:
-------------------------------------
was (Author: melin):
upsert sql for different databases; most databases support merge sql:
sqlserver merge into sql:
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/sqlserver/SqlServerDialect.java]
mysql:
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/mysql/MysqlDialect.java]
oracle merge into sql:
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/oracle/OracleDialect.java]
postgres:
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/psql/PostgresDialect.java]
postgres merge into sql:
[https://www.postgresql.org/docs/current/sql-merge.html]
db2 merge into sql:
[https://www.ibm.com/docs/en/db2-for-zos/12?topic=statements-merge]
derby merge into sql:
[https://db.apache.org/derby/docs/10.14/ref/rrefsqljmerge.html]
h2 merge into sql:
[https://www.tutorialspoint.com/h2_database/h2_database_merge.htm]
[~beliefer] [~cloud_fan]

> [SQL] Spark JDBC Savemode Supports Upsert
> -----------------------------------------
>
> Key: SPARK-38200
> URL: https://issues.apache.org/jira/browse/SPARK-38200
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: melin
> Priority: Major
>
> When writing data into a relational database, data duplication needs to be
> considered. Both mysql and postgres support upsert syntax.
> mysql:
> {code:java}
> replace into t(id, update_time) values(1, now()); {code}
> pg:
> {code:java}
> INSERT INTO %s (id,name,data_time,remark) VALUES ( ?,?,?,? ) ON CONFLICT
> (id,name) DO UPDATE SET
> id=excluded.id,name=excluded.name,data_time=excluded.data_time,remark=excluded.remark
> {code}
[jira] [Updated] (SPARK-38200) [SQL] Spark JDBC Savemode Supports Upsert
[ https://issues.apache.org/jira/browse/SPARK-38200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

melin updated SPARK-38200:
--------------------------
Summary: [SQL] Spark JDBC Savemode Supports Upsert (was: [SQL] Spark JDBC
Savemode Supports replace)

> [SQL] Spark JDBC Savemode Supports Upsert
> -----------------------------------------
>
> Key: SPARK-38200
> URL: https://issues.apache.org/jira/browse/SPARK-38200
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: melin
> Priority: Major
>
> When writing data into a relational database, data duplication needs to be
> considered. Both mysql and postgres support upsert syntax.
> mysql:
> {code:java}
> replace into t(id, update_time) values(1, now()); {code}
> pg:
> {code:java}
> INSERT INTO %s (id,name,data_time,remark) VALUES ( ?,?,?,? ) ON CONFLICT
> (id,name) DO UPDATE SET
> id=excluded.id,name=excluded.name,data_time=excluded.data_time,remark=excluded.remark
> {code}
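The dialect-specific statements the request implies can be sketched generically. A hypothetical helper (not Spark's JdbcDialect API) covering the two upsert syntaxes quoted in the issue; dialects that use MERGE (Oracle, SQL Server, DB2, Derby, H2) would need their own templates:

```python
def upsert_sql(dialect, table, columns, key_columns):
    # Build an idempotent INSERT for dialects with native upsert syntax.
    cols = ", ".join(columns)
    marks = ", ".join(["?"] * len(columns))
    if dialect == "mysql":
        # MySQL: ON DUPLICATE KEY UPDATE keyed on any unique index.
        sets = ", ".join(f"{c}=VALUES({c})" for c in columns)
        return (f"INSERT INTO {table} ({cols}) VALUES ({marks}) "
                f"ON DUPLICATE KEY UPDATE {sets}")
    if dialect == "postgresql":
        # PostgreSQL: ON CONFLICT on the named key columns.
        keys = ", ".join(key_columns)
        sets = ", ".join(f"{c}=excluded.{c}" for c in columns)
        return (f"INSERT INTO {table} ({cols}) VALUES ({marks}) "
                f"ON CONFLICT ({keys}) DO UPDATE SET {sets}")
    raise ValueError(f"no upsert template for dialect: {dialect}")
```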
[jira] [Resolved] (SPARK-41630) Support lateral column alias in Project code path
[ https://issues.apache.org/jira/browse/SPARK-41630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-41630.
---------------------------------
Fix Version/s: 3.4.0
Assignee: Xinyi Yu
Resolution: Fixed

> Support lateral column alias in Project code path
> -------------------------------------------------
>
> Key: SPARK-41630
> URL: https://issues.apache.org/jira/browse/SPARK-41630
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Xinyi Yu
> Assignee: Xinyi Yu
> Priority: Major
> Fix For: 3.4.0
[jira] [Comment Edited] (SPARK-38200) [SQL] Spark JDBC Savemode Supports replace
[ https://issues.apache.org/jira/browse/SPARK-38200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705765#comment-17705765 ]

melin edited comment on SPARK-38200 at 3/28/23 3:00 AM:
--------------------------------------------------------
upsert sql for different databases; most databases support merge sql:
sqlserver merge into sql:
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/sqlserver/SqlServerDialect.java]
mysql:
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/mysql/MysqlDialect.java]
oracle merge into sql:
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/oracle/OracleDialect.java]
postgres:
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/psql/PostgresDialect.java]
postgres merge into sql:
[https://www.postgresql.org/docs/current/sql-merge.html]
db2 merge into sql:
[https://www.ibm.com/docs/en/db2-for-zos/12?topic=statements-merge]
derby merge into sql:
[https://db.apache.org/derby/docs/10.14/ref/rrefsqljmerge.html]
h2 merge into sql:
[https://www.tutorialspoint.com/h2_database/h2_database_merge.htm]
[~beliefer] [~cloud_fan]

was (Author: melin):
upsert sql for different databases; most databases support merge sql:
sqlserver merge into sql:
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/sqlserver/SqlServerDialect.java]
mysql:
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/mysql/MysqlDialect.java]
oracle merge into sql:
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/oracle/OracleDialect.java]
postgres:
[https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/psql/PostgresDialect.java]
db2 merge into sql:
[https://www.ibm.com/docs/en/db2-for-zos/12?topic=statements-merge]
derby merge into sql:
[https://db.apache.org/derby/docs/10.14/ref/rrefsqljmerge.html]
h2 merge into sql:
[https://www.tutorialspoint.com/h2_database/h2_database_merge.htm]
[~beliefer] [~cloud_fan]

> [SQL] Spark JDBC Savemode Supports replace
> ------------------------------------------
>
> Key: SPARK-38200
> URL: https://issues.apache.org/jira/browse/SPARK-38200
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: melin
> Priority: Major
>
> When writing data into a relational database, data duplication needs to be
> considered. Both mysql and postgres support upsert syntax.
> mysql:
> {code:java}
> replace into t(id, update_time) values(1, now()); {code}
> pg:
> {code:java}
> INSERT INTO %s (id,name,data_time,remark) VALUES ( ?,?,?,? ) ON CONFLICT
> (id,name) DO UPDATE SET
> id=excluded.id,name=excluded.name,data_time=excluded.data_time,remark=excluded.remark
> {code}
[jira] [Commented] (SPARK-38200) [SQL] Spark JDBC Savemode Supports replace
[ https://issues.apache.org/jira/browse/SPARK-38200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705765#comment-17705765 ] melin commented on SPARK-38200: --- Upsert SQL for different databases; most databases support MERGE SQL: sqlserver merge into sql : [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/sqlserver/SqlServerDialect.java] mysql: [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/mysql/MysqlDialect.java] oracle merge into sql : [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/oracle/OracleDialect.java] postgres: [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/psql/PostgresDialect.java] db2 merge into sql : [https://www.ibm.com/docs/en/db2-for-zos/12?topic=statements-merge] derby merge into sql: [https://db.apache.org/derby/docs/10.14/ref/rrefsqljmerge.html] H2 merge into sql : [https://www.tutorialspoint.com/h2_database/h2_database_merge.htm] [~beliefer] [~cloud_fan] > [SQL] Spark JDBC Savemode Supports replace > -- > > Key: SPARK-38200 > URL: https://issues.apache.org/jira/browse/SPARK-38200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: melin >Priority: Major > > When writing data into a relational database, data duplication needs to be > considered. Both mysql and postgres support upsert syntax. > mysql: > {code:java} > replace into t(id, update_time) values(1, now()); {code} > pg: > {code:java} > INSERT INTO %s (id,name,data_time,remark) VALUES ( ?,?,?,? 
) ON CONFLICT > (id,name) DO UPDATE SET > id=excluded.id,name=excluded.name,data_time=excluded.data_time,remark=excluded.remark > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
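The dialect-specific upsert statements collected in the comments above can be sketched as plain SQL string builders. The helpers below (`build_mysql_replace`, `build_pg_upsert`) are hypothetical illustrations of the two syntaxes quoted in the issue description, not part of Spark's JDBC data source or any of the SeaTunnel dialects linked above.

```python
# Hypothetical helpers sketching dialect-specific upsert SQL; these are
# illustrations only, not part of Spark's JDBC writer.

def build_mysql_replace(table, columns):
    """MySQL-style REPLACE INTO with one JDBC placeholder per column."""
    cols = ", ".join(columns)
    marks = ", ".join(["?"] * len(columns))
    return f"REPLACE INTO {table} ({cols}) VALUES ({marks})"

def build_pg_upsert(table, columns, key_columns):
    """PostgreSQL INSERT ... ON CONFLICT ... DO UPDATE upsert."""
    cols = ", ".join(columns)
    marks = ", ".join(["?"] * len(columns))
    keys = ", ".join(key_columns)
    # Every column is refreshed from the excluded (proposed) row.
    updates = ", ".join(f"{c}=excluded.{c}" for c in columns)
    return (f"INSERT INTO {table} ({cols}) VALUES ({marks}) "
            f"ON CONFLICT ({keys}) DO UPDATE SET {updates}")
```

A JDBC SaveMode supporting replace would presumably dispatch to one builder per dialect, in the spirit of the SeaTunnel dialect classes linked in the comment.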
[jira] [Updated] (SPARK-42935) Optimize shuffle for union Spark plan
[ https://issues.apache.org/jira/browse/SPARK-42935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-42935: Fix Version/s: (was: 3.5.0) > Optimize shuffle for union Spark plan > > > Key: SPARK-42935 > URL: https://issues.apache.org/jira/browse/SPARK-42935 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Jeff Min >Priority: Major > Labels: pull-request-available > > The Union plan does not take full advantage of its children's output partitionings > when the output partitioning can't match the parent plan's required distribution. For > example, table1 and table2 are both bucketed tables with bucket column id and > bucket number 100. We apply a row_number window function after unioning the two > tables. > {code:sql} > create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO > 100 BUCKETS; > insert into table1 values(1, "s1"); > insert into table1 values(2, "s2"); > > create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO > 100 BUCKETS; > insert into table2 values(1, "s3"); > > set spark.sql.shuffle.partitions=100; > explain select *, row_number() over(partition by id order by name desc) > id_row_number from (select * from table1 union all select * from > table2);{code} > The physical plan is > {code:bash} > AdaptiveSparkPlan isFinalPlan=false > +- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, > specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS > id_row_number#28, id#35, name#36 DESC NULLS LAST > +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0 > +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, > [plan_id=88] > +- Union > :- FileScan csv spark_catalog.default.table1id#35,name#36 > +- FileScan csv spark_catalog.default.table2id#37,name#38 {code} > > Although the two tables are bucketed by the id column, there's still an exchange > plan after the union. The reason is that the union plan's output partitioning is null. > We can introduce a new idea to optimize the exchange plan: > # First, introduce a new RDD that consists of parent RDDs with the same > number of partitions. The ith partition corresponds to the ith partition of each > parent RDD. > # Then push the required distribution down to the union plan's children. If any child's > output partitioning matches the required distribution, we can eliminate that > child's shuffle. > After these changes, the physical plan contains no exchange > {code:bash} > AdaptiveSparkPlan isFinalPlan=false > +- Window row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, > specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS > id_row_number#0, id#7, name#8 DESC NULLS LAST > +- Sort id#7 ASC NULLS FIRST, name#8 DESC NULLS LAST, false, 0 > +- UnionZip ClusteredDistribution(ArrayBuffer(id#7),false,None), > ClusteredDistribution(ArrayBuffer(id#9),false,None), hashpartitioning(id#7, > 200) > :- FileScan csv spark_catalog.default.table1id#7,name#8 > +- FileScan csv spark_catalog.default.table2id#9,name#10 {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
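The proposal above, which zips the ith partitions of co-partitioned children instead of shuffling, can be modeled outside Spark with plain lists standing in for RDD partitions. `union_zip` below is a hypothetical sketch of that idea, not Spark's actual UnionZip operator.

```python
def union_zip(*partitioned):
    """Zip the ith partitions of several co-partitioned datasets.

    Each argument is a list of partitions (a partition being a list of
    rows). All inputs must have the same partition count, mirroring the
    requirement that every child's output partitioning satisfies the
    parent's required distribution.
    """
    sizes = {len(p) for p in partitioned}
    if len(sizes) != 1:
        raise ValueError("children must have the same number of partitions")
    # The ith output partition is the concatenation of every child's
    # ith partition -- no shuffle is needed.
    return [sum((child[i] for child in partitioned), [])
            for i in range(sizes.pop())]
```

With table1 and table2 both bucketed by id into the same number of buckets, rows with equal id already sit in matching partitions, so the zipped union preserves the hash partitioning the window operator requires.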
[jira] [Updated] (SPARK-42935) Optimize shuffle for union Spark plan
[ https://issues.apache.org/jira/browse/SPARK-42935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Min updated SPARK-42935: - Labels: pull-request-available (was: ) > Optimize shuffle for union Spark plan > > > Key: SPARK-42935 > URL: https://issues.apache.org/jira/browse/SPARK-42935 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Jeff Min >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > The Union plan does not take full advantage of its children's output partitionings > when the output partitioning can't match the parent plan's required distribution. For > example, table1 and table2 are both bucketed tables with bucket column id and > bucket number 100. We apply a row_number window function after unioning the two > tables. > {code:sql} > create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO > 100 BUCKETS; > insert into table1 values(1, "s1"); > insert into table1 values(2, "s2"); > > create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO > 100 BUCKETS; > insert into table2 values(1, "s3"); > > set spark.sql.shuffle.partitions=100; > explain select *, row_number() over(partition by id order by name desc) > id_row_number from (select * from table1 union all select * from > table2);{code} > The physical plan is > {code:bash} > AdaptiveSparkPlan isFinalPlan=false > +- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, > specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS > id_row_number#28, id#35, name#36 DESC NULLS LAST > +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0 > +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, > [plan_id=88] > +- Union > :- FileScan csv spark_catalog.default.table1id#35,name#36 > +- FileScan csv spark_catalog.default.table2id#37,name#38 {code} > > Although the two tables are bucketed by the id column, there's still an exchange > plan after the union. The reason is that the union plan's output partitioning is null. > We can introduce a new idea to optimize the exchange plan: > # First, introduce a new RDD that consists of parent RDDs with the same > number of partitions. The ith partition corresponds to the ith partition of each > parent RDD. > # Then push the required distribution down to the union plan's children. If any child's > output partitioning matches the required distribution, we can eliminate that > child's shuffle. > After these changes, the physical plan contains no exchange > {code:bash} > AdaptiveSparkPlan isFinalPlan=false > +- Window row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, > specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS > id_row_number#0, id#7, name#8 DESC NULLS LAST > +- Sort id#7 ASC NULLS FIRST, name#8 DESC NULLS LAST, false, 0 > +- UnionZip ClusteredDistribution(ArrayBuffer(id#7),false,None), > ClusteredDistribution(ArrayBuffer(id#9),false,None), hashpartitioning(id#7, > 200) > :- FileScan csv spark_catalog.default.table1id#7,name#8 > +- FileScan csv spark_catalog.default.table2id#9,name#10 {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42935) Optimize shuffle for union Spark plan
[ https://issues.apache.org/jira/browse/SPARK-42935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Min updated SPARK-42935: - Description: The Union plan does not take full advantage of its children's output partitionings when the output partitioning can't match the parent plan's required distribution. For example, table1 and table2 are both bucketed tables with bucket column id and bucket number 100. We apply a row_number window function after unioning the two tables. {code:sql} create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 BUCKETS; insert into table1 values(1, "s1"); insert into table1 values(2, "s2"); create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 BUCKETS; insert into table2 values(1, "s3"); set spark.sql.shuffle.partitions=100; explain select *, row_number() over(partition by id order by name desc) id_row_number from (select * from table1 union all select * from table2);{code} The physical plan is {code:bash} AdaptiveSparkPlan isFinalPlan=false +- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS id_row_number#28, id#35, name#36 DESC NULLS LAST +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0 +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88] +- Union :- FileScan csv spark_catalog.default.table1id#35,name#36 +- FileScan csv spark_catalog.default.table2id#37,name#38 {code} Although the two tables are bucketed by the id column, there's still an exchange plan after the union. The reason is that the union plan's output partitioning is null. We can introduce a new idea to optimize the exchange plan: # First, introduce a new RDD that consists of parent RDDs with the same number of partitions. The ith partition corresponds to the ith partition of each parent RDD. # Then push the required distribution down to the union plan's children. If any child's output partitioning matches the required distribution, we can eliminate that child's shuffle. After these changes, the physical plan contains no exchange {code:bash} AdaptiveSparkPlan isFinalPlan=false +- Window row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS id_row_number#0, id#7, name#8 DESC NULLS LAST +- Sort id#7 ASC NULLS FIRST, name#8 DESC NULLS LAST, false, 0 +- UnionZip ClusteredDistribution(ArrayBuffer(id#7),false,None), ClusteredDistribution(ArrayBuffer(id#9),false,None), hashpartitioning(id#7, 200) :- FileScan csv spark_catalog.default.table1id#7,name#8 +- FileScan csv spark_catalog.default.table2id#9,name#10 {code} was: Union plan does not take full advantage of children plan output partitionings when output partitoning can't match parent plan's required distribution. For example, Table1 and table2 are all bucketed table with bucket column id and bucket number 100. We will do row_number window function after union the two tables. 
{code:sql} create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 BUCKETS; insert into table1 values(1, "s1"); insert into table1 values(2, "s2"); create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 BUCKETS; insert into table2 values(1, "s3"); set spark.sql.shuffle.partitions=100; explain select *, row_number() over(partition by id order by name desc) id_row_number from (select * from table1 union all select * from table2);{code} The physical plan is {code:bash} AdaptiveSparkPlan isFinalPlan=false +- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS id_row_number#28, id#35, name#36 DESC NULLS LAST +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0 +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88] +- Union :- FileScan csv spark_catalog.default.table1id#35,name#36 +- FileScan csv spark_catalog.default.table2id#37,name#38 {code} Although the two tables are bucketed by id column, there's still a exchange plan after union.The reason is that union plan's output partitioning is null. We can indroduce a new idea to optimize exchange plan: # First introduce a new RDD, it consists of parent rdds that has the same partition size. The ith parttition corresponds to ith partition of each parent rdd. # Then push the required distribution to union plan's children. If any child output partitioning matches the required distribution , we can reduce this child shuffle operation. After doing these, the physical plan is {code:bash} AdaptiveSparkPlan isFinalPlan=false +- Window row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, specifiedwindowframe(RowFrame,
[jira] [Created] (SPARK-42942) Support coalesce table cache stage partitions
XiDuo You created SPARK-42942: - Summary: Support coalesce table cache stage partitions Key: SPARK-42942 URL: https://issues.apache.org/jira/browse/SPARK-42942 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You If people cache a plan that holds some small partitions, we can coalesce them, just as we do for shuffle partitions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
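A greedy, size-based grouping, similar in spirit to what adaptive execution does for small shuffle partitions, could look like the sketch below. `coalesce_partitions` and the byte-size inputs are hypothetical illustrations, not Spark's implementation.

```python
def coalesce_partitions(sizes, target_bytes):
    """Greedily group adjacent small partitions until each group reaches
    roughly target_bytes; returns lists of partition indices per group."""
    groups, current, current_size = [], [], 0
    for i, size in enumerate(sizes):
        current.append(i)
        current_size += size
        if current_size >= target_bytes:
            # This group is big enough; start a new one.
            groups.append(current)
            current, current_size = [], 0
    if current:
        # Leftover small partitions form the final group.
        groups.append(current)
    return groups
```

Each output group would then be read as a single coalesced partition of the cached plan.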
[jira] [Created] (SPARK-42941) Add support for streaming listener.
Raghu Angadi created SPARK-42941: Summary: Add support for streaming listener. Key: SPARK-42941 URL: https://issues.apache.org/jira/browse/SPARK-42941 Project: Spark Issue Type: Task Components: Connect, Structured Streaming Affects Versions: 3.5.0 Reporter: Raghu Angadi Add support for the streaming listener in Python. This likely requires a design doc to hash out the details. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42940) Session management support streaming connect
Raghu Angadi created SPARK-42940: Summary: Session management support streaming connect Key: SPARK-42940 URL: https://issues.apache.org/jira/browse/SPARK-42940 Project: Spark Issue Type: Task Components: Connect, Structured Streaming Affects Versions: 3.5.0 Reporter: Raghu Angadi Add session support for streaming jobs. E.g. a session should stay alive while a streaming job is alive. There might be more complex scenarios, like what happens when the client loses track of the session. Such semantics would be handled as part of session semantics across Spark Connect (including streaming). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42939) Streaming connect core API
Raghu Angadi created SPARK-42939: Summary: Streaming connect core API Key: SPARK-42939 URL: https://issues.apache.org/jira/browse/SPARK-42939 Project: Spark Issue Type: Task Components: Connect, Structured Streaming Affects Versions: 3.5.0 Reporter: Raghu Angadi This adds support for the majority of the streaming API, including query interactions. It does not include advanced features like the custom streaming listener and `foreachBatch()`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42937) Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled
[ https://issues.apache.org/jira/browse/SPARK-42937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705702#comment-17705702 ] Bruce Robbins commented on SPARK-42937: --- PR at https://github.com/apache/spark/pull/40569 > Join with subquery in condition can fail with wholestage codegen and adaptive > execution disabled > > > Key: SPARK-42937 > URL: https://issues.apache.org/jira/browse/SPARK-42937 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > The below left outer join gets an error: > {noformat} > create or replace temp view v1 as > select * from values > (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), > (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), > (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) > as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, > value9, value10); > create or replace temp view v2 as > select * from values > (1, 2), > (3, 8), > (7, 9) > as v2(a, b); > create or replace temp view v3 as > select * from values > (3), > (8) > as v3(col1); > set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100 > set spark.sql.adaptive.enabled=false; > select * > from v1 > left outer join v2 > on key = a > and key in (select col1 from v3); > {noformat} > The join fails during predicate codegen: > {noformat} > 23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to > interpreter mode > java.lang.IllegalArgumentException: requirement failed: input[0, int, false] > IN subquery#34 has not finished > at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144) > at > org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201) > at scala.Option.getOrElse(Option.scala:189) > at > 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278) > at scala.collection.immutable.List.map(List.scala:293) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33) > at > org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73) > at > org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70) > at > org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51) > at > org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86) > at > org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146) > at > org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40) > {noformat} > It fails again after fallback to interpreter mode: > {noformat} > 23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7) > java.lang.IllegalArgumentException: requirement failed: input[0, int, false] > IN subquery#34 has not finished > at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144) > at > org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151) > at > 
org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52) > at > org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146) > at > org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146) > at > org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205) > {noformat} > Both the predicate codegen and the evaluation fail for the same reason: > {{PlanSubqueries}} creates {{InSubqueryExec}} with {{shouldBroadcast=false}}. > The driver waits for the subquery to finish, but it's the executor that uses > the results of the subquery (for predicate codegen
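The failure mode described above, where the driver prepares the subquery result but a non-broadcast InSubqueryExec deserialized on an executor never sees it, can be modeled with a tiny stand-in class. `MiniInSubquery` is a hypothetical illustration of the invariant that `prepareResult` enforces, not Spark's actual class.

```python
class MiniInSubquery:
    """Toy stand-in for InSubqueryExec's 'has not finished' check.

    Hypothetical model: the names and behavior are illustrative only.
    """
    def __init__(self, should_broadcast):
        # With shouldBroadcast=True the subquery result would travel to
        # executors via a broadcast variable; with False it only lives in
        # the local field of whichever copy called update_result().
        self.should_broadcast = should_broadcast
        self.result = None

    def update_result(self, values):
        self.result = set(values)

    def eval(self, value):
        if self.result is None:
            # Mirrors "requirement failed: ... IN subquery has not finished"
            raise RuntimeError("IN subquery has not finished")
        return value in self.result

# Driver-side copy: the driver waits for the subquery and fills in the
# result, so evaluation works there.
driver_side = MiniInSubquery(should_broadcast=False)
driver_side.update_result([3, 8])
assert driver_side.eval(3)

# Executor-side copy: deserialized fresh without a broadcast result, so
# predicate evaluation would fail as in the stack traces above.
executor_side = MiniInSubquery(should_broadcast=False)
```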
[jira] [Resolved] (SPARK-42382) Upgrade `cyclonedx-maven-plugin` to 2.7.5
[ https://issues.apache.org/jira/browse/SPARK-42382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42382. --- Resolution: Won't Fix We will revisit this when we are able to find a valid combination of Maven and cyclonedx-maven-plugin with different versions in the future. > Upgrade `cyclonedx-maven-plugin` to 2.7.5 > - > > Key: SPARK-42382 > URL: https://issues.apache.org/jira/browse/SPARK-42382 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > > [https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.4] > [https://github.com/CycloneDX/cyclonedx-maven-plugin/releases/tag/cyclonedx-maven-plugin-2.7.5] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42380) Upgrade maven to 3.9.0
[ https://issues.apache.org/jira/browse/SPARK-42380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42380. --- Resolution: Won't Fix We will revisit this when we are able to find a valid combination of Maven and cyclonedx-maven-plugin again in the future. > Upgrade maven to 3.9.0 > -- > > Key: SPARK-42380 > URL: https://issues.apache.org/jira/browse/SPARK-42380 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > > {code:java} > [ERROR] An error occurred attempting to read POM > org.codehaus.plexus.util.xml.pull.XmlPullParserException: UTF-8 BOM plus xml > decl of ISO-8859-1 is incompatible (position: START_DOCUMENT seen version="1.0" encoding="ISO-8859-1"... @1:42) > at org.codehaus.plexus.util.xml.pull.MXParser.parseXmlDeclWithVersion > (MXParser.java:3423) > at org.codehaus.plexus.util.xml.pull.MXParser.parseXmlDecl > (MXParser.java:3345) > at org.codehaus.plexus.util.xml.pull.MXParser.parsePI (MXParser.java:3197) > at org.codehaus.plexus.util.xml.pull.MXParser.parseProlog > (MXParser.java:1828) > at org.codehaus.plexus.util.xml.pull.MXParser.nextImpl > (MXParser.java:1757) > at org.codehaus.plexus.util.xml.pull.MXParser.next (MXParser.java:1375) > at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read > (MavenXpp3Reader.java:3940) > at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read > (MavenXpp3Reader.java:612) > at org.apache.maven.model.io.xpp3.MavenXpp3Reader.read > (MavenXpp3Reader.java:627) > at org.cyclonedx.maven.BaseCycloneDxMojo.readPom > (BaseCycloneDxMojo.java:759) > at org.cyclonedx.maven.BaseCycloneDxMojo.readPom > (BaseCycloneDxMojo.java:746) > at org.cyclonedx.maven.BaseCycloneDxMojo.retrieveParentProject > (BaseCycloneDxMojo.java:694) > at org.cyclonedx.maven.BaseCycloneDxMojo.getClosestMetadata > (BaseCycloneDxMojo.java:524) > at org.cyclonedx.maven.BaseCycloneDxMojo.convert > (BaseCycloneDxMojo.java:481) > at 
org.cyclonedx.maven.CycloneDxMojo.execute (CycloneDxMojo.java:70) > at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo > (DefaultBuildPluginManager.java:126) > at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 > (MojoExecutor.java:342) > at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute > (MojoExecutor.java:330) > at org.apache.maven.lifecycle.internal.MojoExecutor.execute > (MojoExecutor.java:213) > at org.apache.maven.lifecycle.internal.MojoExecutor.execute > (MojoExecutor.java:175) > at org.apache.maven.lifecycle.internal.MojoExecutor.access$000 > (MojoExecutor.java:76) > at org.apache.maven.lifecycle.internal.MojoExecutor$1.run > (MojoExecutor.java:163) > at org.apache.maven.plugin.DefaultMojosExecutionStrategy.execute > (DefaultMojosExecutionStrategy.java:39) > at org.apache.maven.lifecycle.internal.MojoExecutor.execute > (MojoExecutor.java:160) > at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject > (LifecycleModuleBuilder.java:105) > at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject > (LifecycleModuleBuilder.java:73) > at > org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build > (SingleThreadedBuilder.java:53) > at org.apache.maven.lifecycle.internal.LifecycleStarter.execute > (LifecycleStarter.java:118) > at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:260) > at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:172) > at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:100) > at org.apache.maven.cli.MavenCli.execute (MavenCli.java:821) > at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:270) > at org.apache.maven.cli.MavenCli.main (MavenCli.java:192) > at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke > (NativeMethodAccessorImpl.java:62) > at sun.reflect.DelegatingMethodAccessorImpl.invoke > (DelegatingMethodAccessorImpl.java:43) > at 
java.lang.reflect.Method.invoke (Method.java:498) > at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced > (Launcher.java:282) > at org.codehaus.plexus.classworlds.launcher.Launcher.launch > (Launcher.java:225) > at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode > (Launcher.java:406) > at org.codehaus.plexus.classworlds.launcher.Launcher.main > (Launcher.java:347) > {code} > An existing problem -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
[jira] [Created] (SPARK-42938) Structured Streaming with Spark Connect
Raghu Angadi created SPARK-42938: Summary: Structured Streaming with Spark Connect Key: SPARK-42938 URL: https://issues.apache.org/jira/browse/SPARK-42938 Project: Spark Issue Type: Epic Components: Connect, Structured Streaming Affects Versions: 3.5.0 Reporter: Raghu Angadi Spark Connect (SPARK-39375) should support Structured Streaming. This Epic covers various tasks for adding streaming support. The current focus is on the {*}Python API{*}; unless otherwise stated, the tasks below are for the Python API. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42906) Replace a starting digit with `x` in resource name prefix
[ https://issues.apache.org/jira/browse/SPARK-42906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42906: -- Summary: Replace a starting digit with `x` in resource name prefix (was: Replace a starting digit with x in resource name prefix) > Replace a starting digit with `x` in resource name prefix > - > > Key: SPARK-42906 > URL: https://issues.apache.org/jira/browse/SPARK-42906 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.3, 3.3.2, 3.4.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Fix For: 3.2.4, 3.3.3, 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42906) Replace a starting digit with x in resource name prefix
[ https://issues.apache.org/jira/browse/SPARK-42906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42906: -- Summary: Replace a starting digit with x in resource name prefix (was: Resource name prefix should start with an alphabetic character) > Replace a starting digit with x in resource name prefix > --- > > Key: SPARK-42906 > URL: https://issues.apache.org/jira/browse/SPARK-42906 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.3, 3.3.2, 3.4.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Fix For: 3.2.4, 3.3.3, 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42906) Resource name prefix should start with an alphabetic character
[ https://issues.apache.org/jira/browse/SPARK-42906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42906: - Assignee: Cheng Pan > Resource name prefix should start with an alphabetic character > -- > > Key: SPARK-42906 > URL: https://issues.apache.org/jira/browse/SPARK-42906 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.3, 3.3.2, 3.4.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42906) Resource name prefix should start with an alphabetic character
[ https://issues.apache.org/jira/browse/SPARK-42906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42906. --- Fix Version/s: 3.3.3 3.2.4 3.4.1 Resolution: Fixed Issue resolved by pull request 40533 [https://github.com/apache/spark/pull/40533] > Resource name prefix should start with an alphabetic character > -- > > Key: SPARK-42906 > URL: https://issues.apache.org/jira/browse/SPARK-42906 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.3, 3.3.2, 3.4.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Fix For: 3.3.3, 3.2.4, 3.4.1 > >
[jira] [Updated] (SPARK-42906) Resource name prefix should start with an alphabetic character
[ https://issues.apache.org/jira/browse/SPARK-42906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42906: -- Affects Version/s: 3.3.2 3.4.0 > Resource name prefix should start with an alphabetic character > -- > > Key: SPARK-42906 > URL: https://issues.apache.org/jira/browse/SPARK-42906 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.3, 3.3.2, 3.4.0 >Reporter: Cheng Pan >Priority: Major >
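Kubernetes resource names must be valid DNS labels, which cannot begin with a digit; the fix above replaces a starting digit with `x`. The sketch below is a hypothetical illustration of that sanitization rule (the function name is invented; Spark's actual implementation lives in its Kubernetes resource-naming code and is written in Scala):

```python
def sanitize_resource_name_prefix(prefix: str) -> str:
    """Replace a starting digit with 'x' so the resource name prefix
    begins with an alphabetic character (illustrative sketch only)."""
    if prefix and prefix[0].isdigit():
        return "x" + prefix[1:]
    return prefix

# A prefix derived from an app name like "3spark-pi" becomes "xspark-pi",
# satisfying the "must start with an alphabetic character" constraint.
```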
[jira] [Updated] (SPARK-42937) Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled
[ https://issues.apache.org/jira/browse/SPARK-42937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-42937: -- Affects Version/s: 3.4.0 > Join with subquery in condition can fail with wholestage codegen and adaptive > execution disabled > > > Key: SPARK-42937 > URL: https://issues.apache.org/jira/browse/SPARK-42937 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > The below left outer join gets an error: > {noformat} > create or replace temp view v1 as > select * from values > (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), > (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), > (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) > as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, > value9, value10); > create or replace temp view v2 as > select * from values > (1, 2), > (3, 8), > (7, 9) > as v2(a, b); > create or replace temp view v3 as > select * from values > (3), > (8) > as v3(col1); > set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100 > set spark.sql.adaptive.enabled=false; > select * > from v1 > left outer join v2 > on key = a > and key in (select col1 from v3); > {noformat} > The join fails during predicate codegen: > {noformat} > 23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to > interpreter mode > java.lang.IllegalArgumentException: requirement failed: input[0, int, false] > IN subquery#34 has not finished > at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144) > at > org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196) > at > 
org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278) > at scala.collection.immutable.List.map(List.scala:293) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33) > at > org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73) > at > org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70) > at > org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51) > at > org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86) > at > org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146) > at > org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40) > {noformat} > It fails again after fallback to interpreter mode: > {noformat} > 23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7) > java.lang.IllegalArgumentException: requirement failed: input[0, int, false] > IN subquery#34 has not finished > at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144) > at > org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151) > at > org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52) > at > 
org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146) > at > org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146) > at > org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205) > {noformat} > Both the predicate codegen and the evaluation fail for the same reason: > {{PlanSubqueries}} creates {{InSubqueryExec}} with {{shouldBroadcast=false}}. > The driver waits for the subquery to finish, but it's the executor that uses > the results of the subquery (for predicate codegen or evaluation). Because > {{shouldBroadcast}} is set to
false, the result is stored in a transient field ({{InSubqueryExec#result}}), so the result of the subquery is not serialized when the {{InSubqueryExec}} instance is sent to the executor.
[jira] [Updated] (SPARK-42937) Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled
[ https://issues.apache.org/jira/browse/SPARK-42937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-42937: -- Affects Version/s: 3.3.2 > Join with subquery in condition can fail with wholestage codegen and adaptive > execution disabled > > > Key: SPARK-42937 > URL: https://issues.apache.org/jira/browse/SPARK-42937 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > The below left outer join gets an error: > {noformat} > create or replace temp view v1 as > select * from values > (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), > (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), > (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) > as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, > value9, value10); > create or replace temp view v2 as > select * from values > (1, 2), > (3, 8), > (7, 9) > as v2(a, b); > create or replace temp view v3 as > select * from values > (3), > (8) > as v3(col1); > set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100 > set spark.sql.adaptive.enabled=false; > select * > from v1 > left outer join v2 > on key = a > and key in (select col1 from v3); > {noformat} > The join fails during predicate codegen: > {noformat} > 23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to > interpreter mode > java.lang.IllegalArgumentException: requirement failed: input[0, int, false] > IN subquery#34 has not finished > at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144) > at > org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196) > at > 
org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278) > at scala.collection.immutable.List.map(List.scala:293) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33) > at > org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73) > at > org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70) > at > org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51) > at > org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86) > at > org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146) > at > org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40) > {noformat} > It fails again after fallback to interpreter mode: > {noformat} > 23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7) > java.lang.IllegalArgumentException: requirement failed: input[0, int, false] > IN subquery#34 has not finished > at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144) > at > org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151) > at > org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52) > at > 
org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146) > at > org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146) > at > org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205) > {noformat} > Both the predicate codegen and the evaluation fail for the same reason: > {{PlanSubqueries}} creates {{InSubqueryExec}} with {{shouldBroadcast=false}}. > The driver waits for the subquery to finish, but it's the executor that uses > the results of the subquery (for predicate codegen or evaluation). Because > {{shouldBroadcast}} is set to false, the
result is stored in a transient field ({{InSubqueryExec#result}}), so the result of the subquery is not serialized when the {{InSubqueryExec}} instance is sent to the executor.
[jira] [Created] (SPARK-42937) Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled
Bruce Robbins created SPARK-42937: - Summary: Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled Key: SPARK-42937 URL: https://issues.apache.org/jira/browse/SPARK-42937 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Bruce Robbins The below left outer join gets an error: {noformat} create or replace temp view v1 as select * from values (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, value9, value10); create or replace temp view v2 as select * from values (1, 2), (3, 8), (7, 9) as v2(a, b); create or replace temp view v3 as select * from values (3), (8) as v3(col1); set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100 set spark.sql.adaptive.enabled=false; select * from v1 left outer join v2 on key = a and key in (select col1 from v3); {noformat} The join fails during predicate codegen: {noformat} 23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to interpreter mode java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN subquery#34 has not finished at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144) at org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196) at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278) at scala.collection.immutable.List.map(List.scala:293) at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278) at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33) at org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73) at org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70) at org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51) at org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86) at org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146) at org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40) {noformat} It fails again after fallback to interpreter mode: {noformat} 23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7) java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN subquery#34 has not finished at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144) at org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151) at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52) at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146) at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146) at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205) {noformat} Both the predicate codegen and the evaluation fail for the same reason: {{PlanSubqueries}} creates 
{{InSubqueryExec}} with {{shouldBroadcast=false}}. The driver waits for the subquery to finish, but it's the executor that uses the results of the subquery (for predicate codegen or evaluation). Because {{shouldBroadcast}} is set to false, the result is stored in a transient field ({{InSubqueryExec#result}}), so the result of the subquery is not serialized when the {{InSubqueryExec}} instance is sent to the executor. When wholestage codegen is enabled, the predicate codegen happens on the driver, so the subquery's result is available. When adaptive execution is enabled, {{PlanAdaptiveSubqueries}} always sets {{shouldBroadcast=true}}, so the
subquery result is broadcast and is therefore available on the executors. The failure consequently occurs only when both wholestage codegen and adaptive execution are disabled.
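The root cause, a result computed on the driver but held in a field that is excluded from serialization, can be reproduced in miniature with plain Python and `pickle`. The class below is purely illustrative (invented names, not Spark code); it mimics how `InSubqueryExec#result`, being transient, is absent on the executor when `shouldBroadcast=false`:

```python
import pickle

class InSubqueryLike:
    """Mimics a plan node whose computed subquery result lives in a
    transient (non-serialized) field, as described above."""

    def __init__(self):
        self.result = None  # populated on the "driver" once the subquery finishes

    def update_result(self, values):
        self.result = values

    def __getstate__(self):
        # Exclude 'result' from serialization, like a @transient field in Scala.
        state = self.__dict__.copy()
        state["result"] = None
        return state

    def prepare_result(self):
        if self.result is None:
            raise ValueError("requirement failed: subquery has not finished")
        return self.result

driver_node = InSubqueryLike()
driver_node.update_result([3, 8])                        # driver waited for the subquery
executor_node = pickle.loads(pickle.dumps(driver_node))  # what the executor receives
# driver_node.prepare_result() succeeds, but executor_node.prepare_result()
# raises, mirroring the "IN subquery#34 has not finished" failures above.
```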
[jira] [Resolved] (SPARK-36999) Document the command ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/SPARK-36999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-36999. -- Resolution: Duplicate > Document the command ALTER TABLE RECOVER PARTITIONS > --- > > Key: SPARK-36999 > URL: https://issues.apache.org/jira/browse/SPARK-36999 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > Labels: starter > > Update the page > [https://spark.apache.org/docs/3.1.2/sql-ref-syntax-ddl-alter-table.html] > and document the command ALTER TABLE RECOVER PARTITIONS
[jira] [Commented] (SPARK-42936) Unresolved having at the end of analysis when using with LCA with the having clause that can be resolved directly by its child Aggregate
[ https://issues.apache.org/jira/browse/SPARK-42936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705481#comment-17705481 ] Xinyi Yu commented on SPARK-42936: -- Created PR fixing the issue: https://github.com/apache/spark/pull/40558 > Unresolved having at the end of analysis when using with LCA with the having > clause that can be resolved directly by its child Aggregate > > > Key: SPARK-42936 > URL: https://issues.apache.org/jira/browse/SPARK-42936 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Xinyi Yu >Priority: Major > > {code:java} > select sum(value1) as total_1, total_1 > from values(1, 'name', 100, 50) AS data(id, name, value1, value2) > having total_1 > 0 > SparkException: [INTERNAL_ERROR] Found the unresolved operator: > 'UnresolvedHaving (total_1#353L > cast(0 as bigint)) {code} > To trigger the issue, the having condition needs to be (i.e., resolvable as) an > attribute in the select. > Without the LCA {{total_1}}, the query works fine.
[jira] [Created] (SPARK-42936) Unresolved having at the end of analysis when using with LCA with the having clause that can be resolved directly by its child Aggregate
Xinyi Yu created SPARK-42936: Summary: Unresolved having at the end of analysis when using with LCA with the having clause that can be resolved directly by its child Aggregate Key: SPARK-42936 URL: https://issues.apache.org/jira/browse/SPARK-42936 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Xinyi Yu {code:java} select sum(value1) as total_1, total_1 from values(1, 'name', 100, 50) AS data(id, name, value1, value2) having total_1 > 0 SparkException: [INTERNAL_ERROR] Found the unresolved operator: 'UnresolvedHaving (total_1#353L > cast(0 as bigint)) {code} To trigger the issue, the having condition needs to be (i.e., resolvable as) an attribute in the select. Without the LCA {{total_1}}, the query works fine.
[jira] [Resolved] (SPARK-42934) Testing OrcEncryptionSuite using maven is always skipped
[ https://issues.apache.org/jira/browse/SPARK-42934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42934. --- Fix Version/s: 3.3.3 3.2.4 3.4.1 Resolution: Fixed Issue resolved by pull request 40566 [https://github.com/apache/spark/pull/40566] > Testing OrcEncryptionSuite using maven is always skipped > > > Key: SPARK-42934 > URL: https://issues.apache.org/jira/browse/SPARK-42934 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.2.3, 3.3.2, 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.3, 3.2.4, 3.4.1 > >
[jira] [Updated] (SPARK-42934) Testing OrcEncryptionSuite using maven is always skipped
[ https://issues.apache.org/jira/browse/SPARK-42934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42934: -- Affects Version/s: 3.3.2 3.2.3 3.4.0 (was: 3.5.0) > Testing OrcEncryptionSuite using maven is always skipped > > > Key: SPARK-42934 > URL: https://issues.apache.org/jira/browse/SPARK-42934 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.2.3, 3.3.2, 3.4.0 >Reporter: Yang Jie >Priority: Minor >
[jira] [Assigned] (SPARK-42934) Testing OrcEncryptionSuite using maven is always skipped
[ https://issues.apache.org/jira/browse/SPARK-42934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42934: - Assignee: Yang Jie > Testing OrcEncryptionSuite using maven is always skipped > > > Key: SPARK-42934 > URL: https://issues.apache.org/jira/browse/SPARK-42934 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.2.3, 3.3.2, 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor >
[jira] [Updated] (SPARK-42934) Testing OrcEncryptionSuite using maven is always skipped
[ https://issues.apache.org/jira/browse/SPARK-42934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42934: -- Priority: Minor (was: Major) > Testing OrcEncryptionSuite using maven is always skipped > > > Key: SPARK-42934 > URL: https://issues.apache.org/jira/browse/SPARK-42934 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor >
[jira] [Assigned] (SPARK-42930) Change the access scope of `ProtobufSerDe` related implementations to `private[spark]`
[ https://issues.apache.org/jira/browse/SPARK-42930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42930: - Assignee: Yang Jie > Change the access scope of `ProtobufSerDe` related implementations to > `private[spark]` > -- > > Key: SPARK-42930 > URL: https://issues.apache.org/jira/browse/SPARK-42930 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major >
[jira] [Resolved] (SPARK-42930) Change the access scope of `ProtobufSerDe` related implementations to `private[spark]`
[ https://issues.apache.org/jira/browse/SPARK-42930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42930. --- Fix Version/s: 3.4.1 Resolution: Fixed Issue resolved by pull request 40560 [https://github.com/apache/spark/pull/40560] > Change the access scope of `ProtobufSerDe` related implementations to > `private[spark]` > -- > > Key: SPARK-42930 > URL: https://issues.apache.org/jira/browse/SPARK-42930 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.1 > >
[jira] [Resolved] (SPARK-42913) Upgrade Hadoop to 3.3.5
[ https://issues.apache.org/jira/browse/SPARK-42913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42913. --- Fix Version/s: 3.5.0 Assignee: Yang Jie Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/39124 > Upgrade Hadoop to 3.3.5 > --- > > Key: SPARK-42913 > URL: https://issues.apache.org/jira/browse/SPARK-42913 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > > * > [https://hadoop.apache.org/docs/r3.3.5/hadoop-project-dist/hadoop-common/release/3.3.5/CHANGELOG.3.3.5.html] > * > [https://hadoop.apache.org/docs/r3.3.5/hadoop-project-dist/hadoop-common/release/3.3.5/RELEASENOTES.3.3.5.html] > * https://hadoop.apache.org/docs/r3.3.5/
[jira] [Updated] (SPARK-42935) Optimize shuffle for union Spark plan
[ https://issues.apache.org/jira/browse/SPARK-42935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Min updated SPARK-42935: - Description: Union plan does not take full advantage of its children's output partitioning when that partitioning can't match the parent plan's required distribution. For example, table1 and table2 are both bucketed tables with bucket column id and bucket number 100. We run a row_number window function after unioning the two tables. {code:sql} create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 BUCKETS; insert into table1 values(1, "s1"); insert into table1 values(2, "s2"); create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 BUCKETS; insert into table2 values(1, "s3"); set spark.sql.shuffle.partitions=100; explain select *, row_number() over(partition by id order by name desc) id_row_number from (select * from table1 union all select * from table2);{code} The physical plan is {code:bash} AdaptiveSparkPlan isFinalPlan=false +- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS id_row_number#28, id#35, name#36 DESC NULLS LAST +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0 +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88] +- Union :- FileScan csv spark_catalog.default.table1id#35,name#36 +- FileScan csv spark_catalog.default.table2id#37,name#38 {code} Although the two tables are bucketed by the id column, there is still an exchange after the union. The reason is that the union plan's output partitioning is null. We can introduce a new idea to optimize away the exchange: # First, introduce a new RDD composed of parent RDDs that all have the same number of partitions; its ith partition corresponds to the ith partition of each parent RDD. # Then, push the required distribution down to the union plan's children. If any child's output partitioning matches the required distribution, we can eliminate that child's shuffle. After doing this, the physical plan is {code:bash} AdaptiveSparkPlan isFinalPlan=false +- Window row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS id_row_number#0, id#7, name#8 DESC NULLS LAST +- Sort id#7 ASC NULLS FIRST, name#8 DESC NULLS LAST, false, 0 +- UnionZip ClusteredDistribution(ArrayBuffer(id#7),false,None), ClusteredDistribution(ArrayBuffer(id#9),false,None), hashpartitioning(id#7, 200) :- FileScan csv spark_catalog.default.table1id#7,name#8 +- FileScan csv spark_catalog.default.table2id#9,name#10 {code} was: Union plan does not take full advantage of its children's output partitioning when that partitioning can't match the parent plan's required distribution. For example, table1 and table2 are both bucketed tables with bucket column id and bucket number 100. We run a row_number window function after unioning the two tables. {code:sql} create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 BUCKETS; insert into table1 values(1, "s1"); insert into table1 values(2, "s2"); create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 BUCKETS; insert into table2 values(1, "s3"); set spark.sql.shuffle.partitions=100; explain select *, row_number() over(partition by id order by name desc) id_row_number from (select * from table1 union all select * from table2);{code} The physical plan is {code:bash} AdaptiveSparkPlan isFinalPlan=false +- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS id_row_number#28, id#35, name#36 DESC NULLS LAST +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0 +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88] +- Union :- FileScan csv spark_catalog.default.table1id#35,name#36 +- FileScan csv spark_catalog.default.table2id#37,name#38 {code} Although the two tables are bucketed by the id column, there is still an exchange after the union. The reason is that the union plan's output partitioning is null. We can introduce a new idea to optimize away the exchange: # First, introduce a new RDD composed of parent RDDs that all have the same number of partitions; its ith partition corresponds to the ith partition of each parent RDD. # Then, push the required distribution down to the union plan's children. If any child's output partitioning matches the required distribution, we can eliminate that child's shuffle. After doing this, the physical plan is {code:bash} AdaptiveSparkPlan isFinalPlan=false +- Window row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS
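The proposed UnionZip can be sketched outside Spark in a few lines of Python (hypothetical helper names; the real operator would zip RDD partitions). Two inputs that are already hash-partitioned on the key with the same partition count can be unioned partition-by-partition, preserving the partitioning so no exchange is needed:

```python
def hash_partition(rows, key, num_partitions):
    """Bucket rows by hash of a key column, like hashpartitioning(key, n)."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts

def union_zip(*children):
    """Union children partition-wise: output partition i concatenates
    partition i of every child, so a shared hash partitioning survives."""
    sizes = {len(child) for child in children}
    assert len(sizes) == 1, "children must have the same number of partitions"
    return [sum(parts, []) for parts in zip(*children)]

table1 = [{"id": 1, "name": "s1"}, {"id": 2, "name": "s2"}]
table2 = [{"id": 1, "name": "s3"}]
p1 = hash_partition(table1, "id", 4)
p2 = hash_partition(table2, "id", 4)
unioned = union_zip(p1, p2)
# Each output partition still holds only ids that hash to its index, so a
# window partitioned by id can run directly on it without another shuffle.
```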
[jira] [Updated] (SPARK-42935) Optimize shuffle for union Spark plan
[ https://issues.apache.org/jira/browse/SPARK-42935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Min updated SPARK-42935: - Description: Union plan does not take full advantage of its children's output partitioning when that partitioning can't match the parent plan's required distribution. For example, table1 and table2 are both bucketed tables with bucket column id and bucket number 100. We run a row_number window function after unioning the two tables. {code:sql} create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 BUCKETS; insert into table1 values(1, "s1"); insert into table1 values(2, "s2"); create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 BUCKETS; insert into table2 values(1, "s3"); set spark.sql.shuffle.partitions=100; explain select *, row_number() over(partition by id order by name desc) id_row_number from (select * from table1 union all select * from table2);{code} The physical plan is {code:bash} AdaptiveSparkPlan isFinalPlan=false +- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS id_row_number#28, id#35, name#36 DESC NULLS LAST +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0 +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88] +- Union :- FileScan csv spark_catalog.default.table1id#35,name#36 +- FileScan csv spark_catalog.default.table2id#37,name#38 {code} Although the two tables are bucketed by the id column, there is still an exchange after the union. The reason is that the union plan's output partitioning is null. We can introduce a new idea to optimize away the exchange: # First, introduce a new RDD composed of parent RDDs that all have the same number of partitions; its ith partition corresponds to the ith partition of each parent RDD. # Then, push the required distribution down to the union plan's children. If any child's output partitioning matches the required distribution, we can eliminate that child's shuffle. After doing this, the physical plan is {code:bash} AdaptiveSparkPlan isFinalPlan=false +- Window row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS id_row_number#0, id#7, name#8 DESC NULLS LAST +- Sort id#7 ASC NULLS FIRST, name#8 DESC NULLS LAST, false, 0 +- UnionZip ClusteredDistribution(ArrayBuffer(id#7),false,None), ClusteredDistribution(ArrayBuffer(id#9),false,None), hashpartitioning(id#7, 200) :- FileScan csv spark_catalog.default.table1id#7,name#8 +- FileScan csv spark_catalog.default.table2id#9,name#10 {code} was: Union plan does not take full advantage of its children's output partitioning when that partitioning can't match the parent plan's required distribution. For example, table1 and table2 are both bucketed tables with bucket column id and bucket number 100. We run a row_number window function after unioning the two tables. {code:sql} create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 BUCKETS; insert into table1 values(1, "s1"); insert into table1 values(2, "s2"); create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 BUCKETS; insert into table2 values(1, "s3"); set spark.sql.shuffle.partitions=100; explain select *, row_number() over(partition by id order by name desc) id_row_number from (select * from table1 union all select * from table2);{code} The physical plan is {code:bash} AdaptiveSparkPlan isFinalPlan=false +- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS id_row_number#28, id#35, name#36 DESC NULLS LAST +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0 +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88] +- Union :- FileScan csv spark_catalog.default.table1id#35,name#36 +- FileScan csv spark_catalog.default.table2id#37,name#38 {code} Although the two tables are bucketed by the id column, there is still an exchange after the union. The reason is that the union plan's output partitioning is null. We can
[jira] [Commented] (SPARK-36999) Document the command ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/SPARK-36999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705379#comment-17705379 ] Juanvi Soler commented on SPARK-36999: -- https://github.com/apache/spark/pull/35239 > Document the command ALTER TABLE RECOVER PARTITIONS > --- > > Key: SPARK-36999 > URL: https://issues.apache.org/jira/browse/SPARK-36999 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > Labels: starter > > Update the page > [https://spark.apache.org/docs/3.1.2/sql-ref-syntax-ddl-alter-table.html,] > and document the command ALTER TABLE RECOVER PARTITIONS -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36999) Document the command ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/SPARK-36999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705376#comment-17705376 ] Juanvi Soler commented on SPARK-36999: -- [~maxgekk] can this be closed? > Document the command ALTER TABLE RECOVER PARTITIONS > --- > > Key: SPARK-36999 > URL: https://issues.apache.org/jira/browse/SPARK-36999 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > Labels: starter > > Update the page > [https://spark.apache.org/docs/3.1.2/sql-ref-syntax-ddl-alter-table.html,] > and document the command ALTER TABLE RECOVER PARTITIONS -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42935) Optimize shuffle for union Spark plan
Jeff Min created SPARK-42935: Summary: Optimize shuffle for union Spark plan Key: SPARK-42935 URL: https://issues.apache.org/jira/browse/SPARK-42935 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Jeff Min Fix For: 3.5.0 The Union plan does not take full advantage of its children's output partitionings when the output partitioning cannot match the parent plan's required distribution. For example, table1 and table2 are both tables bucketed by column id into 100 buckets, and we run a row_number window function over the union of the two tables. create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 BUCKETS; insert into table1 values(1, "s1"); insert into table1 values(2, "s2"); create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 BUCKETS; insert into table2 values(1, "s3"); set spark.sql.shuffle.partitions=100; explain select *, row_number() over(partition by id order by name desc) id_row_number from (select * from table1 union all select * from table2); The physical plan is AdaptiveSparkPlan isFinalPlan=false +- Window [row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS id_row_number#28], [id#35], [name#36 DESC NULLS LAST] +- Sort [id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST], false, 0 +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88] +- Union :- FileScan csv spark_catalog.default.table1[id#35,name#36] +- FileScan csv spark_catalog.default.table2[id#37,name#38] Although the two tables are bucketed by the id column, there is still an Exchange after the Union. The reason is that the Union plan's output partitioning is null. We can introduce a new idea to optimize away the exchange: # First, introduce a new RDD that consists of parent RDDs with the same number of partitions; its ith partition corresponds to the ith partition of each parent RDD.
# Then, push the required distribution down to the Union plan's children. If a child's output partitioning matches the required distribution, the shuffle for that child can be avoided. After this change, the physical plan is AdaptiveSparkPlan isFinalPlan=false +- Window [row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS id_row_number#0], [id#7], [name#8 DESC NULLS LAST] +- Sort [id#7 ASC NULLS FIRST, name#8 DESC NULLS LAST], false, 0 +- UnionZip [ClusteredDistribution(ArrayBuffer(id#7),false,None), ClusteredDistribution(ArrayBuffer(id#9),false,None)], hashpartitioning(id#7, 200) :- FileScan csv spark_catalog.default.table1[id#7,name#8] +- FileScan csv spark_catalog.default.table2[id#9,name#10] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
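The "UnionZip" idea above can be sketched outside Spark: if two inputs are already hash-partitioned by the same key into the same number of partitions, concatenating their ith partitions pairwise preserves that partitioning, so a partition-local operation such as row_number over the key stays correct without another shuffle. A minimal Python model of this (the names hash_partition and union_zip are illustrative, not Spark API):

```python
# Toy model of the UnionZip idea: zipping the i-th partitions of two
# datasets that share a hash partitioning preserves that partitioning.

def hash_partition(rows, key, n):
    """Split rows into n partitions by hash of the key column (bucketing)."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

def union_zip(parts_a, parts_b):
    """Zip equal-length partition lists: output i = partition i of a + i of b."""
    assert len(parts_a) == len(parts_b), "children must have equal partition counts"
    return [pa + pb for pa, pb in zip(parts_a, parts_b)]

table1 = [{"id": 1, "name": "s1"}, {"id": 2, "name": "s2"}]
table2 = [{"id": 1, "name": "s3"}]

n = 4  # bucket count shared by both inputs
t1 = hash_partition(table1, "id", n)
t2 = hash_partition(table2, "id", n)
unioned = union_zip(t1, t2)

# Rows with the same id land in the same output partition with no shuffle,
# which is the property the Exchange in the original plan existed to provide.
for i, part in enumerate(unioned):
    for row in part:
        assert hash(row["id"]) % n == i
```

In Spark terms, the zipped RDD is what lets the planner report a non-null output partitioning for the Union and satisfy the Window's required distribution directly.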
[jira] [Commented] (SPARK-36999) Document the command ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/SPARK-36999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705355#comment-17705355 ] Juanvi Soler commented on SPARK-36999: -- This looks already solved [here|https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-alter-table.md#recover-partitions] > Document the command ALTER TABLE RECOVER PARTITIONS > --- > > Key: SPARK-36999 > URL: https://issues.apache.org/jira/browse/SPARK-36999 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > Labels: starter > > Update the page > [https://spark.apache.org/docs/3.1.2/sql-ref-syntax-ddl-alter-table.html,] > and document the command ALTER TABLE RECOVER PARTITIONS -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42934) Testing OrcEncryptionSuite using maven is always skipped
Yang Jie created SPARK-42934: Summary: Testing OrcEncryptionSuite using maven is always skipped Key: SPARK-42934 URL: https://issues.apache.org/jira/browse/SPARK-42934 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 3.5.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42929) make mapInPandas / mapInArrow support "is_barrier"
[ https://issues.apache.org/jira/browse/SPARK-42929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu resolved SPARK-42929. Fix Version/s: 3.5.0 Target Version/s: 3.5.0 Resolution: Fixed > make mapInPandas / mapInArrow support "is_barrier" > -- > > Key: SPARK-42929 > URL: https://issues.apache.org/jira/browse/SPARK-42929 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 3.5.0 > > > make mapInPandas / mapInArrow support "is_barrier" -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14516) Clustering evaluator
[ https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069815#comment-16069815 ] Marco Gaido edited comment on SPARK-14516 at 3/27/23 9:42 AM: -- Hello everybody, I have a proposal for a very efficient Silhouette implementation in a distributed environment. Here you can find the link with all the details of the solution. As soon as I will finish all the implementation and the tests I will post the PR for this: -[https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view|https://drive.google.com/file/d/0B0Hyo__bG_3fdkNvSVNYX2E3ZU0/view].- (please refer to https://arxiv.org/abs/2303.14102). Please tell me if you have any comment, doubt on it. Thanks. was (Author: mgaido): Hello everybody, I have a proposal for a very efficient Silhouette implementation in a distributed environment. Here you can find the link with all the details of the solution. As soon as I will finish all the implementation and the tests I will post the PR for this: https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view. Please tell me if you have any comment, doubt on it. Thanks. > Clustering evaluator > > > Key: SPARK-14516 > URL: https://issues.apache.org/jira/browse/SPARK-14516 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Ruifeng Zheng >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.0 > > > MLlib does not have any general purposed clustering metrics with a ground > truth. > In > [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics), > there are several kinds of metrics for this. > It may be meaningful to add some clustering metrics into MLlib. > This should be added as a {{ClusteringEvaluator}} class of extending > {{Evaluator}} in spark.ml. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14516) Clustering evaluator
[ https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705265#comment-17705265 ] Marco Gaido commented on SPARK-14516: - As there are some issues with the Google Doc sharing, I prepared a paper that explains the metric and its definition, please disregard the previous link and refer to [https://arxiv.org/abs/2303.14102] if you are interested. It would be also easier for me to update it in the future if needed. Thanks. > Clustering evaluator > > > Key: SPARK-14516 > URL: https://issues.apache.org/jira/browse/SPARK-14516 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Ruifeng Zheng >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.0 > > > MLlib does not have any general purposed clustering metrics with a ground > truth. > In > [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics), > there are several kinds of metrics for this. > It may be meaningful to add some clustering metrics into MLlib. > This should be added as a {{ClusteringEvaluator}} class of extending > {{Evaluator}} in spark.ml. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
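For context on the metric under discussion: the silhouette of a point compares its mean intra-cluster distance a with its mean distance b to the nearest other cluster, as (b - a) / max(a, b). The linked paper is about computing this efficiently in a distributed setting; the sketch below is only the naive O(n^2) single-machine definition for 1-D points, not the paper's algorithm:

```python
def silhouette(points, labels):
    """Naive O(n^2) silhouette score for 1-D points; reference definition only."""
    n = len(points)
    scores = []
    for i in range(n):
        same = [points[j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not same:
            scores.append(0.0)  # convention: silhouette is 0 for singleton clusters
            continue
        # a: mean distance to the point's own cluster.
        a = sum(abs(points[i] - q) for q in same) / len(same)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(abs(points[i] - points[j]) for j in range(n) if labels[j] == other)
            / sum(1 for j in range(n) if labels[j] == other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / n
```

Well-separated clusters score close to 1, while a labeling that splits tight groups across clusters goes negative; the distributed formulation in the paper avoids materializing the full pairwise distance matrix that this naive version implies.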
[jira] [Commented] (SPARK-39722) Make Dataset.showString() public
[ https://issues.apache.org/jira/browse/SPARK-39722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705257#comment-17705257 ] ASF GitHub Bot commented on SPARK-39722: User 'VindhyaG' has created a pull request for this issue: https://github.com/apache/spark/pull/40553 > Make Dataset.showString() public > > > Key: SPARK-39722 > URL: https://issues.apache.org/jira/browse/SPARK-39722 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.8, 3.3.0 >Reporter: Jatin Sharma >Priority: Trivial > > Currently, we have {{.show}} APIs on a Dataset, but they print directly to > stdout. > But there are a lot of cases where we might need to get a String > representation of the show output. For example > * We have a logging framework to which we need to push the representation of > a df > * We have to send the string over a REST call from the driver > * We want to send the string to stderr instead of stdout > For such cases, currently one needs to do a hack by changing the Console.out > temporarily and catching the representation in a ByteArrayOutputStream or > similar, then extracting the string from it. > Strictly only printing to stdout seems like a limiting choice. > > Solution: > We expose APIs to return the String representation back. We already have the > .{{{}showString{}}} method internally. > > We could mirror the current {{.show}} APIS with a corresponding > {{.showString}} (and rename the internal private function to something else > if required) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
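The workaround the ticket describes (temporarily swapping the output stream to capture what show prints) looks roughly like this in Python, where contextlib.redirect_stdout plays the role of changing Console.out; a printing function stands in for df.show() so the sketch runs without Spark:

```python
import io
from contextlib import redirect_stdout

def capture_stdout(fn, *args, **kwargs):
    """Run fn and return whatever it printed to stdout as a string.

    This models the hack the ticket wants to make unnecessary: because
    DataFrame.show() only prints, callers must capture stdout to get a
    string they can log or ship over REST."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        fn(*args, **kwargs)
    return buf.getvalue()

# Hypothetical stand-in for DataFrame.show(): prints a small ASCII table.
def fake_show():
    print("+---+----+")
    print("| id|name|")
    print("+---+----+")

table_text = capture_stdout(fake_show)
assert "| id|name|" in table_text
```

A public showString() would return the string directly, with none of the stream-swapping fragility (and no interference with concurrent writers to stdout).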
[jira] [Commented] (SPARK-42931) dropDuplicates within watermark
[ https://issues.apache.org/jira/browse/SPARK-42931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705252#comment-17705252 ] ASF GitHub Bot commented on SPARK-42931: User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/40561 > dropDuplicates within watermark > --- > > Key: SPARK-42931 > URL: https://issues.apache.org/jira/browse/SPARK-42931 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Priority: Major > Attachments: [External] Mini design doc_ dropDuplicates within > watermark.pdf > > > We got many reports that dropDuplicates does not clean up the state even > though they have set a watermark for the query. We document the behavior > clearly that the event time column should be a part of the subset columns for > deduplication to clean up the state, but it cannot be applied to the > customers as timestamps are not exactly the same for duplicated events in > their use cases. > We propose to deduce a new API of dropDuplicates which has following > different characteristics compared to existing dropDuplicates: > * Weaker constraints on the subset (key) > ** Does not require an event time column on the subset. > * Looser semantics on deduplication > ** Only guarantee to deduplicate events within the watermark. > Since the new API leverages event time, the new API has following new > requirements: > * The input must be streaming DataFrame. > * The watermark must be defined. > * The event time column must be defined in the input DataFrame. > More specifically on the semantic, once the operator processes the first > arrived event, events arriving within the watermark for the first event will > be deduplicated. > (Technically, the expiration time should be the “event time of the first > arrived event + watermark delay threshold”, to match up with future events.) 
> Users are encouraged to set the delay threshold of watermark longer than max > timestamp differences among duplicated events. (If they are unsure, they can > alternatively set the delay threshold large enough, e.g. 48 hours.) > Longer design doc will be attached. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
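The proposed semantics can be modeled with a small amount of state per key: on the first arrival of a key, remember "first event time + delay threshold" as an expiration; later events for that key are dropped while the entry is live, and the entry is evicted once the watermark passes it. A simplified, hypothetical single-machine model (the real operator advances the watermark per micro-batch, not per event):

```python
class DedupWithinWatermark:
    """Toy model of dropDuplicates-within-watermark state handling."""

    def __init__(self, delay):
        self.delay = delay           # watermark delay threshold
        self.expires = {}            # key -> first event time + delay
        self.watermark = float("-inf")

    def process(self, key, event_time):
        """Return True if the event is emitted, False if deduplicated."""
        # Advance the watermark and evict expired dedup state.
        self.watermark = max(self.watermark, event_time - self.delay)
        self.expires = {k: t for k, t in self.expires.items() if t > self.watermark}
        if key in self.expires:
            return False             # duplicate arriving within the watermark
        self.expires[key] = event_time + self.delay
        return True
```

A key seen at t=100 with delay=10 suppresses repeats near t=105, but the same key at t=200 is emitted again because its state was evicted — exactly the "only guarantee to deduplicate within the watermark" semantics described above.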
[jira] [Created] (SPARK-42933) PySpark: dropDuplicates within watermark
Jungtaek Lim created SPARK-42933:
Summary: PySpark: dropDuplicates within watermark
Key: SPARK-42933
URL: https://issues.apache.org/jira/browse/SPARK-42933
Project: Spark
Issue Type: New Feature
Components: PySpark, Structured Streaming
Affects Versions: 3.5.0
Reporter: Jungtaek Lim

This is a follow-up issue for SPARK-42931, split out to ease review of SPARK-42931. Add to PySpark the equivalent of the public API we add in SPARK-42931, and also add a PySpark code example to the guide doc. Once we merge SPARK-42931, this ticket should ideally be marked as a blocker.
--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42932) Spark 3.3.2, with hadoop3, Error with java.io.IOException: Mkdirs failed to create file
shamim created SPARK-42932:
Summary: Spark 3.3.2, with hadoop3, Error with java.io.IOException: Mkdirs failed to create file
Key: SPARK-42932
URL: https://issues.apache.org/jira/browse/SPARK-42932
Project: Spark
Issue Type: Bug
Components: Spark Core, Spark Submit
Affects Versions: 3.3.2
Reporter: shamim

We are using Spark 3.3.2 with the Hadoop 3 build that ships with Spark:
https://www.apache.org/dyn/closer.lua/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
https://www.apache.org/dyn/closer.lua/spark/spark-3.3.2/spark-3.3.2-bin-hadoop2.tgz

In our application Spark runs standalone, and we are not using HDFS; Spark writes to the local file system. The same Spark version 3.3.2 works fine with Hadoop 2, but with Hadoop 3 we get the issue below.

23/03/18 20:23:24 WARN TaskSetManager: Lost task 4.0 in stage 0.0 (TID 4) (10.64.109.72 executor 0): java.io.IOException: Mkdirs failed to create file:/var/backup/_temporary/0/_temporary/attempt_202301182023173234741341853025716_0005_m_04_0 (exists=false, cwd=file:/opt/spark-3.3.2/work/app-20230118202317-0001/0)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:515)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:500)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1195)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1081)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:113)
at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.initWriter(SparkHadoopWriter.scala:238)
at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:126)
at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:88)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
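Hadoop's local ChecksumFileSystem raises this IOException when the underlying mkdirs() call returns false — typically because the OS user running the executor lacks write permission on some existing ancestor of the output path (here /var/backup) on that worker node, or because the path does not exist there at all. A small helper like the following (hypothetical, not part of Spark or Hadoop) can narrow this down when run on each worker under the executor's user:

```python
import os

def diagnose_mkdirs(path):
    """Walk up from `path` to the first ancestor that actually exists and
    report whether the current user could create directories beneath it.
    This mirrors the condition behind Hadoop's "Mkdirs failed to create
    <path>" error: like Java's File.mkdirs, directory creation fails when
    the deepest existing ancestor is not writable by the process owner."""
    p = os.path.abspath(path)
    while not os.path.exists(p):
        parent = os.path.dirname(p)
        if parent == p:      # reached the filesystem root
            break
        p = parent
    return {"first_existing_ancestor": p,
            "writable": os.access(p, os.W_OK | os.X_OK)}
```

For this report, running `diagnose_mkdirs("/var/backup/_temporary/0")` on each worker would show whether `/var/backup` exists there and is writable by the user the executor runs as.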
[jira] [Updated] (SPARK-42931) dropDuplicates within watermark
[ https://issues.apache.org/jira/browse/SPARK-42931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-42931:
Attachment: [External] Mini design doc_ dropDuplicates within watermark.pdf
> dropDuplicates within watermark
> ---
>
> Key: SPARK-42931
> URL: https://issues.apache.org/jira/browse/SPARK-42931
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Jungtaek Lim
> Priority: Major
> Attachments: [External] Mini design doc_ dropDuplicates within watermark.pdf
--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42930) Change the access scope of `ProtobufSerDe` related implementations to `private[spark]`
Yang Jie created SPARK-42930: Summary: Change the access scope of `ProtobufSerDe` related implementations to `private[spark]` Key: SPARK-42930 URL: https://issues.apache.org/jira/browse/SPARK-42930 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.0, 3.5.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42931) dropDuplicates within watermark
Jungtaek Lim created SPARK-42931:
Summary: dropDuplicates within watermark
Key: SPARK-42931
URL: https://issues.apache.org/jira/browse/SPARK-42931
Project: Spark
Issue Type: New Feature
Components: Structured Streaming
Affects Versions: 3.5.0
Reporter: Jungtaek Lim
--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42929) make mapInPandas / mapInArrow support "is_barrier"
[ https://issues.apache.org/jira/browse/SPARK-42929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu reassigned SPARK-42929: -- Assignee: Weichen Xu > make mapInPandas / mapInArrow support "is_barrier" > -- > > Key: SPARK-42929 > URL: https://issues.apache.org/jira/browse/SPARK-42929 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > make mapInPandas / mapInArrow support "is_barrier" -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42929) make mapInPandas / mapInArrow support "is_barrier"
Weichen Xu created SPARK-42929: -- Summary: make mapInPandas / mapInArrow support "is_barrier" Key: SPARK-42929 URL: https://issues.apache.org/jira/browse/SPARK-42929 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Weichen Xu make mapInPandas / mapInArrow support "is_barrier" -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
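Barrier execution mode (introduced by SPARK-24374) guarantees that all tasks of a stage are launched together and can coordinate, which is what an "is_barrier" flag on mapInPandas / mapInArrow would expose to Python UDF stages; the exact parameter name and signature are the subject of this ticket, not a settled API. The coordination guarantee itself can be sketched in pure Python with threading.Barrier:

```python
import threading

def run_barrier_stage(num_tasks, task_fn):
    """Launch num_tasks workers that all block on a shared barrier before
    doing work -- analogous to Spark's barrier execution mode, where every
    task of a stage is scheduled simultaneously and can synchronize (e.g.
    for distributed ML training), instead of running independently."""
    barrier = threading.Barrier(num_tasks)
    results = [None] * num_tasks

    def worker(i):
        barrier.wait()          # no task proceeds until all have started
        results[i] = task_fn(i)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

If fewer workers than the barrier expects ever start, `wait()` blocks forever — which mirrors why a Spark barrier stage requires enough free slots to schedule all of its tasks at once.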