[
https://issues.apache.org/jira/browse/SPARK-48566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Takuya Ueshin resolved SPARK-48566.
-----------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed
Issue resolved by pull request 46918
[https://github.com/apache/spark/pull/46918]
> [Bug] Partition indices are incorrect when UDTF analyze() uses both select
> and partitionColumns
> -----------------------------------------------------------------------------------------------
>
> Key: SPARK-48566
> URL: https://issues.apache.org/jira/browse/SPARK-48566
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Daniel
> Assignee: Daniel
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
>
> There is a bug that results in an internal error with some combination of the
> Python UDTF "select" and "partitionBy" options of the "analyze" method.
> To reproduce:
> {code:java}
> from pyspark.sql.functions import (
> AnalyzeArgument,
> AnalyzeResult,
> PartitioningColumn,
> SelectedColumn,
> udtf
> )
> from pyspark.sql.types import (
> DoubleType,
> StringType,
> StructType,
> )
> @udtf
> class TestTvf:
> @staticmethod
> def analyze(observed: AnalyzeArgument) -> AnalyzeResult:
> out_schema = StructType()
> out_schema.add("partition_col", StringType())
> out_schema.add("double_col", DoubleType())
> return AnalyzeResult(
> schema=out_schema,
> partitionBy=[PartitioningColumn("partition_col")],
> select=[
> SelectedColumn("partition_col"),
> SelectedColumn("double_col"),
> ],
> )
> def eval(self, *args, **kwargs):
> pass
> def terminate(self):
> for _ in range(10):
> yield {
> "partition_col": None,
> "double_col": 1.0,
> }
> spark.udtf.register("serialize_test", TestTvf)
> # Fails
> (
> spark
> .sql(
> """
> SELECT * FROM serialize_test(
> TABLE(
> SELECT
> 5 AS unused_col,
> 'hi' AS partition_col,
> 1.0 AS double_col
>
> UNION ALL
> SELECT
> 4 AS unused_col,
> 'hi' AS partition_col,
> 1.0 AS double_col
> )
> )
> """
> )
> .toPandas()
> ){code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]