[jira] [Assigned] (SPARK-48566) [Bug] Partition indices are incorrect when UDTF analyze() uses both select and partitionColumns

Takuya Ueshin (Jira) Mon, 17 Jun 2024 11:14:16 -0700


     [ https://issues.apache.org/jira/browse/SPARK-48566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Takuya Ueshin reassigned SPARK-48566:
-------------------------------------

    Assignee: Daniel

> [Bug] Partition indices are incorrect when UDTF analyze() uses both select 
> and partitionColumns
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-48566
>                 URL: https://issues.apache.org/jira/browse/SPARK-48566
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 4.0.0
>            Reporter: Daniel
>            Assignee: Daniel
>            Priority: Major
>              Labels: pull-request-available
>
> A bug causes an internal error for some combinations of the "select" and
> "partitionBy" options returned by a Python UDTF's "analyze" method.
> To reproduce:
> {code:python}
> from pyspark.sql.functions import (
>     AnalyzeArgument,
>     AnalyzeResult,
>     PartitioningColumn,
>     SelectedColumn,
>     udtf
> )
> from pyspark.sql.types import (
>     DoubleType,
>     StringType,
>     StructType,
> )
> @udtf
> class TestTvf:
>     @staticmethod
>     def analyze(observed: AnalyzeArgument) -> AnalyzeResult:
>         out_schema = StructType()
>         out_schema.add("partition_col", StringType())
>         out_schema.add("double_col", DoubleType())
>         return AnalyzeResult(
>             schema=out_schema,
>             partitionBy=[PartitioningColumn("partition_col")],
>             select=[
>                 SelectedColumn("partition_col"),
>                 SelectedColumn("double_col"),
>             ],
>         )
>     def eval(self, *args, **kwargs):
>         pass
>     def terminate(self):
>         for _ in range(10):
>             yield {
>                 "partition_col": None,
>                 "double_col": 1.0,
>             }
> spark.udtf.register("serialize_test", TestTvf)
> # Fails with an internal error
> (
>     spark
>     .sql(
>         """
>         SELECT * FROM serialize_test(
>             TABLE(
>                 SELECT
>                     5 AS unused_col,
>                     'hi' AS partition_col,
>                     1.0 AS double_col
>                 
>                 UNION ALL
>                 SELECT
>                     4 AS unused_col,
>                     'hi' AS partition_col,
>                     1.0 AS double_col
>             )
>         )
>         """
>     )
>     .toPandas()
> ){code}
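
The issue title says the partition indices are incorrect. One way to picture that kind of mismatch is the plain-Python sketch below. It is an illustration under stated assumptions, not a description of Spark's internals: the helper `indices_of` and the variable names are hypothetical, and only the column names come from the reproduction above. The idea is that an index computed against the full input table no longer points at the same column once "select" has dropped `unused_col`.

```python
# Illustrative only: plain Python, NOT Spark's actual implementation.
# Column names are taken from the reproduction; everything else is hypothetical.
input_columns = ["unused_col", "partition_col", "double_col"]  # TABLE argument
selected = ["partition_col", "double_col"]  # names passed to AnalyzeResult.select
partition_by = ["partition_col"]            # names passed to AnalyzeResult.partitionBy


def indices_of(names, layout):
    """Return the position of each name within the given column layout."""
    return [layout.index(name) for name in names]


# Index of the partitioning column in the original input table.
indices_before_select = indices_of(partition_by, input_columns)
# Index of the same column in the projected rows that remain after `select`.
indices_after_select = indices_of(partition_by, selected)

# A projected row: (partition_col, double_col).
row = ("hi", 1.0)

print(indices_before_select)                    # [1]
print(indices_after_select)                     # [0]
print([row[i] for i in indices_before_select])  # [1.0], the wrong column
print([row[i] for i in indices_after_select])   # ['hi'], the intended column
```

In this sketch the stale index 1 picks up `double_col` instead of `partition_col`. Whether the actual bug follows this exact path is an assumption; the issue itself reports only that the combination leads to an internal error.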



--
This message was sent by Atlassian Jira
(v8.20.10#820010)