[jira] [Created] (SPARK-48566) [Bug] Partition indices are incorrect when UDTF analyze() uses both select and partitionColumns

Daniel (Jira) Fri, 07 Jun 2024 12:19:04 -0700

Daniel created SPARK-48566:
------------------------------

             Summary: [Bug] Partition indices are incorrect when UDTF analyze() 
uses both select and partitionColumns
                 Key: SPARK-48566
                 URL: https://issues.apache.org/jira/browse/SPARK-48566
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark
    Affects Versions: 4.0.0
            Reporter: Daniel



There is a bug that results in an internal error with some combination of the 
Python UDTF "select" and "partitionBy" options of the "analyze" method.

To reproduce:
{code:java}
from pyspark.sql.functions import (
    AnalyzeArgument,
    AnalyzeResult,
    PartitioningColumn,
    SelectedColumn,
    udtf
)

from pyspark.sql.types import (
    DoubleType,
    StringType,
    StructType,
)

@udtf
class TestTvf:
    @staticmethod
    def analyze(observed: AnalyzeArgument) -> AnalyzeResult:
        out_schema = StructType()
        out_schema.add("partition_col", StringType())
        out_schema.add("double_col", DoubleType())

        return AnalyzeResult(
            schema=out_schema,
            partitionBy=[PartitioningColumn("partition_col")],
            select=[
                SelectedColumn("partition_col"),
                SelectedColumn("double_col"),
            ],
        )

    def eval(self, *args, **kwargs):
        pass

    def terminate(self):
        for _ in range(10):
            yield {
                "partition_col": None,
                "double_col": 1.0,
            }


spark.udtf.register("serialize_test", TestTvf) 

# Fails
(
    spark
    .sql(
        """
        SELECT * FROM serialize_test(
            TABLE(
                SELECT
                    5 AS unused_col,
                    'hi' AS partition_col,
                    1.0 AS double_col
                
                UNION ALL

                SELECT
                    4 AS unused_col,
                    'hi' AS partition_col,
                    1.0 AS double_col
            )
        )
        """
    )
    .toPandas()
){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Created] (SPARK-48566) [Bug] Partition indices are incorrect when UDTF analyze() uses both select and partitionColumns

Reply via email to