Daniel created SPARK-48566:
------------------------------
Summary: [Bug] Partition indices are incorrect when UDTF analyze()
uses both select and partitionColumns
Key: SPARK-48566
URL: https://issues.apache.org/jira/browse/SPARK-48566
Project: Spark
Issue Type: Sub-task
Components: PySpark
Affects Versions: 4.0.0
Reporter: Daniel
An internal error occurs when the "analyze" method of a Python UDTF returns an
AnalyzeResult that sets both the "select" and "partitionBy" options.
To reproduce:
{code:java}
from pyspark.sql.functions import (
    AnalyzeArgument,
    AnalyzeResult,
    PartitioningColumn,
    SelectedColumn,
    udtf
)
from pyspark.sql.types import (
    DoubleType,
    StringType,
    StructType,
)


@udtf
class TestTvf:
    @staticmethod
    def analyze(observed: AnalyzeArgument) -> AnalyzeResult:
        out_schema = StructType()
        out_schema.add("partition_col", StringType())
        out_schema.add("double_col", DoubleType())

        return AnalyzeResult(
            schema=out_schema,
            partitionBy=[PartitioningColumn("partition_col")],
            select=[
                SelectedColumn("partition_col"),
                SelectedColumn("double_col"),
            ],
        )

    def eval(self, *args, **kwargs):
        pass

    def terminate(self):
        for _ in range(10):
            yield {
                "partition_col": None,
                "double_col": 1.0,
            }


spark.udtf.register("serialize_test", TestTvf)

# Fails
(
    spark
    .sql(
        """
        SELECT * FROM serialize_test(
            TABLE(
                SELECT
                    5 AS unused_col,
                    'hi' AS partition_col,
                    1.0 AS double_col
                UNION ALL
                SELECT
                    4 AS unused_col,
                    'hi' AS partition_col,
                    1.0 AS double_col
            )
        )
        """
    )
    .toPandas()
){code}
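
The summary says the partition indices are incorrect. One possible reading, sketched below in plain Python with no Spark involved, is that the index of the partitioning column is resolved against the original input table rather than against the columns that remain after "select" is applied. This is only an assumption made for illustration; the report does not confirm the root cause, and the helper indices_in is hypothetical, not part of PySpark.

```python
# Illustrative only: plain Python, no Spark. This models one way an index
# mismatch could arise; it is NOT Spark's actual implementation.

# Columns of the input table in the reproduction above.
input_columns = ["unused_col", "partition_col", "double_col"]
# Mirrors select=[SelectedColumn(...), ...] in analyze().
selected = ["partition_col", "double_col"]
# Mirrors partitionBy=[PartitioningColumn("partition_col")].
partition_by = ["partition_col"]


def indices_in(columns, names):
    """Hypothetical helper: position of each name within a column list."""
    return [columns.index(name) for name in names]


# Index resolved against the original input table.
stale = indices_in(input_columns, partition_by)
# Index resolved against the projected columns.
correct = indices_in(selected, partition_by)

# Applying each index list to the projected columns.
picked_with_stale = [selected[i] for i in stale]
picked_with_correct = [selected[i] for i in correct]

print(stale, picked_with_stale)
print(correct, picked_with_correct)
```

In this model, dropping unused_col shifts every later column one position to the left, so an index computed before the projection no longer refers to partition_col afterwards.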