[
https://issues.apache.org/jira/browse/SPARK-31930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julia Maddalena updated SPARK-31930:
------------------------------------
Description:
Returning an ArrayType() from a pandas_udf consistently drops specific list
elements from the returned value.
We were able to create a reproducible example, shown below.
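{code:python}
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import ArrayType, LongType

# Assumes an active SparkSession named `spark` (as provided by Databricks).
df = spark.createDataFrame([('A', 1), ('A', 2), ('B', 5), ('B', 6), ('C', 10)],
                           ['group', 'val'])

@pandas_udf(ArrayType(ArrayType(LongType())), PandasUDFType.GROUPED_AGG)
def get_list(x):
    return [[1,1], [2,2], [3,3], [4,4], [5,5], [6,6], [7,7], [8,8]]

df.groupby('group').agg(get_list(df['val']).alias('list_col')).show(3, False)
{code}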
{noformat}
+-----+-----------------------------+
|group|list_col                     |
+-----+-----------------------------+
|B    |[[1, 1],,,,,, [7, 7], [8, 8]]|
|C    |[[1, 1],,,,,, [7, 7], [8, 8]]|
|A    |[[1, 1],,,,,, [7, 7], [8, 8]]|
+-----+-----------------------------+
{noformat}
In every example we've tried, elements 2-6 are consistently replaced with
None (and, in some cases, some later elements as well).
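
The pattern in the reported output can be checked mechanically. The following is a minimal sketch in plain Python (no Spark required). `nulled_positions` is a hypothetical helper, not part of PySpark, and the `actual` list is transcribed by hand from the reported output, with each empty slot written as None.

```python
def nulled_positions(expected, actual):
    """Return the 1-based positions where `actual` is None but `expected` is not."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")
    return [
        i
        for i, (exp, act) in enumerate(zip(expected, actual), start=1)
        if exp is not None and act is None
    ]


# The list returned by get_list in the reproduction.
expected = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8]]
# Hand-transcribed from the reported output; empty slots are written as None.
actual = [[1, 1], None, None, None, None, None, [7, 7], [8, 8]]

print(nulled_positions(expected, actual))  # [2, 3, 4, 5, 6]
```

Against a live cluster, the same helper could be applied to the list_col value of a collected row, which would make it easier to compare the affected positions across different list lengths.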
> Pandas_udf does not properly return ArrayType
> ---------------------------------------------
>
> Key: SPARK-31930
> URL: https://issues.apache.org/jira/browse/SPARK-31930
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.3
> Environment: Azure Databricks
> Reporter: Julia Maddalena
> Priority: Blocker
>