[
https://issues.apache.org/jira/browse/SPARK-27519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-27519:
---------------------------------
Affects Version/s: 3.0.0
> Pandas udf corrupting data
> --------------------------
>
> Key: SPARK-27519
> URL: https://issues.apache.org/jira/browse/SPARK-27519
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.0, 3.0.0
> Reporter: Jeff gold
> Priority: Major
> Attachments: Pandas UDF Bug.py
>
>
> While trying to use a pandas UDF, I passed it two columns: a string and a
> list of lists of strings. The second argument has a structure like this:
> [['1'],['2'],['3']]
> But inside the UDF, I receive something like this for the same value:
> [['1','2'],['3'],[]]
> I checked, and the same row in the table holds the list with the correct
> structure; it changes only inside the UDF.
>
> I don't know why this happens, but I do know it is related to the fact that
> the row was the 10,001st row and the last row in its partition.
> The pandas batch size is 10,000, so that row was sent alone as a second
> batch. That is the only thing that seems to cause it: having 1 or 2 rows in
> a second batch of the partition. I was also able to reproduce this with a
> second batch of 2 rows; there the list wasn't changed, except that an empty
> list was appended to the end.
> Hope you can help me understand what is going on, thanks!
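
To make the batch boundary in the report above concrete, here is a minimal pure-Python sketch of how a 10,001-row partition splits into batches of at most 10,000 rows. It is a simplified model under stated assumptions, not Spark's actual code path: the helper names split_into_batches and batch_sizes are hypothetical, and the default of 10,000 mirrors the batch size mentioned in the report (in Spark this limit is controlled by spark.sql.execution.arrow.maxRecordsPerBatch).

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(
    rows: Sequence[T], max_records_per_batch: int = 10_000
) -> Iterator[List[T]]:
    """Yield consecutive slices of `rows`, each with at most
    `max_records_per_batch` elements (hypothetical helper, simplified model)."""
    if max_records_per_batch <= 0:
        raise ValueError("max_records_per_batch must be positive in this sketch")
    for start in range(0, len(rows), max_records_per_batch):
        yield list(rows[start:start + max_records_per_batch])


def batch_sizes(num_rows: int, max_records_per_batch: int = 10_000) -> List[int]:
    """Return the size of each batch for a partition holding `num_rows` rows."""
    return [
        len(batch)
        for batch in split_into_batches(range(num_rows), max_records_per_batch)
    ]


if __name__ == "__main__":
    # A partition shaped like the one in the report: a string column and a
    # list-of-lists-of-strings column, 10,001 rows in total.
    partition = [("key", [["1"], ["2"], ["3"]])] * 10_001
    batches = list(split_into_batches(partition))
    print([len(b) for b in batches])  # sizes of the first and second batch
    print(batches[-1])                # the single trailing row, sent alone
```

With 10,001 or 10,002 rows in one partition, the trailing batch holds only 1 or 2 rows, which is the condition the reporter associates with the corrupted nested list. The sketch only shows where the boundary falls; it does not reproduce or explain the corruption itself.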