[jira] [Updated] (SPARK-27519) Pandas udf corrupting data

Hyukjin Kwon (JIRA) Sun, 28 Apr 2019 18:06:10 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-27519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hyukjin Kwon updated SPARK-27519:
---------------------------------
    Affects Version/s: 3.0.0

> Pandas udf corrupting data
> --------------------------
>
>                 Key: SPARK-27519
>                 URL: https://issues.apache.org/jira/browse/SPARK-27519
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0, 3.0.0
>            Reporter: Jeff gold
>            Priority: Major
>         Attachments: Pandas UDF Bug.py
>
>
> While trying to use a pandas UDF, I passed it two columns: a string and a 
> list of lists of strings. The second argument has a structure like this: 
> [['1'],['2'],['3']]
> But when reading this same value inside the UDF, I receive something like this: 
> [['1','2'],['3'],[]]
> I checked, and the same row in the table has the list with the correct 
> structure; it changed only inside the UDF.
>  
> I don't know why this happens, but I do know it has something to do with the 
> fact that the row was the 10,001st row and the last row in its partition. 
> The pandas batch size is 10,000, so that row was sent alone as a second batch, 
> and that is the only thing that seems to cause it: having 1 or 2 rows in a 
> second batch of the partition. I was also able to reproduce this with a second 
> batch of 2 rows; in that case the list was unchanged except that an empty list 
> was appended to the end. 
> I hope you can help me understand what is going on. Thanks!
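
The batch arithmetic described in the report can be sketched in plain Python. This is only an illustration of how a 10,001-row partition splits at a batch size of 10,000 (in Spark this limit is governed by spark.sql.execution.arrow.maxRecordsPerBatch, which defaults to 10000). It does not reproduce the corruption itself, and split_into_batches is a hypothetical helper, not a Spark API.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(rows: Sequence[T], batch_size: int = 10_000) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size rows.

    Hypothetical helper: it only mimics how a partition is cut into
    fixed-size batches; it is not how Spark implements batching.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


# One partition of 10,001 rows; each row is (string, list of lists of strings),
# using the nested value quoted in the report.
partition = [("row-%d" % i, [["1"], ["2"], ["3"]]) for i in range(10_001)]

batches = list(split_into_batches(partition))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)  # [10000, 1]
```

Under these assumptions the last row of the partition ends up alone in the second batch, which matches the condition the reporter associates with the corrupted value.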




