[jira] [Commented] (SPARK-27519) Pandas udf corrupting data

Bryan Cutler (JIRA) Tue, 30 Apr 2019 15:48:46 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-27519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830742#comment-16830742
 ]


Bryan Cutler commented on SPARK-27519:
--------------------------------------

Thanks for the script [~f7faf8ba36]. I was able to reproduce the issue with
Spark 2.3.0 using pyarrow 0.8.0 and 0.12.1. With master I did not see the
issue, so it may have been fixed under another JIRA, in which case the fix
will be in 3.0.0. I did not test Spark 2.4.0. I'm going to close this, but
please try master or use 3.0.0 when it is released.

I did notice something strange when running on master, though: some rows come
back with values of None for some reason. If I run
{{df.distinct().collect()}}, the output is
{{[Row(value=[[None, None]]), Row(value=[[1, 2], [3, 4]])]}}.
This does not seem related to the issue here, so I will open another JIRA.

> Pandas udf corrupting data
> --------------------------
>
>                 Key: SPARK-27519
>                 URL: https://issues.apache.org/jira/browse/SPARK-27519
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0, 3.0.0
>            Reporter: Jeff gold
>            Priority: Major
>         Attachments: Pandas UDF Bug.py
>
>
> While trying to use a pandas UDF, I passed it two columns: a string and a
> list of lists of strings. An example of the second argument's structure:
> [['1'],['2'],['3']]
> But inside the UDF, the same value arrives as something like this:
> [['1','2'],['3'],[]]
> I checked, and the same row in the table has the list with the correct
> structure; it changes only inside the UDF.
>
> I don't know why this happens, but I do know it is related to the fact
> that the row was the 10,001st row and the last row in its partition.
> The pandas batch size is 10,000, so that row was sent alone as a second
> batch, and that is the only thing that seems to cause it: having 1 or 2
> rows in a second batch of the partition. I was also able to reproduce this
> with a second batch of 2 rows; in that case the list wasn't changed except
> that an empty list was added to the end.
> Hope you can help me understand what is going on, thanks!
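
To make the batching condition in the report above concrete, here is a minimal pure-Python sketch of how a 10,001-row partition splits into batches of at most 10,000 rows. It is not Spark's implementation, and it does not reproduce the corruption; the helper name {{split_into_batches}} and the sample partition are hypothetical. In Spark the batch size is governed by {{spark.sql.execution.arrow.maxRecordsPerBatch}}, which defaults to 10000.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(rows: Sequence[T], batch_size: int = 10_000) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size rows (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


# Hypothetical partition: 10,001 rows, each holding a list of lists of strings,
# matching the shape described in the report.
partition = [[["1"], ["2"], ["3"]] for _ in range(10_001)]

# The last row ends up alone in a second batch, which is the condition the
# reporter associates with the corrupted value.
batch_sizes = [len(batch) for batch in split_into_batches(partition)]
print(batch_sizes)
```

A partition of 10,002 rows would leave 2 rows in the second batch, the other case mentioned in the report.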



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
