[ https://issues.apache.org/jira/browse/SPARK-27519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830742#comment-16830742 ]
Bryan Cutler commented on SPARK-27519:
--------------------------------------

Thanks for the script [~f7faf8ba36], I was able to reproduce with Spark 2.3.0 using pyarrow 0.8.0 and 0.12.1. With master, I did not see the issue, so it could have been fixed by another Jira and will be in 3.0.0. I did not try it on Spark 2.4.0. I'm going to close this then, but please try master or use 3.0.0 when it is released.

I did notice something strange when running master though. I get rows with values of None for some reason, so if I run {{df.distinct().collect()}} then the output is {{[Row(value=[[None, None]]), Row(value=[[1, 2], [3, 4]])]}}. This does not seem related to the issue here, so I will open another JIRA.

> Pandas udf corrupting data
> --------------------------
>
>                 Key: SPARK-27519
>                 URL: https://issues.apache.org/jira/browse/SPARK-27519
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0, 3.0.0
>            Reporter: Jeff gold
>            Priority: Major
>         Attachments: Pandas UDF Bug.py
>
>
> While trying to use a pandas udf, I sent the udf 2 columns, a string and a
> list of lists of strings. The second argument has, for example, this structure:
> [['1'],['2'],['3']]
> But when getting this same value in the udf, I receive something like this:
> [['1','2'],['3'],[]]
> I checked, and the same row in the table has the list with the correct
> structure; it changed only in the udf.
>
> I don't know why this happens, but I do know it has something to do with the
> fact that that row was the 10,001st row and the last row in its partition.
> The Pandas batch size is 10,000, so that row was sent alone as a second batch,
> and that's the only thing that seems to cause it: having 1 or 2 rows in a
> second batch of the partition. I was also able to get this with a second batch
> of 2 rows; the list wasn't changed except that an empty list was added to the end.
> Hope you can help me understand what is going on, thanks!
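
The batching arithmetic described in the report can be sketched in plain Python, without Spark. This is only an illustration, assuming a partition is cut into consecutive batches of at most 10,000 rows (the figure the reporter cites; in Spark this limit is controlled by {{spark.sql.execution.arrow.maxRecordsPerBatch}}). The helper {{batch_sizes}} is hypothetical and is not a Spark or pyarrow API.

```python
from typing import List

# Assumption: 10,000 is the batch size quoted in the report
# (the default of spark.sql.execution.arrow.maxRecordsPerBatch).
DEFAULT_MAX_RECORDS_PER_BATCH = 10_000


def batch_sizes(partition_rows: int,
                max_records: int = DEFAULT_MAX_RECORDS_PER_BATCH) -> List[int]:
    """Row count of each batch when a partition is cut into consecutive
    chunks of at most `max_records` rows (hypothetical helper)."""
    if partition_rows < 0:
        raise ValueError("partition_rows must be non-negative")
    if max_records <= 0:
        raise ValueError("max_records must be positive")
    full, rest = divmod(partition_rows, max_records)
    # `full` complete batches, plus one short trailing batch if rows remain.
    return [max_records] * full + ([rest] if rest else [])


if __name__ == "__main__":
    # Partition sizes around the boundary mentioned in the report.
    for rows in (10_000, 10_001, 10_002):
        print(rows, batch_sizes(rows))
```

Under this assumption, a partition of 10,001 or 10,002 rows ends with a trailing batch of 1 or 2 rows, which is the situation the reporter associates with the corrupted value.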