Simone created SPARK-13301:
------------------------------

             Summary: PySpark Dataframe return wrong results with custom UDF
                 Key: SPARK-13301
                 URL: https://issues.apache.org/jira/browse/SPARK-13301
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5.0
         Environment: PySpark - CDH 5.5.1
            Reporter: Simone
            Priority: Critical


Using a user-defined function (UDF) inside the withColumn() method of a 
PySpark DataFrame gives wrong results.

Here is an example:

# UDF that returns the lowercase version of a string
from pyspark.sql import functions
import string
myFunc = functions.udf(lambda x: string.lower(x))

myDF.select("col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show()
+--------------------+-----------+--------------------+
|                col1|       col2|                col3|
+--------------------+-----------+--------------------+
|1265AB4F65C05740E...|        Ivo|4f00ae514e7c015be...|
|1D94AB4F75C83B51E...|   Raffaele|4f00dcf6422100c0e...|
|4F008903600A0133E...|   Cristina|4f008903600a0133e...|

The results are wrong and appear to be random: some records are correct (for 
example the third), while others are not (for example the first two).

The problem does not seem to occur with Spark built-in functions:
from pyspark.sql.functions import *
myDF.select("col1", "col2").withColumn("col3", lower(myDF["col1"])).show()

Without the withColumn() method, the results seem to always be correct:
myDF.select("col1", "col2", myFunc(myDF["col1"])).show()
This is only a partial workaround, because you have to list all the columns of 
your DataFrame each time.
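
A possible way to avoid retyping the column list (a sketch only, not verified 
against the buggy code path) is to reuse myDF.columns and append the aliased 
UDF column in a single select():

# Reuse the DataFrame's own column list and append the UDF column with an alias
myDF.select(myDF.columns + [myFunc(myDF["col1"]).alias("col3")]).show()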

The problem also does not seem to occur in Scala/Java.


