Simone created SPARK-13301:
------------------------------

             Summary: PySpark Dataframe return wrong results with custom UDF
                 Key: SPARK-13301
                 URL: https://issues.apache.org/jira/browse/SPARK-13301
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5.0
         Environment: PySpark - CDH 5.5.1
            Reporter: Simone
            Priority: Critical

Using a user-defined function (UDF) in PySpark inside the withColumn() method of a DataFrame gives wrong results.

Here is an example:

    # UDF that returns the lowercase version of a string
    from pyspark.sql import functions
    import string

    myFunc = functions.udf(lambda x: string.lower(x))

    myDF.select("col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show()

    +--------------------+-----------+--------------------+
    |                col1|       col2|                col3|
    +--------------------+-----------+--------------------+
    |1265AB4F65C05740E...|        Ivo|4f00ae514e7c015be...|
    |1D94AB4F75C83B51E...|   Raffaele|4f00dcf6422100c0e...|
    |4F008903600A0133E...|   Cristina|4f008903600a0133e...|

col3 should be the lowercase form of col1. The results are wrong and seem to be random: some records are correct (for example, the third), while others are not (for example, the first two).

The problem does not seem to occur with Spark built-in functions:

    from pyspark.sql.functions import *

    myDF.select("col1", "col2").withColumn("col3", lower(myDF["col1"])).show()

Without the withColumn() method, the results always seem to be correct:

    myDF.select("col1", "col2", myFunc(myDF["col1"])).show()

This is only a partial workaround, because you have to list all the columns of your DataFrame each time.

Also, the problem does not seem to occur in Scala/Java.
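
The select()-based workaround from the report can be wrapped so that the column list does not have to be retyped each time. This is only a minimal sketch: the helper name with_udf_column is hypothetical (it is not part of PySpark), and it assumes that select() stays unaffected by the problem, as observed in the report. It relies only on DataFrame.columns, DataFrame.select(), column indexing, and Column.alias().

```python
def with_udf_column(df, name, udf_column):
    """Return df plus one extra column, built with select() instead of withColumn().

    Hypothetical helper, not a PySpark API. `df` is a DataFrame and
    `udf_column` is a Column expression such as myFunc(df["col1"]).
    """
    # Re-select every existing column so none has to be listed by hand.
    existing = [df[c] for c in df.columns]
    # Append the UDF result under the requested column name.
    return df.select(*(existing + [udf_column.alias(name)]))


# Usage with the names from the report (myDF and myFunc are assumed to exist):
# base = myDF.select("col1", "col2")
# with_udf_column(base, "col3", myFunc(base["col1"])).show()
```

Whether this avoids the wrong results in every case has not been verified; it only automates the select() form that the report describes as producing correct output.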