[
https://issues.apache.org/jira/browse/SPARK-13301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Simone updated SPARK-13301:
---------------------------
Description:
Using a user-defined function (UDF) in PySpark inside the withColumn() method of
a DataFrame gives wrong results.
Here is an example:
from pyspark.sql import functions
import string
myFunc = functions.udf(lambda x: string.lower(x))
myDF.select("col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show()
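+--------------------+-----------+--------------------+
|                col1|       col2|                col3|
+--------------------+-----------+--------------------+
|1265AB4F65C05740E...|        Ivo|4f00ae514e7c015be...|
|1D94AB4F75C83B51E...|   Raffaele|4f00dcf6422100c0e...|
|4F008903600A0133E...|   Cristina|4f008903600a0133e...|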
The results are wrong and seem to be random: some records are correct (for
example, the third) while others are not (for example, the first two).
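To make the mismatch concrete, here is a minimal plain-Python sketch that checks the three displayed rows against the expected result (col3 should be the lower-cased col1). The helper name mismatched_rows is hypothetical, and the values are kept truncated ("...") exactly as show() printed them, so this only compares the visible prefixes.

```python
# Rows as printed by show(); values are truncated ("...") exactly as displayed.
rows = [
    ("1265AB4F65C05740E...", "Ivo", "4f00ae514e7c015be..."),
    ("1D94AB4F75C83B51E...", "Raffaele", "4f00dcf6422100c0e..."),
    ("4F008903600A0133E...", "Cristina", "4f008903600a0133e..."),
]


def mismatched_rows(rows):
    """Return the rows whose col3 is not the lower-cased col1."""
    return [row for row in rows if row[2] != row[0].lower()]


bad = mismatched_rows(rows)
for col1, col2, col3 in bad:
    # Report what the UDF should have produced next to what was displayed.
    print("%s: expected %s, got %s" % (col2, col1.lower(), col3))
```

On a real DataFrame the same comparison could be run over collected rows, but that is an assumption and not something done in this report.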
The problem does not seem to occur with Spark built-in functions:
from pyspark.sql.functions import *
myDF.select("col1", "col2").withColumn("col3", lower(myDF["col1"])).show()
Without the withColumn() method, the results always seem to be correct:
myDF.select("col1", "col2", myFunc(myDF["col1"])).show()
This is only a partial workaround, because you have to list all the columns of
your DataFrame each time.
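One possible way to avoid typing every column by hand is to build the select() arguments from the DataFrame's columns attribute. This is only a sketch: the helper select_args is hypothetical, and whether a select() built this way avoids the wrong results has not been verified against this issue.

```python
def select_args(existing_columns, new_column):
    """Build the argument list for DataFrame.select(): every existing
    column name followed by one extra column expression."""
    return list(existing_columns) + [new_column]


# Hypothetical usage with the names from the report (requires PySpark):
#   args = select_args(myDF.columns, myFunc(myDF["col1"]).alias("col3"))
#   myDF.select(*args).show()

# Plain-Python demonstration with a placeholder in place of the UDF column.
print(select_args(["col1", "col2"], "<col3 expression>"))
```

The helper keeps the column order of the original DataFrame and appends the new expression last, which matches what withColumn() would have produced.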
The problem also does not seem to occur in Scala/Java.
> PySpark Dataframe return wrong results with custom UDF
> ------------------------------------------------------
>
> Key: SPARK-13301
> URL: https://issues.apache.org/jira/browse/SPARK-13301
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Environment: PySpark - CDH 5.5.1
> Reporter: Simone
> Priority: Critical