[ https://issues.apache.org/jira/browse/SPARK-13301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386564#comment-15386564 ]

Sam Fishman commented on SPARK-13301:
-------------------------------------

I am having the same issue when applying a UDF to a DataFrame. I've noticed 
that it seems to occur only when the data is read in using sqlContext.sql(). When I 
use parallelize on a local Python collection and then call toDF on the 
resulting RDD, I don't see the issue:

Works:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

data = [["1", "A->B"], ["2", "B->C"], ["3", "C->A"], ["4", "D->E"], ["5", "E->D"]]
rdd = sc.parallelize(data)
df = rdd.toDF(["id", "segment"])

def myFunction(number):
    id_test = number + ' test'
    return id_test

test_function_udf = udf(myFunction, StringType())
df2 = df.withColumn('test', test_function_udf("id"))

Does not work (i.e., "wrong results" similar to Simone's output):

# Assuming I have data stored in a table
df = sqlContext.sql("select * from my_table")

def myFunction(number):
    id_test = number + ' test'
    return id_test

test_function_udf = udf(myFunction, StringType())
df2 = df.withColumn('test', test_function_udf("id"))

I should also note that this seems to be an issue only when using a UDF. If I 
use the built-in concat function instead, I do not get the wrong results.
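
For reference, a minimal sketch of the built-in version (assuming the df from 
the failing example above, and Spark 1.5+, where concat() is available):

from pyspark.sql.functions import concat, lit

# Same transformation as the UDF, but using only built-in SQL functions,
# so it is evaluated on the JVM rather than in the Python worker
df2 = df.withColumn('test', concat(df['id'], lit(' test')))
df2.show()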


> PySpark Dataframe return wrong results with custom UDF
> ------------------------------------------------------
>
>                 Key: SPARK-13301
>                 URL: https://issues.apache.org/jira/browse/SPARK-13301
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>         Environment: PySpark in yarn-client mode - CDH 5.5.1
>            Reporter: Simone
>            Priority: Critical
>
> Using a user-defined function (UDF) in PySpark inside the withColumn() method of 
> a DataFrame gives wrong results.
> Here is an example:
> from pyspark.sql import functions
> import string
> myFunc = functions.udf(lambda s: string.lower(s))
> myDF.select("col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show()
> |                col1|       col2|                col3|
> |1265AB4F65C05740E...|        Ivo|4f00ae514e7c015be...|
> |1D94AB4F75C83B51E...|   Raffaele|4f00dcf6422100c0e...|
> |4F008903600A0133E...|   Cristina|4f008903600a0133e...|
> The results are wrong and seem to be random: some records are OK (for example, 
> the third), while others are not (for example, the first two).
> The problem does not seem to occur with Spark built-in functions:
> from pyspark.sql.functions import *
> myDF.select("col1", "col2").withColumn("col3", lower(myDF["col1"])).show()
> Without the withColumn() method, the results seem to always be correct:
> myDF.select("col1", "col2", myFunc(myDF["col1"])).show()
> This is only a partial workaround, because you have to list all the columns of 
> your DataFrame each time.
> The problem also does not seem to occur in Scala/Java.
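
A sketch of the partial workaround quoted above, with the column list generated 
from df.columns rather than typed out (assuming the myDF and myFunc names from 
the report):

# select() every existing column plus the UDF result, without listing names by hand
myDF.select(*(myDF.columns + [myFunc(myDF["col1"]).alias("col3")])).show()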


