[ https://issues.apache.org/jira/browse/SPARK-13301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386564#comment-15386564 ]
Sam Fishman commented on SPARK-13301:
-------------------------------------

I am having the same issue when applying a UDF to a DataFrame. It seems to occur only when the data is read in using sqlContext.sql(). When I call parallelize on a local Python collection and then call toDF on the resulting RDD, I don't see the issue:

Works:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    data = [
        ["1", "A->B"],
        ["2", "B->C"],
        ["3", "C->A"],
        ["4", "D->E"],
        ["5", "E->D"]
    ]
    rdd = sc.parallelize(data)
    df = rdd.toDF(["id", "segment"])

    def myFunction(number):
        id_test = number + ' test'
        return id_test

    test_function_udf = udf(myFunction, StringType())
    df2 = df.withColumn('test', test_function_udf("id"))

Does not work (i.e. "wrong results" similar to Simone's output):

    # Assuming I have data stored in a table
    df = sqlContext.sql("select * from my_table")

    def myFunction(number):
        id_test = number + ' test'
        return id_test

    test_function_udf = udf(myFunction, StringType())
    df2 = df.withColumn('test', test_function_udf("id"))

I should note that this also seems to be an issue only when using a UDF. If I use the built-in concat function, I do not get the error.

> PySpark Dataframe return wrong results with custom UDF
> ------------------------------------------------------
>
>                 Key: SPARK-13301
>                 URL: https://issues.apache.org/jira/browse/SPARK-13301
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>         Environment: PySpark in yarn-client mode - CDH 5.5.1
>            Reporter: Simone
>            Priority: Critical
>
> Using a User Defined Function in PySpark inside the withColumn() method of a
> DataFrame gives wrong results.
> Here is an example:
> from pyspark.sql import functions
> import string
> myFunc = functions.udf(lambda s: string.lower(s))
> myDF.select("col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show()
> |                col1|     col2|                col3|
> |1265AB4F65C05740E...|      Ivo|4f00ae514e7c015be...|
> |1D94AB4F75C83B51E...| Raffaele|4f00dcf6422100c0e...|
> |4F008903600A0133E...| Cristina|4f008903600a0133e...|
> The results are wrong and seem to be random: some records are correct (for
> example the third), others are not (for example the first two).
> The problem does not seem to occur with Spark built-in functions:
> from pyspark.sql.functions import *
> myDF.select("col1", "col2").withColumn("col3", lower(myDF["col1"])).show()
> Without the withColumn() method, results seem to be always correct:
> myDF.select("col1", "col2", myFunc(myDF["col1"])).show()
> This can be considered only a partial workaround, because you have to list
> every column of your DataFrame each time.
> Also, the problem does not seem to occur in Scala/Java.
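
The quoted report says that select() with every column listed gave correct results, at the cost of typing each column name. A minimal sketch of a helper that builds that list from df.columns might look like the following. The name add_column_via_select is hypothetical (it is not part of PySpark), and this has not been verified against the affected Spark 1.5.0 build, so treat it as a possible workaround rather than a confirmed fix.

```python
def add_column_via_select(df, name, column):
    """Return df plus one extra column, built with select() instead of withColumn().

    `add_column_via_select` is a hypothetical helper name, not a PySpark API.
    It relies only on DataFrame.columns, DataFrame.select and Column.alias.
    """
    # df.columns is the list of existing column names, so none are typed by hand.
    # select() accepts a single list that mixes column names and Column objects.
    return df.select(list(df.columns) + [column.alias(name)])


# Usage with the UDF from the comment above (needs a running SparkContext):
#   df2 = add_column_via_select(df, "test", test_function_udf(df["id"]))
```

Whether this sidesteps the wrong results depends on the bug being specific to withColumn(), as the report suggests; comparing its output against the built-in concat version on a few rows would be a reasonable check.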