[ https://issues.apache.org/jira/browse/SPARK-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luis Guerra updated SPARK-9131:
-------------------------------
    Attachment: testjson_jira9131.z01
                testjson_jira9131.z02
                testjson_jira9131.z03
                testjson_jira9131.z04
                testjson_jira9131.z05
                testjson_jira9131.z06
                testjson_jira9131.zip

I hope they work fine. I have split the attachment into several files to stay under the upload size limit.

> UDFs change data values
> -----------------------
>
>                 Key: SPARK-9131
>                 URL: https://issues.apache.org/jira/browse/SPARK-9131
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.4.0, 1.4.1
>         Environment: PySpark 1.4 and 1.4.1
>            Reporter: Luis Guerra
>            Priority: Critical
>         Attachments: testjson_jira9131.z01, testjson_jira9131.z02, testjson_jira9131.z03, testjson_jira9131.z04, testjson_jira9131.z05, testjson_jira9131.z06, testjson_jira9131.zip
>
>
> I am having some trouble using a custom UDF on DataFrames with PySpark 1.4. I have rewritten the UDF to simplify the problem, and the behavior gets even weirder: the UDFs I am using do absolutely nothing; they just receive a value and return the same value in the same format.
> My code is shown below:
> {code}
> c = a.join(b, a['ID'] == b['ID_new'], 'inner')
> c.filter(c['ID'] == '6000000002698917').show()
> udf_A = UserDefinedFunction(lambda x: x, DateType())
> udf_B = UserDefinedFunction(lambda x: x, DateType())
> udf_C = UserDefinedFunction(lambda x: x, DateType())
> d = c.select(c['ID'], c['t1'].alias('ta'),
>              udf_A(vinc_muestra['t2']).alias('tb'),
>              udf_B(vinc_muestra['t1']).alias('tc'),
>              udf_C(vinc_muestra['t2']).alias('td'))
> d.filter(d['ID'] == '6000000002698917').show()
> {code}
> These are the resulting outputs:
> {code}
> +----------------+----------------+----------+----------+
> |              ID|          ID_new|        t1|        t2|
> +----------------+----------------+----------+----------+
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> +----------------+----------------+----------+----------+
>
> +----------------+----------+----------+----------+----------+
> |              ID|        ta|        tb|        tc|        td|
> +----------------+----------+----------+----------+----------+
> |6000000002698917|2012-02-28|2007-03-05|2003-03-05|2014-02-28|
> |6000000002698917|2012-02-20|2007-02-15|2002-02-15|2013-02-20|
> |6000000002698917|2012-02-28|2007-03-10|2005-03-10|2014-02-28|
> |6000000002698917|2012-02-20|2007-03-05|2003-03-05|2013-02-20|
> |6000000002698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
> |6000000002698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
> |6000000002698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
> |6000000002698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
> +----------------+----------+----------+----------+----------+
> {code}
> The problem is that the values in columns 'tb', 'tc' and 'td' of dataframe 'd' are completely different from the values of 't1' and 't2' in dataframe 'c', even though my UDFs do nothing. It seems as if the values were somehow taken from other rows (or simply invented). The results also differ between executions (apparently at random).
> Thanks in advance

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
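[Editor's note] The invariant the reporter expects can be stated without Spark at all: an identity function must return its input unchanged, so columns 'tb', 'tc' and 'td' should exactly reproduce the source columns they were computed from. The sketch below illustrates only that expectation in plain Python (the row values are taken from the first output table above); it does not exercise PySpark or reproduce the bug itself.

```python
import datetime

# Plain-Python analogues of udf_A/udf_B/udf_C from the report:
# identity functions that must return their input unchanged.
udf_a = lambda x: x
udf_b = lambda x: x
udf_c = lambda x: x

# Two of the (t1, t2) pairs shown in the first output table.
rows = [
    {"t1": datetime.date(2012, 2, 28), "t2": datetime.date(2014, 2, 28)},
    {"t1": datetime.date(2012, 2, 20), "t2": datetime.date(2013, 2, 20)},
]

# tb should equal t2, tc should equal t1, td should equal t2.
# Any deviation (as seen in the second output table) indicates a bug,
# since the functions perform no transformation at all.
for r in rows:
    assert udf_a(r["t2"]) == r["t2"]  # expected 'tb'
    assert udf_b(r["t1"]) == r["t1"]  # expected 'tc'
    assert udf_c(r["t2"]) == r["t2"]  # expected 'td'
```

In the reported output, 'td' does match 't2', while 'tb' and 'tc' contain unrelated dates, which is what makes the behavior look like values being read from other rows.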