[ https://issues.apache.org/jira/browse/SPARK-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luis Guerra updated SPARK-9131:
-------------------------------
    Attachment: testjson_jira9131.z01
                testjson_jira9131.z02
                testjson_jira9131.z03
                testjson_jira9131.z04
                testjson_jira9131.z05
                testjson_jira9131.z06
                testjson_jira9131.zip

I hope they work fine. I have split the attachment into several files to stay under the upload size limit.

> UDFs change data values
> -----------------------
>
>                 Key: SPARK-9131
>                 URL: https://issues.apache.org/jira/browse/SPARK-9131
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.4.0, 1.4.1
>         Environment: PySpark 1.4 and 1.4.1
>            Reporter: Luis Guerra
>            Priority: Critical
>         Attachments: testjson_jira9131.z01, testjson_jira9131.z02, testjson_jira9131.z03, testjson_jira9131.z04, testjson_jira9131.z05, testjson_jira9131.z06, testjson_jira9131.zip
>
>
> I am having some trouble using a custom UDF on DataFrames with PySpark 1.4. I have rewritten the UDF to simplify the problem, and the behavior gets even weirder: the UDFs I am using do absolutely nothing; they just receive a value and return the same value in the same format.
> My code is shown below:
> {code}
> c = a.join(b, a['ID'] == b['ID_new'], 'inner')
> c.filter(c['ID'] == '6000000002698917').show()
> udf_A = UserDefinedFunction(lambda x: x, DateType())
> udf_B = UserDefinedFunction(lambda x: x, DateType())
> udf_C = UserDefinedFunction(lambda x: x, DateType())
> d = c.select(c['ID'], c['t1'].alias('ta'),
>              udf_A(vinc_muestra['t2']).alias('tb'),
>              udf_B(vinc_muestra['t1']).alias('tc'),
>              udf_C(vinc_muestra['t2']).alias('td'))
> d.filter(d['ID'] == '6000000002698917').show()
> {code}
> These are the resulting outputs:
> {code}
> +----------------+----------------+----------+----------+
> |              ID|          ID_new|        t1|        t2|
> +----------------+----------------+----------+----------+
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-28|2014-02-28|
> |6000000002698917|6000000002698917|2012-02-20|2013-02-20|
> +----------------+----------------+----------+----------+
>
> +----------------+----------+----------+----------+----------+
> |              ID|        ta|        tb|        tc|        td|
> +----------------+----------+----------+----------+----------+
> |6000000002698917|2012-02-28|2007-03-05|2003-03-05|2014-02-28|
> |6000000002698917|2012-02-20|2007-02-15|2002-02-15|2013-02-20|
> |6000000002698917|2012-02-28|2007-03-10|2005-03-10|2014-02-28|
> |6000000002698917|2012-02-20|2007-03-05|2003-03-05|2013-02-20|
> |6000000002698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
> |6000000002698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
> |6000000002698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
> |6000000002698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
> +----------------+----------+----------+----------+----------+
> {code}
> The problem is that the values in columns 'tb', 'tc' and 'td' of dataframe 'd' are completely different from the values of 't1' and 't2' in dataframe 'c', even though my UDFs do nothing. It seems as if the values were somehow taken from other rows (or simply invented). The results also differ between executions (apparently at random).
> Thanks in advance

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
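[Editor's note] The invariant the reporter expects can be stated without Spark at all: an identity function must return its input unchanged, so columns 'tb', 'tc' and 'td' should exactly reproduce the source columns they were computed from. The sketch below illustrates only that expectation in plain Python (the row values are taken from the first output table above); it does not exercise PySpark or reproduce the bug itself.

```python
import datetime

# Plain-Python analogues of udf_A/udf_B/udf_C from the report:
# identity functions that must return their input unchanged.
udf_a = lambda x: x
udf_b = lambda x: x
udf_c = lambda x: x

# Two of the (t1, t2) pairs shown in the first output table.
rows = [
    {"t1": datetime.date(2012, 2, 28), "t2": datetime.date(2014, 2, 28)},
    {"t1": datetime.date(2012, 2, 20), "t2": datetime.date(2013, 2, 20)},
]

# tb should equal t2, tc should equal t1, td should equal t2.
# Any deviation (as seen in the second output table) indicates a bug,
# since the functions perform no transformation at all.
for r in rows:
    assert udf_a(r["t2"]) == r["t2"]  # expected 'tb'
    assert udf_b(r["t1"]) == r["t1"]  # expected 'tc'
    assert udf_c(r["t2"]) == r["t2"]  # expected 'td'
```

In the reported output, 'td' does match 't2', while 'tb' and 'tc' contain unrelated dates, which is what makes the behavior look like values being read from other rows.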