Luis Guerra created SPARK-9131:
----------------------------------
Summary: UDF change data values
Key: SPARK-9131
URL: https://issues.apache.org/jira/browse/SPARK-9131
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Affects Versions: 1.4.0
Environment: Pyspark 1.4, Redhat 6.6
Reporter: Luis Guerra
Priority: Critical
I am having trouble using a custom UDF on dataframes with PySpark 1.4.
I have rewritten the UDFs to simplify the problem, and it gets even weirder.
The UDFs I am using do absolutely nothing: they just receive a value and
return the same value in the same format.
My code is below:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DateType

c = a.join(b, a['ID'] == b['ID_new'], 'inner')
c.filter(c['ID'] == 'XX').show()
# Identity UDFs: each returns its input unchanged, declared as DateType
udf_A = UserDefinedFunction(lambda x: x, DateType())
udf_B = UserDefinedFunction(lambda x: x, DateType())
udf_C = UserDefinedFunction(lambda x: x, DateType())
# (vinc_muestra in the original snippet; assumed to be the joined dataframe c)
d = c.select(c['ID'], c['t1'].alias('ta'),
             udf_A(c['t2']).alias('tb'), udf_B(c['t1']).alias('tc'),
             udf_C(c['t2']).alias('td'))
d.filter(d['ID'] == 'XX').show()
Here are the two outputs, first from dataframe c and then from dataframe d:
+----------------+----------------+----------+----------+
|              ID|          ID_new|        t1|        t2|
+----------------+----------------+----------+----------+
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-28|2014-02-28|
|6000000002698917|6000000002698917|2012-02-20|2013-02-20|
+----------------+----------------+----------+----------+
+----------------+----------+----------+----------+----------+
|              ID|        ta|        tb|        tc|        td|
+----------------+----------+----------+----------+----------+
|6000000002698917|2012-02-28|2007-03-05|2003-03-05|  20140228|
|6000000002698917|2012-02-20|2007-02-15|  20020215|  20130220|
|6000000002698917|2012-02-28|2007-03-10|  20050310|  20140228|
|6000000002698917|2012-02-20|  20070305|2003-03-05|  20130220|
|6000000002698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
|6000000002698917|2012-02-28|2007-02-15|  20020215|2014-02-28|
|6000000002698917|2012-02-28|  20070215|2002-02-15|2014-02-28|
|6000000002698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
+----------------+----------+----------+----------+----------+
The problem is that the values in columns 'tb', 'tc' and 'td' of dataframe 'd'
are completely different from the values of 't1' and 't2' in dataframe 'c',
even though my UDFs do nothing. It seems as if the values were somehow taken
from other records (or just invented). Some values even lose their date
formatting (e.g. 20140228 instead of 2014-02-28). The results differ between
executions (apparently at random).
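
One way to make the discrepancy easier to quantify is to select the original column and the UDF output side by side and compare them row by row. The following is only a minimal sketch: the helper name find_mismatches is hypothetical (it is not part of PySpark), and the sample values are illustrative, modeled on the output above.

```python
import datetime


def find_mismatches(pairs):
    """Return (index, original, udf_output) for every pair whose values differ.

    `pairs` is any iterable of 2-tuples, e.g. the rows collected from a
    two-column select of the original column and the identity-UDF output.
    """
    return [
        (i, original, output)
        for i, (original, output) in enumerate(pairs)
        if original != output
    ]


# Hypothetical usage with the dataframes from this report (needs a Spark context):
#   pairs = c.select(c['t2'], udf_A(c['t2']).alias('tb')).collect()
#   print(find_mismatches(pairs))

# Stand-alone illustration with values modeled on the output above
sample = [
    (datetime.date(2014, 2, 28), datetime.date(2007, 3, 5)),   # value changed
    (datetime.date(2013, 2, 20), datetime.date(2013, 2, 20)),  # value preserved
]
print(find_mismatches(sample))
```

With an identity UDF the returned list should be empty; any entry in it is a row whose value was altered.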
Thanks in advance