Aravind B created SPARK-12556:
----------------------------------
Summary: Pyspark dataframe unionAll call accepts incorrect input
Key: SPARK-12556
URL: https://issues.apache.org/jira/browse/SPARK-12556
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.4.1
Reporter: Aravind B
I actually encountered this problem with two dataframes that have 8 and 10
columns respectively. The made-up example below reproduces what I observed
going wrong.
Consider the two dataframes:
df1:
+-------+----------+
|     id|     count|
+-------+----------+
+-------+----------+
df2:
+-------+---------+----------+
|     id|new_count|     count|
+-------+---------+----------+
|      1|        4|         6|
|      1|        5|         6|
|      3|        6|         6|
|      2|        7|         6|
+-------+---------+----------+
The call:
df3 = df1.unionAll(df2)
returns successfully, with df3 containing 2 columns. However, some columns now
hold values that belong to other columns. Based on my previous experience, I
would say that df3's count column will actually be the new_count column.
I believe that this call should never complete successfully in the first place
and should throw an exception instead.
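Until the call itself rejects such input, a caller-side guard along the
following lines might serve as a workaround. This is only a sketch:
checked_union_all is a hypothetical helper, not part of PySpark, and it
assumes only that both inputs expose a `columns` attribute and a `unionAll`
method, as PySpark dataframes do.

```python
def checked_union_all(left, right):
    """Union two dataframes only if their column lists match exactly.

    Hypothetical helper, not a PySpark API. It compares column names and
    their order, and raises instead of letting a mismatched union through.
    """
    left_cols = list(left.columns)
    right_cols = list(right.columns)
    if left_cols != right_cols:
        raise ValueError(
            "unionAll schema mismatch: %r vs %r" % (left_cols, right_cols)
        )
    # Column names and order agree, so delegate to the real union.
    return left.unionAll(right)
```

With df1 and df2 as above, checked_union_all(df1, df2) would raise a
ValueError, because the column lists differ in both length and order.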