GUAN Hao created SPARK-16869:
--------------------------------
Summary: Wrong projection when join on columns with the same name
which are derived from the same dataframe
Key: SPARK-16869
URL: https://issues.apache.org/jira/browse/SPARK-16869
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.0
Reporter: GUAN Hao
I have to DataFrames, both contain a column named *i* which are derived from a
same DataFrame (join).
{code}
b
+---+---+---+---+
| j| p| i| k|
+---+---+---+---+
| 3| 2| 3| 3|
| 2| 1| 2| 2|
+---+---+---+---+
c
+---+---+---+---+
| j| k| q| i|
+---+---+---+---+
| 1| 1| 0| 1|
| 2| 2| 1| 2|
+---+---+---+---+
{code}
The result of OUTER join of two DataFrames above is:
{code}
i = colaesce(b.i, c.i)
+----+----+----+---+---+----+----+
| b_i| c_i| i| j| k| p| q|
+----+----+----+---+---+----+----+
| 2| 2| 2| 2| 2| 1| 1|
|null| 1| 1| 1| 1|null| 0|
| 3|null| 3| 3| 3| 2|null|
+----+----+----+---+---+----+----+
{code}
However, what I got is:
{code}
+----+----+----+---+---+----+----+
| b_i| c_i| i| j| k| p| q|
+----+----+----+---+---+----+----+
| 2| 2| 2| 2| 2| 1| 1|
|null|null|null| 1| 1|null| 0|
| 3| 3| 3| 3| 3| 2|null|
+----+----+----+---+---+----+----+
{code}
{code}
== Physical Plan ==
*Project [i#0L AS b_i#146L, i#0L AS c_i#147L, coalesce(i#0L, i#0L) AS i#148L,
coalesce(j#12L, j#21L) AS j#149L, coalesce(k#2L, k#22L) AS k#150L, p#13L, q#23L]
+- SortMergeJoin [i#0L, j#12L, k#2L], [i#113L, j#21L, k#22L], FullOuter
....
{code}
As shown in the plan, columns {{b.i}} and {{c.i}} are correctly resolved to
{{i#0L}} and {{i#113L}} correspondingly in the join condition part. However,
in the projection part, both {{b.i}} and {{c.i}} are resolved to {{i#0L}}.
Complete code to re-produce:
{code}
from pyspark import SparkContext, SQLContext
from pyspark.sql import Row, functions
sc = SparkContext()
sqlContext = SQLContext(sc)
data_a = sc.parallelize([
Row(i=1, j=1, k=1),
Row(i=2, j=2, k=2),
Row(i=3, j=3, k=3),
])
table_a = sqlContext.createDataFrame(data_a)
table_a.show()
data_b = sc.parallelize([
Row(j=2, p=1),
Row(j=3, p=2),
])
table_b = sqlContext.createDataFrame(data_b)
table_b.show()
data_c = sc.parallelize([
Row(j=1, k=1, q=0),
Row(j=2, k=2, q=1),
])
table_c = sqlContext.createDataFrame(data_c)
table_c.show()
b = table_b.join(table_a, table_b.j == table_a.j).drop(table_a.j)
c = table_c.join(table_a, (table_c.j == table_a.j)
& (table_c.k == table_a.k)) \
.drop(table_a.j) \
.drop(table_a.k)
b.show()
c.show()
{code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]