Ian Hellstrom created SPARK-15527:
-------------------------------------
Summary: Duplicate column names with different case after join of DataFrames
Key: SPARK-15527
URL: https://issues.apache.org/jira/browse/SPARK-15527
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.1
Reporter: Ian Hellstrom
Fix For: 1.6.0
In 1.4.1, a join can produce duplicate column names when the case
(upper/lower/mixed) of the join column differs between the two DataFrames. I
have checked 1.6.0, and there Spark behaves as expected: the join columns are
matched in a case-sensitive fashion. In 1.4.1, joins appear to be
case-insensitive, yet the results are inconsistent.
I did not find a related ticket, hence I'm opening this one even though it's
technically fixed, just in case this happens to be a coincidence.
Here's a minimal example to check:
{code}
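// Run in spark-shell, where sc and sqlContext are predefined.
import sqlContext.implicits._ // provides toDF; already in scope in spark-shell

case class Test(id: Int, value: String)
val lhs = sc.parallelize(List(Test(1, "A"), Test(2, "B"), Test(3, "C"))).toDF
val rhs = sc.parallelize(List(Test(1, "AA"), Test(2, "BB"), Test(4, "D"))).toDF
val rhsId = rhs.withColumnRenamed("id", "ID") // same data, join column upper-cased
val full = lhs.join(rhs, "id")       // same case on both sides
val fullId = lhs.join(rhsId, "id")   // both id and ID in result in 1.4.1
val fullID = lhs.join(rhsId, "ID")   // only id in result in 1.4.1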
{code}
The last two joins don't execute on 1.6.0 because "id" is not found in rhsId
(first case) and "ID" is not found in lhs (second case). On 1.4.1 both execute,
but the results differ. The former gives a DataFrame with both id and ID, even
though it's clear the rows were matched on that column, and the latter contains
only id.
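As a possible way to sidestep name-based matching altogether, the join condition can be spelled out as a column expression and the output columns selected explicitly. This is only a sketch and was not part of the checks described above: the helper name joinOnIdExplicit and the output names lhsValue and rhsValue are made up for illustration, and it assumes the lhs and rhsId DataFrames from the example.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not from the original report): joins on an explicit
// column expression, so the join no longer depends on how a single
// column-name string is resolved on each side.
def joinOnIdExplicit(lhs: DataFrame, rhsId: DataFrame): DataFrame =
  lhs.join(rhsId, lhs("id") === rhsId("ID"))
    // Keep one key column and give the two value columns distinct,
    // illustrative names so nothing in the result is ambiguous.
    .select(
      lhs("id"),
      lhs("value").as("lhsValue"),
      rhsId("value").as("rhsValue"))

val explicit = joinOnIdExplicit(lhs, rhsId)
```

Because each side is referenced through its own DataFrame, the result should contain a single id column regardless of how the using-column form of join treats case.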