Ian Hellstrom created SPARK-15527:
-------------------------------------

             Summary: Duplicate column names with different case after join of 
DataFrames
                 Key: SPARK-15527
                 URL: https://issues.apache.org/jira/browse/SPARK-15527
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.4.1
            Reporter: Ian Hellstrom
             Fix For: 1.6.0


On 1.4.1, a join can produce duplicate column names that differ only in case 
(upper/lower/mixed). On 1.6.0 Spark behaves as expected: I have checked that the 
join columns are matched in a case-sensitive fashion. On 1.4.1, joins appear to 
be case-insensitive, but the results are inconsistent.

I did not find a related ticket, hence I'm opening this one even though it's 
technically fixed, just in case this happens to be a coincidence.

Here's a minimal example to check:

{code}
// Run in spark-shell, which provides sc and imports sqlContext.implicits._ for toDF.
case class Test(id: Int, value: String)

val lhs = sc.parallelize(List(Test(1, "A"), Test(2, "B"), Test(3, "C"))).toDF
val rhs = sc.parallelize(List(Test(1, "AA"), Test(2, "BB"), Test(4, "D"))).toDF
val rhsId = rhs.withColumnRenamed("id", "ID")

val full = lhs.join(rhs, "id")
val fullId = lhs.join(rhsId, "id") // both id and ID in result in 1.4.1
val fullID = lhs.join(rhsId, "ID") // only id in result in 1.4.1
{code}

The last two joins don't execute on 1.6.0 because "id" is not found in rhsId 
(first case) and "ID" is not found in lhs (second case). On 1.4.1 you can see 
the difference: the former gives a DataFrame with both id and ID columns, even 
though it's clear the rows were matched, whereas the latter contains only id.
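
To check a join result for this problem without inspecting the schema by eye, a 
small helper along the following lines might do. It is only a sketch: 
caseCollisions is a hypothetical name, not part of Spark, and the column lists 
are made up for illustration. It groups column names case-insensitively and 
keeps the groups that contain more than one spelling.

```scala
// Hypothetical helper, not part of Spark: groups column names that are equal
// when compared case-insensitively and keeps only the groups that contain
// more than one distinct spelling.
def caseCollisions(columns: Seq[String]): Map[String, Seq[String]] =
  columns
    .groupBy(_.toLowerCase)
    .filter { case (_, names) => names.distinct.size > 1 }

// Column lists assumed for illustration only; on a real DataFrame use df.columns.
val withDuplicate = Seq("id", "value", "ID")
val withoutDuplicate = Seq("id", "value")

println(caseCollisions(withDuplicate))    // one group, keyed by "id"
println(caseCollisions(withoutDuplicate)) // empty map
```

In spark-shell, caseCollisions(fullId.columns) should work as well, and given 
the behaviour described above I would expect it to flag id/ID on 1.4.1. Columns 
that repeat with identical spelling are deliberately not flagged, since they are 
a separate matter from the case mismatch reported here.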



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
