Matt Pollock created SPARK-11231:
------------------------------------

             Summary: join returns schema with duplicated and ambiguous join columns
                 Key: SPARK-11231
                 URL: https://issues.apache.org/jira/browse/SPARK-11231
             Project: Spark
          Issue Type: Bug
          Components: SparkR
    Affects Versions: 1.5.1
         Environment: R
            Reporter: Matt Pollock


When the key columns of two data frames have the same name, join returns a data
frame in which that column is duplicated. Because the two columns are guaranteed
to hold the same value in every row, consolidating them into a single column
would replicate standard R behavior and help prevent ambiguous names.

Example:
{code}
> df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> sdf1 <- createDataFrame(sqlContext, df1)
> sdf2 <- createDataFrame(sqlContext, df2)
> sjdf <- join(sdf1, sdf2, sdf1$key == sdf2$key, "inner")
> schema(sjdf)
StructType
|-name = "key", type = "StringType", nullable = TRUE
|-name = "value1", type = "DoubleType", nullable = TRUE
|-name = "key", type = "StringType", nullable = TRUE
|-name = "value2", type = "DoubleType", nullable = TRUE
{code}
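
For comparison, here is a minimal sketch of the "standard R behavior" referred to above. It uses base R's merge on the same local data frames; the variable name merged is only illustrative and is not part of the original report. The shared key column appears once in the result.

```r
# Same local data frames as in the report.
df1 <- data.frame(key = c("A", "B", "C"), value1 = c(1, 2, 3))
df2 <- data.frame(key = c("A", "B", "C"), value2 = c(4, 5, 6))

# Base R merge on the shared column: "key" is kept as a single column.
merged <- merge(df1, df2, by = "key")

# Column names of the merged result: key, value1, value2.
print(names(merged))
```

This is the schema the report suggests SparkR's join could return when both sides use the same key name.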

The duplicated key columns cause errors such as:
{code}
> library(magrittr)
> sjdf %>% select("key")
15/10/21 11:04:28 ERROR r.RBackendHandler: select on 1414 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Reference 'key' is ambiguous, could be: key#125, key#127.;
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:162)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
        at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:403)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:399)
        at org.apache.spark.sql.catalyst.tree
{code}



