[jira] [Commented] (SPARK-11231) join returns schema with duplicated and ambiguous join columns

Xiao Li (JIRA) Tue, 10 Nov 2015 23:01:19 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-11231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000030#comment-15000030
 ]


Xiao Li commented on SPARK-11231:
---------------------------------

Your join is not natural join. It could return duplicate names. You can specify 
the column names by adding "select" above the dataframe generated by the join. 
The Dataframe will optimize your query. 


> join returns schema with duplicated and ambiguous join columns
> --------------------------------------------------------------
>
>                 Key: SPARK-11231
>                 URL: https://issues.apache.org/jira/browse/SPARK-11231
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 1.5.1
>         Environment: R
>            Reporter: Matt Pollock
>
> In the case where the key column of two data frames are named the same thing, 
> join returns a data frame where that column is duplicated. Since the content 
> of the columns is guaranteed to be the same by row consolidating the 
> identical columns into a single column would replicate standard R behavior[1] 
> and help prevent ambiguous names.
> Example:
> {code}
> > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> > sdf1 <- createDataFrame(sqlContext, df1)
> > sdf2 <- createDataFrame(sqlContext, df2)
> > sjdf <- join(sdf1, sdf2, sdf1$key == sdf2$key, "inner")
> > schema(sjdf)
> StructType
> |-name = "key", type = "StringType", nullable = TRUE
> |-name = "value1", type = "DoubleType", nullable = TRUE
> |-name = "key", type = "StringType", nullable = TRUE
> |-name = "value2", type = "DoubleType", nullable = TRUE
> {code}
> The duplicated key columns cause things like:
> {code}
> > library(magrittr)
> > sjdf %>% select("key")
> 15/10/21 11:04:28 ERROR r.RBackendHandler: select on 1414 failed
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
>   org.apache.spark.sql.AnalysisException: Reference 'key' is ambiguous, could 
> be: key#125, key#127.;
>       at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
>       at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:162)
>       at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
>       at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
>       at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>       at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:403)
>       at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:399)
>       at org.apache.spark.sql.catalyst.tree
> {code}
> [1] In base R there is no"join", but a similar function "merge" is provided 
> in which a "by" argument identifies the shared key column in the two data 
> frames. In the case where the key column names differ "by.x" and "by.y" 
> arguments can be used. In the case of same-named key columns the 
> consolidation behavior requested above is observed. In the case of differing 
> names they "by.x" name is retained and consolidated with the "by.y" column 
> which is dropped.
> {code}
> > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> > merge(df1, df2, by="key")
>   key value1 value2
> 1   A      1      4
> 2   B      2      5
> 3   C      3      6
> df3 <- data.frame(akey=c("A", "B", "C"), value1=c(1, 2, 3))
> > merge(df2, df3, by.x="key", by.y="akey")
>   key value2 value1
> 1   A      4      1
> 2   B      5      2
> 3   C      6      3
> > merge(df3, df2, by.x="akey", by.y="key")
>   akey value1 value2
> 1    A      1      4
> 2    B      2      5
> 3    C      3      6
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-11231) join returns schema with duplicated and ambiguous join columns

Reply via email to