[
https://issues.apache.org/jira/browse/SPARK-6743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543435#comment-14543435
]
Santiago M. Mola commented on SPARK-6743:
-----------------------------------------
Sorry, my first example was not very clear. Here is a more precise one:
{code}
val sqlc = new SQLContext(sc)
val tab0 = sc.parallelize(Seq(
Tuple1("A1"),
Tuple1("A2")
))
sqlc.registerDataFrameAsTable(sqlc.createDataFrame(tab0), "tab0")
sqlc.cacheTable("tab0")
val tab1 = sc.parallelize(Seq(
Tuple1("B1"),
Tuple1("B2")
))
sqlc.registerDataFrameAsTable(sqlc.createDataFrame(tab1), "tab1")
sqlc.cacheTable("tab1")
/* Succeeds */
val result1 = sqlc.sql("SELECT tab0._1,tab1._1 FROM tab0, tab1 GROUP BY " +
"tab0._1,tab1._1 ORDER BY tab0._1, tab1._1").collect()
assertResult(Array(Row("A1", "B1"), Row("A1", "B2"), Row("A2", "B1"),
Row("A2", "B2")))(result1)
/* Fails. Got: Array([A1], [A2]) */
val result2 = sqlc.sql("SELECT tab1._1 FROM tab0, tab1 GROUP BY tab1._1 " +
"ORDER BY tab1._1").collect()
assertResult(Array(Row("B1"), Row("B2")))(result2)
{code}
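For reference, the expected rows can be derived without Spark. The following is a minimal sketch using plain Scala collections; the object name ExpectedResults and the value names are illustrative only and are not part of the test above. It models "FROM tab0, tab1" as a Cartesian product and GROUP BY / ORDER BY on the selected columns as distinct followed by sorted:

```scala
// Plain Scala collections only; no Spark required.
// ExpectedResults, joined, expected1 and expected2 are hypothetical names.
object ExpectedResults {
  val tab0: Seq[Tuple1[String]] = Seq(Tuple1("A1"), Tuple1("A2"))
  val tab1: Seq[Tuple1[String]] = Seq(Tuple1("B1"), Tuple1("B2"))

  // FROM tab0, tab1: the Cartesian product of both tables.
  val joined: Seq[(String, String)] =
    for (a <- tab0; b <- tab1) yield (a._1, b._1)

  // SELECT tab0._1, tab1._1 ... GROUP BY and ORDER BY both columns.
  val expected1: Seq[(String, String)] = joined.distinct.sorted

  // SELECT tab1._1 ... GROUP BY tab1._1 ORDER BY tab1._1.
  val expected2: Seq[String] = joined.map(_._2).distinct.sorted

  def main(args: Array[String]): Unit = {
    println(expected1)
    println(expected2)
  }
}
```

Here expected2 contains only values from tab1 (B1, B2), which is what the second query should return instead of the tab0 values A1, A2.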
> Join with empty projection on one side produces invalid results
> ---------------------------------------------------------------
>
> Key: SPARK-6743
> URL: https://issues.apache.org/jira/browse/SPARK-6743
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: Santiago M. Mola
> Priority: Critical
>
> {code:java}
> val sqlContext = new SQLContext(sc)
> val tab0 = sc.parallelize(Seq(
> (83,0,38),
> (26,0,79),
> (43,81,24)
> ))
> sqlContext.registerDataFrameAsTable(sqlContext.createDataFrame(tab0),
> "tab0")
> sqlContext.cacheTable("tab0")
> val df1 = sqlContext.sql("SELECT tab0._2, cor0._2 FROM tab0, tab0 cor0 " +
> "GROUP BY tab0._2, cor0._2")
> val result1 = df1.collect()
> val df2 = sqlContext.sql("SELECT cor0._2 FROM tab0, tab0 cor0 GROUP BY " +
> "cor0._2")
> val result2 = df2.collect()
> val df3 = sqlContext.sql("SELECT cor0._2 FROM tab0 cor0 GROUP BY cor0._2")
> val result3 = df3.collect()
> {code}
> Given the code above, result2 equals Row(43), Row(83), Row(26), which is
> wrong: these values come from cor0._1 instead of cor0._2. The correct
> results are Row(0), Row(81), which is what the third query returns. The
> first query also produces valid results; the only difference is that its
> projection on the left side of the join is not empty.