[
https://issues.apache.org/jira/browse/SPARK-6743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543466#comment-14543466
]
Santiago M. Mola commented on SPARK-6743:
-----------------------------------------
Note that the bug is not related to GROUP BY; that is just a quick way to
produce a Project logical plan with an empty projection list from SQL. Building
upon my previous test case, here are some further instances of the bug using
logical plans and DataFrames:
{code}
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.dsl.expressions._
val plan0 = sqlc.table("tab0").logicalPlan.subquery('tab0)
val plan1 = sqlc.table("tab1").logicalPlan.subquery('tab1)

/* Succeeds */
val planA = plan0.select('_1 as "c0")
  .join(plan1.select('_1 as "c1"))
  .select('c0, 'c1)
  .orderBy('c0.asc, 'c1.asc)
assertResult(Array(Row("A1", "B1"), Row("A1", "B2"), Row("A2", "B1"),
  Row("A2", "B2")))(DataFrame(sqlc, planA).collect())

/* Fails. Got: Array([A1], [A1], [A2], [A2]) */
val planB = plan0.select('_1 as "c0")
  .join(plan1.select('_1 as "c1"))
  .select('c1)
  .orderBy('c1.asc)
assertResult(Array(Row("B1"), Row("B1"), Row("B2"),
  Row("B2")))(DataFrame(sqlc, planB).collect())

/* Fails. Got: Array([A1], [A1], [A2], [A2]) */
val planC = plan0.select()
  .join(plan1.select('_1 as "c1"))
  .select('c1)
  .orderBy('c1.asc)
assertResult(Array(Row("B1"), Row("B1"), Row("B2"),
  Row("B2")))(DataFrame(sqlc, planC).collect())
{code}
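For reference, the expected and observed results of planB and planC can be modeled with plain Scala collections, without Spark. This is only a rough sketch: the table contents are an assumption inferred from the planA assertion (tab0._1 holding A1 and A2, tab1._1 holding B1 and B2), and the names expectedC1 and observed are made up for illustration.
{code}
// Assumed first-column values, inferred from the planA assertion;
// the real tab0/tab1 come from the earlier test case.
val tab0 = Seq("A1", "A2") // values of tab0._1
val tab1 = Seq("B1", "B2") // values of tab1._1

// Unconditioned join (cross product), keeping only the right-hand column c1.
val expectedC1 = (for (_ <- tab0; c1 <- tab1) yield c1).sorted

// The failing plans return values that coincide with the left-hand column.
val observed = (for (c0 <- tab0; _ <- tab1) yield c0).sorted

assert(expectedC1 == Seq("B1", "B1", "B2", "B2"))
assert(observed == Seq("A1", "A1", "A2", "A2"))
{code}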
> Join with empty projection on one side produces invalid results
> ---------------------------------------------------------------
>
> Key: SPARK-6743
> URL: https://issues.apache.org/jira/browse/SPARK-6743
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: Santiago M. Mola
> Priority: Critical
>
> {code:java}
> val sqlContext = new SQLContext(sc)
> val tab0 = sc.parallelize(Seq(
> (83,0,38),
> (26,0,79),
> (43,81,24)
> ))
> sqlContext.registerDataFrameAsTable(sqlContext.createDataFrame(tab0),
> "tab0")
> sqlContext.cacheTable("tab0")
> val df1 = sqlContext.sql("SELECT tab0._2, cor0._2 FROM tab0, tab0 cor0 GROUP BY tab0._2, cor0._2")
> val result1 = df1.collect()
> val df2 = sqlContext.sql("SELECT cor0._2 FROM tab0, tab0 cor0 GROUP BY cor0._2")
> val result2 = df2.collect()
> val df3 = sqlContext.sql("SELECT cor0._2 FROM tab0 cor0 GROUP BY cor0._2")
> val result3 = df3.collect()
> {code}
> Given the previous code, result2 equals Row(43), Row(83), Row(26), which
> is wrong. These values correspond to cor0._1 instead of cor0._2. The correct
> results would be Row(0), Row(81), which is what the third query returns. The
> first query also produces valid results; the only difference is that there
> the projection on the left side of the join is not empty.
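
As a sanity check on the expected values for the second query, the same join and grouping can be sketched with plain Scala collections, using the three rows from the description. The names joined, expected2 and wrong2 are hypothetical, and GROUP BY without aggregates is modeled here simply as taking the distinct values.
{code}
// Rows of tab0 as given in the issue description.
val tab0 = Seq((83, 0, 38), (26, 0, 79), (43, 81, 24))

// FROM tab0, tab0 cor0: a cross product of the table with itself.
val joined = for (left <- tab0; cor0 <- tab0) yield (left, cor0)

// SELECT cor0._2 ... GROUP BY cor0._2: the distinct values of cor0._2.
val expected2 = joined.map { case (_, cor0) => cor0._2 }.toSet

// The reported wrong output matches the distinct values of cor0._1.
val wrong2 = joined.map { case (_, cor0) => cor0._1 }.toSet

assert(expected2 == Set(0, 81))
assert(wrong2 == Set(26, 43, 83))
{code}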