[
https://issues.apache.org/jira/browse/SPARK-6743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543602#comment-14543602
]
Santiago M. Mola commented on SPARK-6743:
-----------------------------------------
This problem only happens for cached relations. Here is the root of the problem:
{code}
/* Fails. Got: Array(Row("A1"), Row("A2")) */
assertResult(Array(Row(), Row()))(
  InMemoryColumnarTableScan(Nil, Nil,
    sqlc.table("tab0").queryExecution.sparkPlan.asInstanceOf[InMemoryColumnarTableScan].relation)
    .execute().collect()
)
{code}
InMemoryColumnarTableScan returns the narrowest column when no attributes are
requested:
{code}
// Find the ordinals and data types of the requested columns. If none are requested, use the
// narrowest (the field with minimum default element size).
val (requestedColumnIndices, requestedColumnDataTypes) = if (attributes.isEmpty) {
  val (narrowestOrdinal, narrowestDataType) =
    relation.output.zipWithIndex.map { case (a, ordinal) =>
      ordinal -> a.dataType
    } minBy { case (_, dataType) =>
      ColumnType(dataType).defaultSize
    }
  Seq(narrowestOrdinal) -> Seq(narrowestDataType)
} else {
  attributes.map { a =>
    relation.output.indexWhere(_.exprId == a.exprId) -> a.dataType
  }.unzip
}
{code}
It seems this is what leads to incorrect results.
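The effect of that fallback can be reproduced without Spark. The following is only a minimal sketch: Column, EmptyProjectionSketch, scanWithFallback and scanExact are hypothetical names invented for illustration, not Spark classes, and the defaultSize value of 4 is an assumed size for an integer column. The data is tab0 from the report.

```scala
// Simplified stand-ins for illustration only; these are NOT Spark's classes.
final case class Column(name: String, defaultSize: Int, values: Vector[Any])

object EmptyProjectionSketch {
  // tab0 from the report, stored column by column (size 4 is an assumption).
  val tab0: Seq[Column] = Seq(
    Column("_1", 4, Vector(83, 26, 43)),
    Column("_2", 4, Vector(0, 0, 81)),
    Column("_3", 4, Vector(38, 79, 24))
  )

  // Mirrors the fallback quoted above: when nothing is requested,
  // substitute the column with the smallest default element size.
  def scanWithFallback(table: Seq[Column], requested: Seq[String]): Seq[Seq[Any]] = {
    val chosen =
      if (requested.isEmpty) Seq(table.minBy(_.defaultSize))
      else requested.map(n => table.find(_.name == n).get)
    rowsOf(table, chosen)
  }

  // What the failing assertion expects: exactly the requested columns,
  // so an empty request yields one zero-width row per stored row.
  def scanExact(table: Seq[Column], requested: Seq[String]): Seq[Seq[Any]] =
    rowsOf(table, requested.map(n => table.find(_.name == n).get))

  // Turns the chosen columns back into rows.
  private def rowsOf(table: Seq[Column], chosen: Seq[Column]): Seq[Seq[Any]] = {
    val rowCount = table.headOption.map(_.values.length).getOrElse(0)
    (0 until rowCount).map(i => chosen.map(_.values(i)))
  }
}
```

With an empty request, scanWithFallback yields rows of width one while scanExact yields rows of width zero. A consumer that assumes it received exactly the columns it asked for would then read a value from the wrong column.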
> Join with empty projection on one side produces invalid results
> ---------------------------------------------------------------
>
> Key: SPARK-6743
> URL: https://issues.apache.org/jira/browse/SPARK-6743
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: Santiago M. Mola
> Priority: Critical
>
> {code:java}
> val sqlContext = new SQLContext(sc)
> val tab0 = sc.parallelize(Seq(
> (83,0,38),
> (26,0,79),
> (43,81,24)
> ))
> sqlContext.registerDataFrameAsTable(sqlContext.createDataFrame(tab0), "tab0")
> sqlContext.cacheTable("tab0")
> val df1 = sqlContext.sql("SELECT tab0._2, cor0._2 FROM tab0, tab0 cor0 GROUP BY tab0._2, cor0._2")
> val result1 = df1.collect()
> val df2 = sqlContext.sql("SELECT cor0._2 FROM tab0, tab0 cor0 GROUP BY cor0._2")
> val result2 = df2.collect()
> val df3 = sqlContext.sql("SELECT cor0._2 FROM tab0 cor0 GROUP BY cor0._2")
> val result3 = df3.collect()
> {code}
> Given the previous code, result2 equals Row(43), Row(83), Row(26), which
> is wrong. These results correspond to cor0._1 instead of cor0._2. The correct
> results would be Row(0), Row(81), which is what the third query returns. The
> first query also produces valid results; the only difference is that the
> projection on the left side of the join is not empty.
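As a cross-check of the expected values stated in the report, the three queries can be evaluated over plain Scala collections. This is a sketch only: ExpectedResultsSketch and its members are hypothetical names, and GROUP BY without aggregates is modelled as taking the distinct set of the grouping columns.

```scala
object ExpectedResultsSketch {
  // Same rows as tab0 in the report.
  val tab0: Seq[(Int, Int, Int)] = Seq((83, 0, 38), (26, 0, 79), (43, 81, 24))

  // FROM tab0, tab0 cor0: a cross join of the table with itself.
  val joined: Seq[((Int, Int, Int), (Int, Int, Int))] =
    for (left <- tab0; cor0 <- tab0) yield (left, cor0)

  // Query 1: SELECT tab0._2, cor0._2 ... GROUP BY tab0._2, cor0._2
  val expected1: Set[(Int, Int)] =
    joined.map { case (left, cor0) => (left._2, cor0._2) }.toSet

  // Query 2: SELECT cor0._2 FROM tab0, tab0 cor0 GROUP BY cor0._2
  val expected2: Set[Int] = joined.map { case (_, cor0) => cor0._2 }.toSet

  // Query 3: SELECT cor0._2 FROM tab0 cor0 GROUP BY cor0._2
  val expected3: Set[Int] = tab0.map(_._2).toSet
}
```

Under this model the second and third queries produce the same set of values, which is consistent with the report's statement that Row(0), Row(81) is the correct answer for both.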