[ 
https://issues.apache.org/jira/browse/SPARK-6743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543466#comment-14543466
 ] 

Santiago M. Mola commented on SPARK-6743:
-----------------------------------------

Note that the bug is not related to GROUP BY; that is just a quick way to 
produce a Project logical plan with an empty projection list from SQL. Building 
upon my previous test case, here are some further instances of the bug using 
logical plans and DataFrames:

{code}
    import org.apache.spark.sql.catalyst.dsl.plans._
    import org.apache.spark.sql.catalyst.dsl.expressions._
    
    val plan0 = sqlc.table("tab0").logicalPlan.subquery('tab0)
    val plan1 = sqlc.table("tab1").logicalPlan.subquery('tab1)
    
    /* Succeeds */
    val planA = plan0.select('_1 as "c0")
      .join(plan1.select('_1 as "c1"))
      .select('c0, 'c1)
      .orderBy('c0.asc, 'c1.asc)
    assertResult(Array(Row("A1", "B1"), Row("A1", "B2"), Row("A2", "B1"), Row("A2", "B2")))(
      DataFrame(sqlc, planA).collect())

    /* Fails. Got: Array([A1], [A1], [A2], [A2]) */
    val planB = plan0.select('_1 as "c0")
      .join(plan1.select('_1 as "c1"))
      .select('c1)
      .orderBy('c1.asc)
    assertResult(Array(Row("B1"), Row("B1"), Row("B2"), Row("B2")))(
      DataFrame(sqlc, planB).collect())

    /* Fails. Got: Array([A1], [A1], [A2], [A2]) */
    val planC = plan0.select()
      .join(plan1.select('_1 as "c1"))
      .select('c1)
      .orderBy('c1.asc)
    assertResult(Array(Row("B1"), Row("B1"), Row("B2"), Row("B2")))(
      DataFrame(sqlc, planC).collect())
{code}
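
For reference, the expected result of planB (and planC) can be reproduced without Spark. The following is only a sketch using plain Scala collections; the table contents are an assumption inferred from the planA assertion above, and the object and method names are hypothetical.

```scala
object ExpectedPlanB {
  // Assumed values of column _1 in each table, inferred from the planA
  // assertion; the original test data is not repeated in this comment.
  val tab0: Seq[String] = Seq("A1", "A2")
  val tab1: Seq[String] = Seq("B1", "B2")

  // Models: cross join of the two projections, then select c1, order by c1.
  def expected: Seq[String] = {
    val joined = for { c0 <- tab0; c1 <- tab1 } yield (c0, c1)
    joined.map { case (_, c1) => c1 }.sorted
  }

  def main(args: Array[String]): Unit =
    println(expected.mkString(", ")) // prints: B1, B1, B2, B2
}
```

Every left-side row should pair with every right-side row, so each value of c1 appears once per row of tab0. The failing plans return values from the left table instead.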

> Join with empty projection on one side produces invalid results
> ---------------------------------------------------------------
>
>                 Key: SPARK-6743
>                 URL: https://issues.apache.org/jira/browse/SPARK-6743
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Santiago M. Mola
>            Priority: Critical
>
> {code:java}
> val sqlContext = new SQLContext(sc)
> val tab0 = sc.parallelize(Seq(
>   (83, 0, 38),
>   (26, 0, 79),
>   (43, 81, 24)
> ))
> sqlContext.registerDataFrameAsTable(sqlContext.createDataFrame(tab0), "tab0")
> sqlContext.cacheTable("tab0")
> val df1 = sqlContext.sql(
>   "SELECT tab0._2, cor0._2 FROM tab0, tab0 cor0 GROUP BY tab0._2, cor0._2")
> val result1 = df1.collect()
> val df2 = sqlContext.sql("SELECT cor0._2 FROM tab0, tab0 cor0 GROUP BY cor0._2")
> val result2 = df2.collect()
> val df3 = sqlContext.sql("SELECT cor0._2 FROM tab0 cor0 GROUP BY cor0._2")
> val result3 = df3.collect()
> {code}
> Given the previous code, result2 equals Row(43), Row(83), Row(26), which 
> is wrong. These values correspond to cor0._1 instead of cor0._2. The correct 
> results are Row(0), Row(81), which is what the third query returns. The first 
> query also produces valid results; the only difference is that its projection 
> on the left side of the join is not empty.


