Josh Rosen created SPARK-17120:
----------------------------------
Summary: Analyzer incorrectly optimizes plan to empty LocalRelation
Key: SPARK-17120
URL: https://issues.apache.org/jira/browse/SPARK-17120
Project: Spark
Issue Type: Bug
Affects Versions: 2.1.0
Reporter: Josh Rosen
Priority: Blocker
Consider the following query:
{code}
sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3")
sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
println(sql("""
  SELECT *
  FROM (
    SELECT COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
    FROM table_3 t1
    LEFT JOIN table_4 t2 ON false
  ) t
  WHERE (t.int_col) IS NOT NULL
""").collect().toSeq)
{code}
In the innermost query, the LEFT JOIN's condition is {{false}}, but a left outer
join must still emit every row of {{table_3}} (which is non-empty), padded with
{{null}} for {{table_4}}'s columns. {{COALESCE}} therefore falls back to
{{t1.int_col_6}}, so no output values are {{null}} and the outer {{where}} should
retain all rows. The overall result of this query should thus be a single row
with the value '97'.
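The expected semantics can be sketched outside Spark. This is a minimal plain-Python model (not Spark code; {{left_join}} and {{coalesce}} are hypothetical helpers written for illustration) showing why the false join condition still yields one output row:

{code}
# A LEFT JOIN whose condition is always false still emits every left row,
# padded with None (NULL) on the right; COALESCE then falls back to the
# left column, and the IS NOT NULL filter keeps every row.

def left_join(left, right, on):
    out = []
    for l in left:
        matched = [(l, r) for r in right if on(l, r)]
        # Unmatched left rows are padded with NULL, never dropped.
        out.extend(matched if matched else [(l, None)])
    return out

def coalesce(*vals):
    # First non-NULL argument, as in SQL COALESCE.
    return next((v for v in vals if v is not None), None)

table_3 = [97]  # int_col_6
table_4 = [0]   # int_col_1

joined = left_join(table_3, table_4, on=lambda l, r: False)
int_col = [coalesce(r, l) for (l, r) in joined]   # COALESCE(t2.int_col_1, t1.int_col_6)
result = [v for v in int_col if v is not None]    # WHERE int_col IS NOT NULL
print(result)  # expected: [97]
{code}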
Instead, the current Spark master (as of commit
12a89e55cbd630fa2986da984e066cd07d3bf1f7, at least) returns no rows. Looking at
{{explain}}, the logical plan is being optimized to {{LocalRelation <empty>}},
so Spark doesn't even run the query. My suspicion is that there's a bug in
constraint propagation or filter pushdown.
This issue doesn't seem to affect Spark 2.0, so I think it's a regression in
master.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)