[jira] [Commented] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations

Nattavut Sutyanyong (JIRA) Wed, 28 Sep 2016 10:46:43 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15530352#comment-15530352
 ]


Nattavut Sutyanyong commented on SPARK-17154:
---------------------------------------------

Strictly speaking, "df.filter(df("a") > 0)" generates a new Dataset object 
(let's name it "df1"), and the second call to "filter" is invoked on "df1", 
so any reference to "df" in that second call should be disallowed.

Anyhow, if we want to support this case, we need to give the Dataset object 
"df" a unique name. Perhaps your idea of adding a SubqueryAlias to "df" can be 
used. With this, the mechanism for resolving df("b") would be:

1. Look up the unique name associated with the Dataset object "df" (say the 
result is DS#01).
2. Search for the name DS#01 in a SubqueryAlias operator in the current Dataset 
object "df1" (or "this").
3. Resolve the ExprId of the symbol "b" from the columns in the SubqueryAlias 
found in step 2.

This is just a rough sketch of my idea. I have not coded it yet.
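
For illustration only, the three steps could be modeled in plain Scala as 
below. Every name here (Attr, Alias, Ds, resolve) is hypothetical, the ids are 
made up, and none of it is Spark code; it only shows the lookup order, assuming 
each Dataset carries a unique name that a SubqueryAlias-like node in its plan 
also carries.

```scala
// Hypothetical model, not Spark's API: every name below is invented for illustration.
final case class Attr(name: String, exprId: Long)

// Stand-in for a SubqueryAlias node: a unique dataset name plus the columns it exposes.
final case class Alias(datasetName: String, output: Seq[Attr])

// Stand-in for a Dataset: its own unique name and the aliases reachable in its plan.
final case class Ds(uniqueName: String, aliases: Seq[Alias])

// Resolve `origin(colName)` against the plan of `current` using the three steps.
def resolve(origin: Ds, colName: String, current: Ds): Either[String, Attr] = {
  // Step 1: look up the unique name of the Dataset the column was taken from.
  val name = origin.uniqueName
  for {
    // Step 2: find the alias carrying that name inside the current Dataset's plan.
    alias <- current.aliases.find(_.datasetName == name)
      .toRight(s"no alias named $name in ${current.uniqueName}")
    // Step 3: resolve the column's ExprId from the output of that alias.
    attr <- alias.output.find(_.name == colName)
      .toRight(s"column $colName not found under $name")
  } yield attr
}

// Assumed example data: "df" is named DS#01 and exposes columns a and b.
val dfAlias = Alias("DS#01", Seq(Attr("a", 1L), Attr("b", 2L)))
val df = Ds("DS#01", Seq(dfAlias))
// df1 models df.filter(df("a") > 0): a new Dataset whose plan still contains df's alias.
val df1 = Ds("DS#02", Seq(dfAlias))

// Resolving df("b") from within df1 finds the attribute through the DS#01 alias.
val resolvedB = resolve(df, "b", df1)
```

Returning an Either keeps the two failure cases (no matching alias, no matching 
column) distinguishable, which is roughly where an AnalysisException would be 
raised in a real implementation.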

Your thoughts please.

> Wrong result can be returned or AnalysisException can be thrown after 
> self-join or similar operations
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17154
>                 URL: https://issues.apache.org/jira/browse/SPARK-17154
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Kousuke Saruta
>         Attachments: Name-conflicts-2.pdf, Solution_Proposal_SPARK-17154.pdf
>
>
> When we join two DataFrames that originate from the same DataFrame, 
> operations on the joined DataFrame can fail.
> One reproducible example is as follows.
> {code}
> val df = Seq(
>   (1, "a", "A"),
>   (2, "b", "B"),
>   (3, "c", "C"),
>   (4, "d", "D"),
>   (5, "e", "E")).toDF("col1", "col2", "col3")
> val filtered = df.filter("col1 != 3").select("col1", "col2")
> val joined = filtered.join(df, filtered("col1") === df("col1"), "inner")
> val selected1 = joined.select(df("col3"))
> {code}
> In this case, an AnalysisException is thrown.
> Another example is as follows.
> {code}
> val df = Seq(
>   (1, "a", "A"),
>   (2, "b", "B"),
>   (3, "c", "C"),
>   (4, "d", "D"),
>   (5, "e", "E")).toDF("col1", "col2", "col3")
> val filtered = df.filter("col1 != 3").select("col1", "col2")
> val rightOuterJoined = filtered.join(df, filtered("col1") === df("col1"), "right")
> val selected2 = rightOuterJoined.select(df("col1"))
> selected2.show
> {code}
> In this case, we expect to get the following answer.
> {code}
> 1
> 2
> 3
> 4
> 5
> {code}
> But the actual result is as follows.
> {code}
> 1
> 2
> null
> 4
> 5
> {code}
> The cause of the problems in both examples is that the logical plan of the 
> right-side DataFrame and the expressions in its output are re-created in 
> the analyzer (by the ResolveReferences rule) when a DataFrame has expressions 
> that share the same exprId.
> The re-created expressions are equal to the original ones except for exprId.
> This happens when we do a self-join or a similar pattern of operations.
> In the first example, df("col3") returns a Column that contains an 
> expression, and that expression has an exprId (say id1).
> After the join, the expressions of the right-side DataFrame (df) are 
> re-created; the old and new expressions are equal except that the exprId is 
> renewed (say id2 for the new exprId).
> Because of the mismatch between those exprIds, AnalysisException is thrown.
> In the second example, df("col1") returns a Column, and the expression 
> contained in it is assigned an exprId (say id3).
> The Column returned by filtered("col1") likewise has an expression 
> with the same exprId (id3).
> After the join, the expressions of the right-side DataFrame are re-created, 
> so the expression assigned id3 is no longer present on the right side but is 
> still present on the left side.
> So, when df("col1") is referenced on the joined DataFrame, we get col1 of 
> the left side, which includes null.
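
The exprId behaviour described in the issue can be modeled in plain Scala, 
without Spark. This is only a sketch: Attr, ColumnRef, renew and lookup are 
hypothetical names, the numeric ids are made up, and looking a column up purely 
by exprId is a simplifying assumption about what the analyzer does.

```scala
// Hypothetical model of the behaviour described in the issue; none of this is Spark code.
final case class Attr(name: String, exprId: Long)

// A Column captured from a DataFrame remembers only the exprId it saw at that time.
final case class ColumnRef(name: String, exprId: Long)

// Mimic the analyzer re-creating the right side of a self-join: same names, fresh exprIds.
def renew(attrs: Seq[Attr], firstFreshId: Long): Seq[Attr] =
  attrs.zipWithIndex.map { case (a, i) => a.copy(exprId = firstFreshId + i) }

// Look a captured column up in the join output purely by exprId.
def lookup(ref: ColumnRef, joinOutput: Seq[(String, Attr)]): Option[(String, Attr)] =
  joinOutput.find { case (_, attr) => attr.exprId == ref.exprId }

// Assumed ids for df's three columns.
val dfOutput = Seq(Attr("col1", 1L), Attr("col2", 2L), Attr("col3", 3L))
// filtered = df.filter(...).select("col1", "col2") keeps df's exprIds for those columns.
val filteredOutput = dfOutput.take(2)
// The right side (df) is re-created with fresh ids when joined with filtered.
val rightOutput = renew(dfOutput, firstFreshId = 10L)

// Join output, with each attribute tagged by the side it comes from.
val joinOutput: Seq[(String, Attr)] =
  filteredOutput.map("left" -> _) ++ rightOutput.map("right" -> _)

// First example: df("col3") carries exprId 3, which no longer exists after the join.
val col3 = lookup(ColumnRef("col3", 3L), joinOutput)
// Second example: df("col1") carries exprId 1, which now only matches the left side.
val col1 = lookup(ColumnRef("col1", 1L), joinOutput)
```

In this model the first lookup finds nothing, which corresponds to the 
AnalysisException, and the second lookup silently binds to the left side, which 
corresponds to the unexpected null in the right outer join.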



