[
https://issues.apache.org/jira/browse/SPARK-13801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240828#comment-15240828
]
Takeshi Yamamuro edited comment on SPARK-13801 at 4/14/16 8:59 AM:
-------------------------------------------------------------------
It seems your example also resolves to the wrong references:
{code}
val res = f.join(s, f("b") === s("b") and f("c") === s("c"), "outer").cache
res.select(coalesce(f("b"), s("b")), coalesce(f("c"), s("c")), coalesce(f("d"),
s("d"))).explain
{code}
The output is:
{code}
== Physical Plan ==
Project [coalesce(b#163, b#163) AS coalesce(b, b)#300,coalesce(c#164, c#164) AS
coalesce(c, c)#301,coalesce(d#165, d#165) AS coalesce(d, d)#302]
+- InMemoryColumnarTableScan [b#163,c#164,d#165], InMemoryRelation
[a#162,b#163,c#164,d#165,a#207,b#208,c#209,d#210], true, 10000,
StorageLevel(disk=true, memory=true, offheap=false, deserialized=true,
replication=1),
SortMergeJoin [b#163,c#164], [b#208,c#209], FullOuter, None, None
{code}
That is, both arguments of each coalesce call refer to the same column.
BTW, I also see odd behaviour: the result differs between master and
v1.6.1.
The master output is:
{code}
+--------------+--------------+--------------+
|coalesce(b, b)|coalesce(c, c)|coalesce(d, d)|
+--------------+--------------+--------------+
| 0| 0| 0|
| 1| 1| 1|
+--------------+--------------+--------------+
{code}
The v1.6.1 output is:
{code}
|coalesce(b,b)|coalesce(c,c)|coalesce(d,d)|
+-------------+-------------+-------------+
| 1| 1| 1|
| null| null| null|
+-------------+-------------+-------------+
{code}
> DataFrame.col should return unresolved attribute
> ------------------------------------------------
>
> Key: SPARK-13801
> URL: https://issues.apache.org/jira/browse/SPARK-13801
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Wenchen Fan
>
> Recently I saw some JIRAs complaining about wrong results when using the
> DataFrame API. After checking their queries, I found they were caused by
> indirect self-joins that build wrong join conditions. For example:
> {code}
> val df = ...
> val df2 = df.filter(...)
> df.join(df2, (df("key") + 1) === df2("key"))
> {code}
> In this case, the confusing part is that df("key") and df2("key") refer to
> the same column, while df and df2 are different DataFrames.
> I think the biggest problem is that we give users the resolved attribute.
> However, a resolved attribute is not a real column, as a logical plan's output
> may change. For example, we generate new output for the right child in a
> self-join.
> My proposal is: `DataFrame.col` should always return an unresolved attribute.
> We can still do the resolution to make sure the given column name is
> resolvable, but instead of returning the resolved attribute, just take the name
> and wrap it in UnresolvedAttribute.
> Now if users run the example query I gave at the beginning, they will get an
> analysis exception, and they will understand that they need to alias df and df2
> before the join.
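
For reference, the aliasing workaround mentioned at the end of the description could look roughly like the sketch below. It assumes Spark 2.x or later (for `SparkSession`); the sample data, the `key > 1` filter, the alias names `l` and `r`, and the `AliasedSelfJoin` object are all made up for illustration and do not come from the issue.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AliasedSelfJoin {
  // Builds the indirect self-join from the description, but with both sides aliased.
  def aliasedSelfJoin(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Hypothetical sample data standing in for the elided `val df = ...`.
    val df = Seq(1, 2, 3).toDF("key")
    // Hypothetical filter standing in for the elided `df.filter(...)`.
    val df2 = df.filter(col("key") > 1)

    // Refer to each side through its alias instead of df("key") / df2("key"),
    // so the join condition names the left and right columns unambiguously.
    df.as("l").join(df2.as("r"), (col("l.key") + 1) === col("r.key"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("aliased-self-join")
      .getOrCreate()

    // With the sample data above, the matching (l.key, r.key) pairs are (1, 2) and (2, 3).
    aliasedSelfJoin(spark).show()

    spark.stop()
  }
}
```

Because the condition is written against the qualified names `l.key` and `r.key`, it does not depend on what `df("key")` and `df2("key")` happen to resolve to.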