Github user nsyca commented on the issue:
https://github.com/apache/spark/pull/14719
@sarutak would your code be able to resolve this ambiguity of `df("a")` in the
join condition?
````
val df = Seq((1,0), (2,1)).toDF("a","b")
val df1 = df.filter(df("b") > 0)
val result = df.filter(df("a") > 0)
  .join(df1, df("a") === df1("a"), "left")
  .select(df1("a"))
````
Here is my understanding.
For the first call,
`df.filter(df("a") > 0) ... [1]`
Spark implicitly creates a new Dataset. So when it tries to resolve the
column `df("a")` in the argument of the join, the Dataset named `df` is not the
caller of the join; `df` is actually embedded in the new, unnamed Dataset
created in `[1]`. What we need here is a scheme to record where one Dataset is
embedded in another.
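As a concrete illustration, here is a minimal sketch (assuming a local SparkSession named `spark`; the names are for illustration only and are not part of this PR) showing that `filter` returns a new, unnamed Dataset whose logical plan contains `df`'s plan as a child:
````
// Minimal sketch, assuming a local SparkSession; not part of the PR itself.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("embedding-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, 0), (2, 1)).toDF("a", "b")
val unnamed = df.filter(df("a") > 0)   // the unnamed Dataset called [1] above

// The two Datasets are distinct objects, but [1]'s logical plan wraps
// df's plan as a child node, i.e. df is "embedded" in [1].
println(df.queryExecution.logical)
println(unnamed.queryExecution.logical)
````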
Using the above example, we can draw a tree representing the embedded
structure of the Dataset `result`:
````
       (result)
        select
           |
          [2]
          join
         /    \
      [1]     (df1)
    filter    filter
       |         |
     (df)      (df)
````
where `[1]` is the unnamed Dataset mentioned above and `[2]` is another
unnamed Dataset:
`df.filter(df("a") > 0).join(df1, df("a") === df1("a"), "left") ... [2]`
When we try to resolve `df("a")` within `[1]`, there is only one `df` under
`[1]`, so the resolution is not ambiguous. The problem is that when we try to
resolve `df("a")` in the argument of the join operator, even with the
embedded tree structure, how do we distinguish the `df("a")` under `[1]`
from the one under `df1`?
Your breadth-first-search walk may hit the correct `df` first, but it should
continue looking for a second `df` at the same level and, if it finds one,
raise an exception.
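For what it is worth, here is a rough sketch of that level-by-level walk over a hypothetical tree of embedded Datasets. `PlanNode`, `datasetId`, and `resolveDataset` are made-up names for illustration only and do not reflect Spark's actual resolution code:
````
// Hypothetical sketch of the breadth-first walk described above; not Spark internals.
case class PlanNode(datasetId: Long, children: Seq[PlanNode] = Nil)

// Walk the tree level by level. If more than one node on the same level
// carries the Dataset id we are resolving against, the reference is
// ambiguous, so raise instead of silently picking the first match.
def resolveDataset(root: PlanNode, targetId: Long): PlanNode = {
  var level: Seq[PlanNode] = Seq(root)
  while (level.nonEmpty) {
    val matches = level.filter(_.datasetId == targetId)
    if (matches.size > 1)
      throw new IllegalStateException(
        s"Reference to Dataset $targetId is ambiguous: ${matches.size} candidates at the same level")
    if (matches.size == 1) return matches.head
    level = level.flatMap(_.children)
  }
  throw new NoSuchElementException(s"Dataset $targetId not found in the plan tree")
}

// Example mirroring the tree above: [2] = join([1] -> df, df1 -> df)
val dfNode  = PlanNode(datasetId = 1)
val node1   = PlanNode(datasetId = 10, children = Seq(dfNode))          // [1]
val df1Node = PlanNode(datasetId = 11, children = Seq(dfNode))          // df1
val node2   = PlanNode(datasetId = 20, children = Seq(node1, df1Node))  // [2]

resolveDataset(node2, 1)  // throws: both children of [2] embed Dataset 1 at the same level
````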
An interesting test scenario to verify this would be the one below:
````
val result = df1.join(df.filter(df("a") > 0), df("a") === df1("a"), "right")
  .select(df1("a"))
````