GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/20276
[SPARK-14948][SQL] disambiguate attributes in join condition
## What changes were proposed in this pull request?
`Dataset.col/apply` returns a column reference, which is pretty useful to
deal with duplicated names in join. e.g.
```
val df1 = ... // [a: int, b: int]
val df2 = ...// [b: int, c: int]
df1.join(df2, df1("b") === df2("b"))
df1.join(df2).drop(df2("b"))
...
```
However, this is problematic for self-join, or joining DataFrames derived
from the same DataFrame. The reason is that, the column reference returned by
`Dataset.col` is actually `AttributeReference`, which means different
DataFrames may return same column reference. After join, the right side would
be de-duplicated if it has conflicting attributes with the left side, and the
column reference returned by right side would be missing after join, or be
wrong and refers to columns from the left side.
To fix this issue entirely, we may need to define a real column reference
that is globally uique, and design a dataframe lineage mechanism so that we can
use column reference from another dataframe in a dataframe operation, e.g.
```
val df3 = df1.join(df2)
df3.drop(df2("b"))
```
This is a lot of work and is too late for 2.3, here I propose a simple and
safe solution to disambiguate attributes in join condition only, which is the
most common problematic case.
The idea is simple, we assign a globally unique id to each dataframe, via
`AnalysisBarrier`. `Dataset.col` returns a special attribute that carries the
id of dataframe it comes from. This special attribute is mostly a no-op and
will be removed during resolution. It's only used when we are de-duplicating
the join right side plan, these special attributes inside join condition would
be replaced by the new attributes generated by the right side plan.
## How was this patch tested?
new regression test
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark join-bug
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20276.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20276
----
commit dd36ffb520b79c54e3efad9e79a88b3baf4fc985
Author: Wenchen Fan <wenchen@...>
Date: 2018-01-16T12:26:41Z
disambiguate attributes in join condition
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]