[GitHub] [spark] cloud-fan commented on issue #24442: [SPARK-27547][SQL] fix DataFrame self-join problems

GitBox Wed, 24 Apr 2019 02:47:33 -0700

cloud-fan commented on issue #24442: [SPARK-27547][SQL] fix DataFrame self-join 
problems
URL: https://github.com/apache/spark/pull/24442#issuecomment-486149813
 
 
   The basic idea is the same: assign a globally unique id to each Dataset, 
and carry that id in the column reference (the `AttributeReference` returned 
by `Dataset.col`).
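
   A minimal sketch of that idea, as a toy model rather than Spark's actual 
classes. `ToyDataset`, `ColumnRef` and the id counter are hypothetical names 
made up for illustration; the real change works on `Dataset` and 
`AttributeReference`.

```scala
import java.util.concurrent.atomic.AtomicLong

// Toy model only: none of these names exist in Spark.
object DatasetIdSketch {
  // Process-wide counter, so every dataset instance gets a distinct id.
  private val nextId = new AtomicLong(0L)

  // Stand-in for AttributeReference; datasetId records which dataset produced it.
  final case class ColumnRef(name: String, datasetId: Option[Long])

  // Stand-in for Dataset.
  final class ToyDataset(val columns: Seq[String]) {
    val id: Long = nextId.getAndIncrement()

    // Analogue of Dataset.col: the returned reference carries this dataset's id.
    def col(name: String): ColumnRef = {
      require(columns.contains(name), s"no such column: $name")
      ColumnRef(name, Some(id))
    }
  }
}
```

   With the tag attached, two references to a column named `id` taken from 
different datasets are no longer equal, even though the names match.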
   
   This PR does one more thing: in the self-join case, it also carries the 
dataset id in the dataset's logical plan. This makes the solution more 
powerful. #21449 can only resolve column references against the datasets 
being joined, e.g. `df1.join(df2, cond)`, while this PR supports more general 
cases like `df1.join(df2, cond).filter(...).select(df1("id"))`.
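
   To illustrate why tagging the plan helps, here is a toy resolver. The 
plan nodes and `resolve` are hypothetical, and tagging every leaf is a 
simplification of what the PR does. Because the id sits in the plan tree, a 
reference can be matched to its source even after further operators are 
stacked on top of the join.

```scala
// Toy model only: Spark's real LogicalPlan and resolution rules are far richer.
object PlanTagSketch {
  sealed trait Plan
  // A leaf produced by a dataset, tagged with that dataset's id.
  // output maps a column name to a unique attribute id.
  final case class Tagged(datasetId: Long, output: Map[String, Int]) extends Plan
  final case class Join(left: Plan, right: Plan) extends Plan
  final case class Filter(child: Plan) extends Plan

  // A column reference that remembers which dataset it came from.
  final case class ColumnRef(name: String, datasetId: Long)

  // Walk the whole tree and collect attribute ids matching both name and dataset id.
  def resolve(plan: Plan, ref: ColumnRef): List[Int] = plan match {
    case Tagged(id, out) if id == ref.datasetId => out.get(ref.name).toList
    case Tagged(_, _)                           => Nil
    case Join(l, r)                             => resolve(l, ref) ++ resolve(r, ref)
    case Filter(c)                              => resolve(c, ref)
  }

  // Shape of df1.join(df2, cond).filter(...): both inputs expose a column "id".
  val df1: Tagged = Tagged(1L, Map("id" -> 10))
  val df2: Tagged = Tagged(2L, Map("id" -> 20))
  val plan: Plan = Filter(Join(df1, df2))
}
```

   In this sketch `resolve(plan, ColumnRef("id", 1L))` finds only the 
attribute from `df1`, while matching by name alone would hit both join inputs.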
   
   However, the hack in `Dataset.join` still has its value. For an equality 
condition, we can resolve the column reference even if it's ambiguous, e.g. 
`df1.join(df1, df1("id") === df1("id"))`. `df1("id")` is actually ambiguous, 
but it doesn't matter here because an equality condition is symmetric.
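
   A toy version of that kind of rewrite, to show why symmetry makes the 
ambiguity harmless. `Attr`, `EqualTo` and `rewriteTrivialEquality` are 
hypothetical names for this sketch, not the code in `Dataset.join`.

```scala
// Toy model only: exprId stands in for Spark's expression ids.
object SymmetricJoinSketch {
  final case class Attr(name: String, exprId: Int)
  final case class EqualTo(left: Attr, right: Attr)

  // If both sides of the equality are the same attribute, point one side at
  // each join input. Which side goes where is an arbitrary choice, and that is
  // fine because a === b selects the same rows as b === a.
  def rewriteTrivialEquality(
      cond: EqualTo,
      leftOutput: Seq[Attr],
      rightOutput: Seq[Attr]): EqualTo =
    if (cond.left == cond.right) {
      val l = leftOutput.find(_.name == cond.left.name)
      val r = rightOutput.find(_.name == cond.right.name)
      (l, r) match {
        case (Some(a), Some(b)) => EqualTo(a, b)
        case _                  => cond // leave the condition untouched if either side is missing
      }
    } else cond
}
```

   An asymmetric condition such as `<` could not be handled this way, since 
swapping the two sides would change the result.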



