cloud-fan commented on issue #24442: [SPARK-27547][SQL] fix DataFrame self-join problems
URL: https://github.com/apache/spark/pull/24442#issuecomment-486149813

The basic idea is the same: assign a globally unique id to each dataset, and carry the dataset id in the column reference (the `AttributeReference` returned by `Dataset.col`).

This PR does one more thing: it carries the dataset id in the logical plan of the dataset in case of a self-join. This makes the solution more powerful. #21449 can only resolve column references against the current datasets, e.g. `df1.join(df2, cond)`, while this PR supports more general cases like `df1.join(df2, cond).filter(...).select(df1("id"))`.

However, the hack in `Dataset.join` still has its value. For an equality condition, we can resolve the column reference even if it's ambiguous, e.g. `df1.join(df1, df1("id") === df1("id"))`. `df1("id")` is actually ambiguous, but it doesn't matter here because an equality condition is symmetric.
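
To make the idea concrete, here is a minimal toy model in plain Scala. It is not Spark code and not the implementation in this PR: `Ds`, `ColRef`, `Tagged`, `filtered`, `table`, `matches` and `resolve` are all hypothetical names invented for illustration. It assumes only what the comment describes: every dataset gets a unique id, a column reference remembers that id, and a join records the ids of its inputs in the plan so that a reference can still be matched to one side after further operations.

```scala
import java.util.concurrent.atomic.AtomicLong

// Toy model only: none of these types are Spark classes.
object SelfJoinSketch {
  private val nextId = new AtomicLong(0)

  // A column reference that remembers which dataset produced it.
  final case class ColRef(name: String, datasetId: Long)

  sealed trait Plan
  final case class Scan(columns: Set[String]) extends Plan
  final case class Filter(child: Plan) extends Plan
  final case class Join(left: Plan, right: Plan) extends Plan
  // Marks a join input with the id of the dataset it came from.
  final case class Tagged(child: Plan, datasetId: Long) extends Plan

  // Column names produced by a plan.
  def output(plan: Plan): Set[String] = plan match {
    case Scan(cols)       => cols
    case Filter(child)    => output(child)
    case Join(l, r)       => output(l) ++ output(r)
    case Tagged(child, _) => output(child)
  }

  // Paths ("L" or "R" per join level) to every tagged join input
  // whose dataset id and output match the reference.
  def matches(plan: Plan, ref: ColRef, path: String = ""): Seq[String] = plan match {
    case Tagged(child, id) if id == ref.datasetId && output(child).contains(ref.name) =>
      Seq(path)
    case Tagged(child, _) => matches(child, ref, path)
    case Filter(child)    => matches(child, ref, path)
    case Join(l, r)       => matches(l, ref, path + "L") ++ matches(r, ref, path + "R")
    case Scan(_)          => Seq.empty
  }

  final case class Ds(plan: Plan, id: Long) {
    // Mirrors df("name"): the reference carries this dataset's id.
    def apply(name: String): ColRef = ColRef(name, id)

    // Stand-in for any transformation; the result is a new dataset with a new id.
    def filtered: Ds = Ds(Filter(plan), nextId.incrementAndGet())

    // The join tags each input with the id of the dataset it came from.
    def join(other: Ds): Ds =
      Ds(Join(Tagged(plan, id), Tagged(other.plan, other.id)), nextId.incrementAndGet())

    // Right(path) when exactly one join input matches, Left(reason) otherwise.
    def resolve(ref: ColRef): Either[String, String] = matches(plan, ref) match {
      case Seq(path) => Right(path)
      case Seq()     => Left(s"cannot resolve ${ref.name}")
      case _         => Left(s"${ref.name} is ambiguous")
    }
  }

  def table(columns: String*): Ds = Ds(Scan(columns.toSet), nextId.incrementAndGet())
}
```

In this sketch, with `val df1 = table("id", "v")`, `val df2 = df1.filtered` and `val joined = df1.join(df2).filtered`, the call `joined.resolve(df1("id"))` returns `Right("L")` and `joined.resolve(df2("id"))` returns `Right("R")`, even though both sides expose the same column and a filter sits on top of the join. For `df1.join(df1)`, `resolve(df1("id"))` returns `Left("id is ambiguous")`, which is the case where the equality-condition hack in `Dataset.join` remains useful.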
