Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/21449
I see what you mean. Honestly, I have not thought through a full design for this
problem (so I can't say what we should and should not support), but focusing on
this specific case I think that:
- at the moment we do support self-joins (at least in the case
`df.join(df, df("id") >= df("id"))`, as sketched after this list), so treating
this as invalid would be a big behavior change (potentially breaking user
workflows).
- even though we might consider such a change acceptable in a major
release, I think the DataFrame API should support what the SQL API supports,
and the SQL standard supports self-joins (using aliases for the tables). So I
do believe we should support this use case.
- the case presented by @daniel-shields in
https://github.com/apache/spark/pull/21449#issuecomment-392947474 is, I think,
valid beyond any doubt. As of now, though, we do not support it.
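To make the first two points concrete, here is a minimal sketch of both forms: the self-join over the same `Dataset` object that works today, and the SQL-standard equivalent that disambiguates the two sides with table aliases (this assumes a local `SparkSession`; the view name `t` and aliases `a`/`b` are just for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 3).toDF("id")

// Self-join that works today: both sides are the same Dataset object.
val selfJoined = df.join(df, df("id") >= df("id"))

// SQL-standard form: the two sides are disambiguated via table aliases.
df.createOrReplaceTempView("t")
val sqlJoined = spark.sql("SELECT * FROM t a JOIN t b ON a.id >= b.id")
```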
So I think that in the holistic approach we shouldn't change the current
behavior, which this patch will (IMHO) improve.
What I do think we have to discuss, so that we don't have to change it once
we want to solve the more generic issue, is how to track the dataset an
attribute comes from. Here I decided to use the attribute's metadata, since I
thought this was the cleanest approach. Another approach might be to introduce
in `AttributeReference` a new `Option` holding a reference to the dataset it
comes from.
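As a rough illustration of the metadata-based approach (not necessarily what this patch does; the key `__dataset_id` and the id value are hypothetical), the public API already lets us attach metadata to an attribute and read it back from the schema:

```scala
import org.apache.spark.sql.types.MetadataBuilder

// Hypothetical key and stand-in id, chosen only for illustration.
val meta = new MetadataBuilder().putLong("__dataset_id", 42L).build()

// Column.as(alias, metadata) attaches the metadata to the output attribute.
val tagged = df.select(df("id").as("id", meta))

// The metadata survives in the schema, so the analyzer could read it back
// later to tell which Dataset the attribute originally came from.
tagged.schema("id").metadata.getLong("__dataset_id")  // => 42
```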
For the generic solution, the latter might have the advantage that, having a
reference to the provenance dataset, we could store in it some kind of DAG of
the datasets this one derives from, in order to make more complex decisions
about the validity of the syntax and/or the resolution of the attribute.
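A very rough sketch of what that could look like; these are hypothetical, simplified shapes, not Spark's actual `AttributeReference` or any existing class:

```scala
// Hypothetical: a lineage node recording which parent Datasets a Dataset was
// derived from, forming the provenance DAG mentioned above.
case class DatasetLineageNode(id: Long, parents: Seq[DatasetLineageNode])

// Hypothetical: an attribute carrying an optional reference to its provenance
// node, which the analyzer could walk to make resolution decisions.
case class AttributeRef(
    name: String,
    exprId: Long,
    origin: Option[DatasetLineageNode] = None)
```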
What do you think?