Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/21449
I see what you mean. Honestly, I have not thought through a full design for this
problem (so I can't say what we should and should not support), but focusing on
this specific case I think that:
- at the moment we do support self-joins (at least in the case
`df.join(df, df("id") >= df("id"))`, as sketched after this list), so treating
this as invalid would be a big behavior change (potentially breaking user
workflows).
- even though we might consider such a change acceptable in a major
release, I think the DataFrame API should support what the SQL API supports,
and the SQL standard supports self-joins (using aliases for the tables). So I
do believe we should support this use case.
- the case presented by @daniel-shields in
https://github.com/apache/spark/pull/21449#issuecomment-392947474 is, I think,
valid beyond any doubt. As of now, though, we do not support it.
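To make the first two points concrete, here is a minimal sketch of both forms: the self-join over the same `Dataset` object that works today, and the SQL-standard equivalent that disambiguates the two sides with table aliases (this assumes a local `SparkSession`; the view name `t` and aliases `a`/`b` are just for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 3).toDF("id")

// Self-join that works today: both sides are the same Dataset object.
val selfJoined = df.join(df, df("id") >= df("id"))

// SQL-standard form: the two sides are disambiguated via table aliases.
df.createOrReplaceTempView("t")
val sqlJoined = spark.sql("SELECT * FROM t a JOIN t b ON a.id >= b.id")
```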
So I think that in the holistic approach we shouldn't change the current
behavior, which this patch will (IMHO) improve.
What I do think we have to discuss, so that we don't have to change it once
we want to solve the more generic issue, is how to track the dataset an
attribute comes from. Here I decided to use the attribute's metadata, since I
thought this was the cleanest approach. Another approach might be to introduce
in `AttributeReference` a new `Option` holding a reference to the dataset it
comes from.
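As a rough illustration of the metadata-based approach (not necessarily what this patch does; the key `__dataset_id` and the id value are hypothetical), the public API already lets us attach metadata to an attribute and read it back from the schema:

```scala
import org.apache.spark.sql.types.MetadataBuilder

// Hypothetical key and stand-in id, chosen only for illustration.
val meta = new MetadataBuilder().putLong("__dataset_id", 42L).build()

// Column.as(alias, metadata) attaches the metadata to the output attribute.
val tagged = df.select(df("id").as("id", meta))

// The metadata survives in the schema, so the analyzer could read it back
// later to tell which Dataset the attribute originally came from.
tagged.schema("id").metadata.getLong("__dataset_id")  // => 42
```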
For the generic solution, the latter might have the advantage that, having a
reference to the provenance dataset, we could store in it some kind of DAG of
the datasets this one derives from, in order to make more complex decisions
about the validity of the syntax and/or the resolution of the attribute.
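A very rough sketch of what that could look like; these are hypothetical, simplified shapes, not Spark's actual `AttributeReference` or any existing class:

```scala
// Hypothetical: a lineage node recording which parent Datasets a Dataset was
// derived from, forming the provenance DAG mentioned above.
case class DatasetLineageNode(id: Long, parents: Seq[DatasetLineageNode])

// Hypothetical: an attribute carrying an optional reference to its provenance
// node, which the analyzer could walk to make resolution decisions.
case class AttributeRef(
    name: String,
    exprId: Long,
    origin: Option[DatasetLineageNode] = None)
```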
What do you think?