Github user nsyca commented on the issue:

    https://github.com/apache/spark/pull/14899
  
    @hvanhovell, thanks for your feedback.
    
    I thought about narrowing the scope of my PR to just the subquery alias 
context. That would solve the problem raised in this PR. At first, I hesitated 
to fix it at this fundamental level, but then I stepped back and considered the 
bigger picture, from a semantic-representation point of view. If a SQL 
statement (or an equivalent expressed in the DataFrame/Dataset APIs) references 
a relation more than once (say, twice), we need a way to tell the two 
references to the same relation apart. Spark does this by associating a unique 
(long-integer) identifier with each column/expression of a relation. Correct me 
if I'm wrong, but the implementation is the `exprId` value of the `case class 
Alias`, which is populated by calling `newExprId`, which increments a counter 
by 1 each time it is called. With this implementation, theoretically speaking, 
for any column/expression at any level of a logical plan, we should be able to 
identify which stream the column comes from (think of a relation as a vertex 
and a reference to a relation as an edge: two edges can leave the same vertex, 
and at the end of both edges sits the same column, say C1, but one arrives via 
the first stream, or edge, and the other via the second). This is the thinking 
behind my PR. So limiting the scope to just the subquery alias may leave holes 
for future problems as rewrite and optimization rules get fancier. We simply 
cannot guarantee that two parts of a SQL statement that look unmergeable at the 
Analysis phase will never be merged into the same subquery block by new rules, 
at which point the name-collision problem will resurface.
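    The unique-identifier scheme described above can be sketched in plain, 
dependency-free Scala. To be clear, `ExprId`, `newExprId`, `Attribute`, and 
`realias` below loosely mirror Spark's names but are simplified stand-ins for 
illustration, not Spark's actual classes:

```scala
import java.util.concurrent.atomic.AtomicLong

// Simplified sketch (not Spark's real implementation) of per-reference
// unique identifiers for columns/expressions.
object ExprIdSketch {
  final case class ExprId(id: Long)

  private val curId = new AtomicLong(0L)

  // Each call hands out a fresh, monotonically increasing id.
  def newExprId(): ExprId = ExprId(curId.getAndIncrement())

  // A column reference: its name plus the id that disambiguates it.
  final case class Attribute(name: String, exprId: ExprId)

  // Re-aliasing (as an Alias-style node would) assigns a fresh exprId,
  // which is how two "streams" flowing from the same relation (two edges
  // leaving the same vertex) stay distinguishable.
  def realias(a: Attribute): Attribute = a.copy(exprId = newExprId())
}
```

    With this sketch, two references to the same column C1 share a name but 
carry different ids, so each end of the two edges can be traced back to its 
own stream.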
    
    In fact, I am keen to remove the `dedupRight` code completely, as I view 
it as a hacky way of resolving name collisions. I can't, because one of the 
test cases fails without it. I believe it is a test case that expresses a 
(self-)join on the same relation using the DataFrame APIs written in Scala. I 
have never been able to reproduce it using SQL directly. I think that is 
because SQL always constructs a `Project` operator on top of a relation, and 
hence my PR covers that path.
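    The self-join situation that keeps `dedupRight` around can also be 
sketched in plain Scala. Again, `Relation`, `naiveJoinOutput`, and 
`dedupRight` here are simplified, hypothetical stand-ins for illustration, 
not Spark's real classes:

```scala
object SelfJoinSketch {
  final case class Attribute(name: String, exprId: Long)
  final case class Relation(output: Seq[Attribute])

  private var curId = 0L
  private def newExprId(): Long = { curId += 1; curId }

  def relation(names: String*): Relation =
    Relation(names.map(n => Attribute(n, newExprId())))

  // A DataFrame-style self-join reuses the same Relation object on both
  // sides, so joining it to itself naively yields colliding exprIds.
  def naiveJoinOutput(left: Relation, right: Relation): Seq[Attribute] =
    left.output ++ right.output

  // What a dedupRight-style rewrite effectively does: replace the right
  // side's attributes with fresh exprIds before joining, so the two
  // references to the same relation become distinguishable again.
  def dedupRight(r: Relation): Relation =
    Relation(r.output.map(a => a.copy(exprId = newExprId())))
}
```

    Deduplicating the right side before joining restores unique ids in the 
join output, which is the collision the failing DataFrame self-join test case 
would otherwise hit.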
    