GitHub user marmbrus opened a pull request:

    https://github.com/apache/spark/pull/5217

    [SPARK-6550][SQL] Use analyzed plan in DataFrame

    This is based on the bug and test case proposed by @viirya.  See #5203 for 
an excellent description of the problem.
    
    TL;DR: The problem occurs because `groupBy(String)` calls `resolve`, 
which returns an `AttributeReference`.  However, that `AttributeReference` 
is based on an analyzed plan that is then thrown away.  At execution time, 
we analyze the plan again.  In the case of self-joins, each call to analyze 
produces a new tree for the left side of the join, which renders the 
previously returned `AttributeReference` invalid.
    
    As a fix, I propose that a `DataFrame` keep the analyzed plan instead of 
the unanalyzed logical plan.
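
    The failure mode can be sketched with a toy model.  This is not Spark
code: `analyze`, `LazyFrame`, and `AnalyzedFrame` are hypothetical names
invented for illustration, and the fresh-id-per-analysis behaviour is an
assumption that stands in for the new tree built on each analysis of a
self-join.

```python
import itertools
from dataclasses import dataclass

# Hypothetical stand-in for the expression ids attached to attributes.
_next_id = itertools.count()


@dataclass(frozen=True)
class AttributeReference:
    name: str
    expr_id: int


def analyze(logical_plan):
    """Toy analyzer: every call builds new attributes with fresh ids."""
    return tuple(AttributeReference(name, next(_next_id)) for name in logical_plan)


def find(analyzed_plan, name):
    """Look up an attribute by name in an already analyzed plan."""
    return next(attr for attr in analyzed_plan if attr.name == name)


class LazyFrame:
    """Keeps the unanalyzed plan and re-analyzes it on every use."""

    def __init__(self, logical_plan):
        self.logical_plan = logical_plan

    def resolve(self, name):
        # The analyzed plan built here is discarded after the lookup.
        return find(analyze(self.logical_plan), name)

    def plan_for_execution(self):
        # A second, independent analysis: the ids differ from the ones above.
        return analyze(self.logical_plan)


class AnalyzedFrame:
    """Analyzes once and keeps the result, as the fix proposes."""

    def __init__(self, logical_plan):
        self.analyzed_plan = analyze(logical_plan)

    def resolve(self, name):
        return find(self.analyzed_plan, name)

    def plan_for_execution(self):
        return self.analyzed_plan


lazy = LazyFrame(("key", "value"))
stale = lazy.resolve("key")
stale_is_valid = stale in lazy.plan_for_execution()

kept = AnalyzedFrame(("key", "value"))
fresh = kept.resolve("key")
fresh_is_valid = fresh in kept.plan_for_execution()
```

    In this sketch the reference returned by `LazyFrame.resolve` no longer
appears in the plan used for execution, while the one returned by
`AnalyzedFrame.resolve` does, because both come from the same analysis.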

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/marmbrus/spark preanalyzer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5217.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5217
    
----
commit 089c52e5b5fc44e1f75b9156146ce649317e2375
Author: Michael Armbrust <[email protected]>
Date:   2015-03-26T19:13:55Z

    WIP

commit dd4dec1194272c84a71095f889e529d0a7970f65
Author: Michael Armbrust <[email protected]>
Date:   2015-03-26T21:14:10Z

    Use the analyzed plan in DataFrame

----


