GitHub user marmbrus opened a pull request:
https://github.com/apache/spark/pull/5217
[SPARK-6550][SQL] Use analyzed plan in DataFrame
This is based on the bug and test case proposed by @viirya. See #5203 for an
excellent description of the problem.
TL;DR: The problem occurs because the function `groupBy(String)` calls
`resolve`, which returns an `AttributeReference`. However, this
`AttributeReference` is based on an analyzed plan that is then thrown away. At
execution time, we analyze the plan again. In the case of self-joins, each
call to analyze produces a new tree for the left side of the join, rendering
the previously returned `AttributeReference` invalid.
As a fix, I propose we keep the analyzed plan instead of the unanalyzed
logical plan inside the DataFrame.
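
To make the failure mode concrete, here is a minimal toy model in Python. It is
not Spark code: `ToyAnalyzer`, `ReanalyzingFrame`, and `PreanalyzedFrame` are
hypothetical names invented for this sketch, and the `AttributeReference` class
only borrows the name of the real one. The sketch assumes that every analysis
pass hands out fresh expression IDs, which is a simplification of what happens
to the left side of a self-join.

```python
import itertools
from dataclasses import dataclass

# Global counter standing in for fresh expression IDs (an assumption of this sketch).
_next_expr_id = itertools.count(1)


@dataclass(frozen=True)
class AttributeReference:
    """Toy stand-in: a column name plus the ID assigned during analysis."""
    name: str
    expr_id: int


class ToyAnalyzer:
    """Hypothetical analyzer: every call assigns new IDs to all columns."""

    def analyze(self, column_names):
        return [AttributeReference(n, next(_next_expr_id)) for n in column_names]


def resolve(analyzed_output, name):
    """Look up a column by name in an already analyzed output list."""
    for attr in analyzed_output:
        if attr.name == name:
            return attr
    raise KeyError(name)


class ReanalyzingFrame:
    """Keeps only the unanalyzed column list and analyzes again on each use."""

    def __init__(self, column_names, analyzer):
        self.column_names = column_names
        self.analyzer = analyzer

    def resolve(self, name):
        # The analyzed output is built here and then thrown away.
        return resolve(self.analyzer.analyze(self.column_names), name)

    def can_execute_group_by(self, attr):
        # Execution analyzes once more, so earlier references no longer match.
        return attr in self.analyzer.analyze(self.column_names)


class PreanalyzedFrame:
    """Analyzes once in the constructor and keeps the analyzed output."""

    def __init__(self, column_names, analyzer):
        self.analyzed = analyzer.analyze(column_names)

    def resolve(self, name):
        return resolve(self.analyzed, name)

    def can_execute_group_by(self, attr):
        # The reference was resolved against the same analyzed output.
        return attr in self.analyzed


analyzer = ToyAnalyzer()
stale = ReanalyzingFrame(["key", "value"], analyzer)
kept = PreanalyzedFrame(["key", "value"], analyzer)
print(stale.can_execute_group_by(stale.resolve("key")))  # False
print(kept.can_execute_group_by(kept.resolve("key")))    # True
```

In the toy model, the reference returned by `ReanalyzingFrame.resolve` never
matches the output of a later analysis pass, while `PreanalyzedFrame` resolves
and executes against the same stored output, which mirrors the idea of the
proposed fix.
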
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/marmbrus/spark preanalyzer
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5217.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5217
----
commit 089c52e5b5fc44e1f75b9156146ce649317e2375
Author: Michael Armbrust <[email protected]>
Date:   2015-03-26T19:13:55Z

    WIP

commit dd4dec1194272c84a71095f889e529d0a7970f65
Author: Michael Armbrust <[email protected]>
Date:   2015-03-26T21:14:10Z

    Use the analyzed plan in DataFrame

----