GitHub user marmbrus opened a pull request:
https://github.com/apache/spark/pull/7920
[SPARK-9141][SQL] Remove project collapsing from DataFrame API
Currently we collapse successive projections that are added by
`withColumn`. However, this optimization violates the constraint that adding
nodes to a plan will never change its analyzed form and thus breaks caching.
Instead of doing early optimization, in this PR I just fix some
low-hanging slowness in the analyzer. In particular, I add a mechanism for
skipping already analyzed subplans. Since trees are generally immutable after
construction, it's safe to annotate a plan as already analyzed, as any
transformation will create a new tree with this bit no longer set. While this
is slightly slower than the collapsed form, it is still pretty fast for
experiments with up to 200 columns.
```
Original Code:
[info] 3430ms
[info] 2205ms
[info] 1973ms
[info] 1982ms
[info] 1916ms

Without Project Collapsing in DataFrame:
[info] 44610ms
[info] 45977ms
[info] 46423ms
[info] 46306ms
[info] 54723ms

With analyzer optimizations:
[info] 6394ms
[info] 4630ms
[info] 4388ms
[info] 4093ms
[info] 4113ms
```
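To illustrate why skipping already analyzed subplans helps, here is a minimal sketch in Python. It is not the code in this PR: `Plan`, `analyze`, and `count_visits` are hypothetical names, and the "analysis" only counts node visits. It assumes, as described above, that every newly constructed node starts with its analyzed bit unset.

```python
class Plan:
    """Hypothetical stand-in for a logical plan node (not Spark's LogicalPlan)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = tuple(children)
        # Every newly constructed node starts out unanalyzed.
        self.analyzed = False


def analyze(plan, stats, skip_analyzed=True):
    """Return an analyzed copy of `plan`, counting each node actually visited."""
    if skip_analyzed and plan.analyzed:
        # Subplan was analyzed earlier and has not been rebuilt since: reuse it.
        return plan
    children = [analyze(c, stats, skip_analyzed) for c in plan.children]
    stats["visited"] += 1
    # A real analyzer would resolve attributes here; this sketch only rebuilds.
    result = Plan(plan.name, children)
    result.analyzed = True
    return result


def count_visits(num_columns, skip_analyzed):
    """Add one projection per column, analyzing eagerly after each step."""
    stats = {"visited": 0}
    df = analyze(Plan("Relation"), stats, skip_analyzed)
    for i in range(num_columns):
        # Wrapping the plan creates a fresh, unanalyzed Project node on top.
        df = analyze(Plan(f"Project(c{i})", [df]), stats, skip_analyzed)
    return stats["visited"]


if __name__ == "__main__":
    print(count_visits(200, skip_analyzed=True))   # 201
    print(count_visits(200, skip_analyzed=False))  # 20301
```

With the flag, each added projection costs one node visit, so the total grows linearly with the number of columns. Without it, every step re-walks the whole chain and the total grows quadratically.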
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/marmbrus/spark withColumnCache
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7920.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7920
----
commit dee28763b4b72b7b344482d901fe82c24798c665
Author: Michael Armbrust <[email protected]>
Date: 2015-08-04T02:54:10Z
WIP
commit badd8243d07e156238955556f12463fd24271476
Author: Michael Armbrust <[email protected]>
Date: 2015-08-04T03:13:45Z
[SPARK-9141][SQL] Remove project collapsing from DataFrame API
----