GitHub user marmbrus opened a pull request:

    https://github.com/apache/spark/pull/7920

    [SPARK-9141][SQL] Remove project collapsing from DataFrame API

    Currently we collapse successive projections that are added by
`withColumn`.  However, this optimization violates the constraint that adding
nodes to a plan will never change its analyzed form, and thus breaks caching.
Instead of optimizing early, this PR fixes some low-hanging slowness in the
analyzer.  In particular, I add a mechanism for skipping already analyzed
subplans.  Since trees are generally immutable after construction, it's safe
to annotate a plan as already analyzed, because any transformation will create
a new tree with this bit no longer set.  While this is slower than the
collapsed form (roughly 2x in the timings below), it is still pretty fast for
experiments with up to 200 columns.
    
    ```
    Original Code
    [info] 3430ms
    [info] 2205ms
    [info] 1973ms
    [info] 1982ms
    [info] 1916ms
    
    Without Project Collapsing in DataFrame
    [info] 44610ms
    [info] 45977ms
    [info] 46423ms
    [info] 46306ms
    [info] 54723ms
    
    With analyzer optimizations
    [info] 6394ms
    [info] 4630ms
    [info] 4388ms
    [info] 4093ms
    [info] 4113ms
    ```
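
To illustrate the idea of skipping already analyzed subplans, here is a minimal toy model in Python. It is not the Catalyst code from this PR: `Plan`, `Analyzer`, `with_column`, and `build` are hypothetical names, and "analysis" is reduced to counting visited nodes. It only shows why an "analyzed" bit on an immutable tree makes repeated analysis of a growing plan roughly linear instead of quadratic.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Plan:
    """Hypothetical immutable plan node (a stand-in, not Spark's LogicalPlan)."""
    name: str
    children: Tuple["Plan", ...] = ()
    analyzed: bool = False  # only the analyzer builds nodes with this bit set


def with_column(plan: Plan, column: str) -> Plan:
    # Adding a column wraps the plan in a new projection node. The new node
    # is not marked analyzed, but its child keeps its own flag.
    return Plan(f"Project({column})", (plan,))


class Analyzer:
    def __init__(self, skip_analyzed: bool) -> None:
        self.skip_analyzed = skip_analyzed
        self.visited = 0  # number of nodes actually processed

    def analyze(self, plan: Plan) -> Plan:
        # Assumption of this sketch: a node marked analyzed never needs
        # to be visited again, so the whole subtree is returned as-is.
        if self.skip_analyzed and plan.analyzed:
            return plan
        self.visited += 1
        children = tuple(self.analyze(child) for child in plan.children)
        return Plan(plan.name, children, analyzed=True)


def build(columns: int, skip_analyzed: bool):
    """Add `columns` projections one at a time, analyzing after each step."""
    analyzer = Analyzer(skip_analyzed)
    plan = analyzer.analyze(Plan("Relation"))
    for i in range(columns):
        plan = analyzer.analyze(with_column(plan, f"c{i}"))
    return plan, analyzer.visited


if __name__ == "__main__":
    for skip in (True, False):
        _, visited = build(50, skip)
        print(f"skip_analyzed={skip}: visited {visited} nodes")
```

With the flag honored, each step visits only the newly added projection. Without it, every step walks the whole chain again, so the work grows with the square of the number of columns.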

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/marmbrus/spark withColumnCache

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7920.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #7920
    
----
commit dee28763b4b72b7b344482d901fe82c24798c665
Author: Michael Armbrust <[email protected]>
Date:   2015-08-04T02:54:10Z

    WIP

commit badd8243d07e156238955556f12463fd24271476
Author: Michael Armbrust <[email protected]>
Date:   2015-08-04T03:13:45Z

    [SPARK-9141][SQL] Remove project collapsing from DataFrame API

----

