GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/15517

    [SPARK-17972][SQL] Cache analyzed plan instead of optimized plan to avoid slow query planning

    ## What changes were proposed in this pull request?
    
    Iterative ML code can easily build query plans that grow exponentially. We found that query planning time also grows exponentially, even when all of the sub-plan trees are cached.
    
    The following snippet illustrates the problem:
    
    ```scala
    // assumes spark.implicits._ is in scope, as in spark-shell
    (0 until 6).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) =>
      val start = System.currentTimeMillis()
      val result = plan.join(plan, "value").join(plan, "value").join(plan, "value").join(plan, "value")
      result.cache()
      println(s"Iteration $iteration takes time ${System.currentTimeMillis() - start} ms")
      result.as[Int]
    }
    
    // Iteration 0 takes time 9 ms
    // Iteration 1 takes time 19 ms
    // Iteration 2 takes time 61 ms
    // Iteration 3 takes time 219 ms
    // Iteration 4 takes time 830 ms
    // Iteration 5 takes time 4080 ms
    ```
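    As a back-of-the-envelope sketch (plain Scala, no Spark required) of why the plan tree explodes: each iteration's `result` references the incoming `plan` five times (the left side of the first join plus the four join arguments), so the number of leaf relations in the logical plan is multiplied by five per iteration.

    ```scala
    // Hypothetical growth model, not Spark code: leaf-relation counts in
    // the logical plan produced by the snippet above.  Each iteration
    // references the previous plan five times, so the count is multiplied
    // by 5 per iteration.
    val leafCounts = Iterator.iterate(1L)(_ * 5L).take(7).toVector
    // leafCounts == Vector(1, 5, 25, 125, 625, 3125, 15625)
    // (head is the initial Dataset; one entry per subsequent iteration)
    ```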
    
    This is because what is cached for a Dataset is its optimized plan, while new Datasets are always built on top of the analyzed plans of existing Datasets. Matching a new query against the cache therefore requires optimizing the entire (exponentially growing) plan tree first, which is where the planning time goes. Caching analyzed plans instead of optimized plans resolves the issue.
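    The idea can be illustrated with a toy model (hypothetical `Plan`/`Scan`/`Join`/`CachedRelation` types, standing in for Spark's logical plans and `InMemoryRelation`, not the actual classes): if cached subtrees are keyed by the same plan form that new queries are built from, the planner can substitute them up front and only ever walks a constant-size tree per iteration.

    ```scala
    import scala.collection.mutable

    // Toy plan model -- hypothetical, standing in for Spark's logical plans.
    sealed trait Plan
    case class Scan(name: String) extends Plan
    case class Join(left: Plan, right: Plan) extends Plan
    case class CachedRelation(key: Plan) extends Plan  // stand-in for InMemoryRelation

    // Cache keyed by the plan form new queries are built from
    // (the analyzed plan, in Spark's terms).
    val cache = mutable.Map.empty[Plan, CachedRelation]

    // Substitute cached subtrees before doing any further planning work.
    def useCached(plan: Plan): Plan = cache.get(plan) match {
      case Some(rel) => rel
      case None => plan match {
        case Join(l, r) => Join(useCached(l), useCached(r))
        case other      => other
      }
    }

    def size(plan: Plan): Int = plan match {
      case Join(l, r) => 1 + size(l) + size(r)
      case _          => 1
    }

    var plan: Plan = Scan("base")
    var rawSizes = Vector.empty[Int]      // tree size with no substitution
    var plannedSizes = Vector.empty[Int]  // tree size the planner actually sees
    for (_ <- 0 until 4) {
      val result = Join(Join(plan, plan), plan)  // 3x growth per iteration
      rawSizes :+= size(result)
      plannedSizes :+= size(useCached(result))
      cache(result) = CachedRelation(result)     // key the cache on the built plan
      plan = result
    }
    // rawSizes grows geometrically (5, 17, 53, 161) while plannedSizes stays
    // constant at 5: cached subtrees are replaced before the tree is walked.
    ```

    In the pre-patch situation, by contrast, the cache key is only available after optimization, so the full raw tree must be processed on every iteration.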
    
    Results produced by the same snippet (run for 20 iterations) after applying this PR:
    
    ```scala
    // Iteration 0 takes time 9 ms
    // Iteration 1 takes time 45 ms
    // Iteration 2 takes time 42 ms
    // Iteration 3 takes time 33 ms
    // Iteration 4 takes time 37 ms
    // Iteration 5 takes time 64 ms
    // Iteration 6 takes time 42 ms
    // Iteration 7 takes time 38 ms
    // Iteration 8 takes time 50 ms
    // Iteration 9 takes time 53 ms
    // Iteration 10 takes time 48 ms
    // Iteration 11 takes time 45 ms
    // Iteration 12 takes time 49 ms
    // Iteration 13 takes time 46 ms
    // Iteration 14 takes time 45 ms
    // Iteration 15 takes time 41 ms
    // Iteration 16 takes time 49 ms
    // Iteration 17 takes time 46 ms
    // Iteration 18 takes time 38 ms
    // Iteration 19 takes time 32 ms
    ```
    
    ## How was this patch tested?
    
    Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark fix-slow-planning

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15517.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15517
    
----
commit 292ef36a363ee4b2e0eac6e6686fe33c9b962120
Author: Cheng Lian <l...@databricks.com>
Date:   2016-10-17T18:04:53Z

    Cache analyzed plan instead of optimized plan

----

