GitHub user liancheng opened a pull request:
https://github.com/apache/spark/pull/15517
[SPARK-17972][SQL] Cache analyzed plan instead of optimized plan to avoid slow query planning
## What changes were proposed in this pull request?
Iterative ML code can easily create query plans that grow exponentially. We
found that query planning time also increases exponentially, even when all the
sub-plan trees are cached.
The following snippet illustrates the problem:
```scala
(0 until 6).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) =>
  val start = System.currentTimeMillis()
  val result = plan.join(plan, "value").join(plan, "value")
    .join(plan, "value").join(plan, "value")
  result.cache()
  System.out.println(s"Iteration $iteration takes time ${System.currentTimeMillis() - start} ms")
  result.as[Int]
}
// Iteration 0 takes time 9 ms
// Iteration 1 takes time 19 ms
// Iteration 2 takes time 61 ms
// Iteration 3 takes time 219 ms
// Iteration 4 takes time 830 ms
// Iteration 5 takes time 4080 ms
```
This is because when caching a Dataset, what is actually cached is its
optimized plan, while new Datasets are always built on top of the analyzed
plans of existing Datasets. Consequently, the planner has to re-process the
ever-growing analyzed plan before cached sub-plans can be reused. Caching
analyzed plans instead of optimized plans resolves this issue.
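To make the difference concrete, here is a minimal, hypothetical sketch (toy
types only, not Spark's actual `CacheManager` API) of why keying the cache on
optimized plans forces re-optimization on every lookup, while keying it on
analyzed plans lets a hit short-circuit that work:
```scala
// Toy stand-ins for Spark's plan representations (illustrative only).
case class AnalyzedPlan(description: String)
case class OptimizedPlan(description: String)

// Pretend this is the expensive step whose cost grows with the size of the plan tree.
def optimize(plan: AnalyzedPlan): OptimizedPlan = OptimizedPlan(plan.description.toUpperCase)

object CacheKeyedByOptimizedPlan {
  // The situation described above: every lookup must first optimize the full,
  // ever-growing analyzed plan just to compute the cache key, even on a hit.
  private val cache = scala.collection.mutable.Map.empty[OptimizedPlan, String]
  def put(plan: AnalyzedPlan, data: String): Unit = cache(optimize(plan)) = data
  def lookup(plan: AnalyzedPlan): Option[String] = cache.get(optimize(plan))
}

object CacheKeyedByAnalyzedPlan {
  // The change sketched here: a hit is found directly on the analyzed plan,
  // so the expensive re-optimization of cached sub-trees can be skipped.
  private val cache = scala.collection.mutable.Map.empty[AnalyzedPlan, String]
  def put(plan: AnalyzedPlan, data: String): Unit = cache(plan) = data
  def lookup(plan: AnalyzedPlan): Option[String] = cache.get(plan)
}
```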
After applying this PR, the same snippet (run for more iterations) produces
the following results:
```scala
// Iteration 0 takes time 9 ms
// Iteration 1 takes time 45 ms
// Iteration 2 takes time 42 ms
// Iteration 3 takes time 33 ms
// Iteration 4 takes time 37 ms
// Iteration 5 takes time 64 ms
// Iteration 6 takes time 42 ms
// Iteration 7 takes time 38 ms
// Iteration 8 takes time 50 ms
// Iteration 9 takes time 53 ms
// Iteration 10 takes time 48 ms
// Iteration 11 takes time 45 ms
// Iteration 12 takes time 49 ms
// Iteration 13 takes time 46 ms
// Iteration 14 takes time 45 ms
// Iteration 15 takes time 41 ms
// Iteration 16 takes time 49 ms
// Iteration 17 takes time 46 ms
// Iteration 18 takes time 38 ms
// Iteration 19 takes time 32 ms
```
## How was this patch tested?
Existing tests.
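For context only, a hypothetical timing sanity check (not part of this PR's
test suite), using a scaled-down variant of the snippet above, might look like
this:
```scala
// Hypothetical standalone check (illustrative; names and thresholds are not from this PR).
import org.apache.spark.sql.SparkSession

object PlanningTimeCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("planning-time-check").getOrCreate()
    import spark.implicits._

    // Record how long each iteration spends building and caching the growing plan.
    val timings = (0 until 6).scanLeft((Seq(1, 2, 3).toDS, 0L)) { case ((plan, _), _) =>
      val start = System.currentTimeMillis()
      val result = plan.join(plan, "value").join(plan, "value")
      result.cache()
      (result.as[Int], System.currentTimeMillis() - start)
    }.map(_._2).drop(1) // drop the placeholder timing of the initial Dataset

    // Loose sanity bound: later iterations should not be orders of magnitude
    // slower than the first one once analyzed plans are cached.
    assert(timings.last < math.max(timings.head, 1L) * 100,
      s"Planning time appears to blow up across iterations: $timings")

    spark.stop()
  }
}
```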
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/liancheng/spark fix-slow-planning
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15517.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15517
----
commit 292ef36a363ee4b2e0eac6e6686fe33c9b962120
Author: Cheng Lian <[email protected]>
Date: 2016-10-17T18:04:53Z
Cache analyzed plan instead of optimized plan
----