[ https://issues.apache.org/jira/browse/SPARK-17972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15623363#comment-15623363 ]
Yin Huai commented on SPARK-17972:
----------------------------------

This issue has been resolved by https://github.com/apache/spark/pull/15651.

> Query planning slows down dramatically for large query plans even when
> sub-trees are cached
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17972
>                 URL: https://issues.apache.org/jira/browse/SPARK-17972
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.2, 2.0.1
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>             Fix For: 2.1.0
>
>
> The following Spark shell snippet creates a series of query plans that grow
> exponentially. The {{i}}-th plan is created using 4 *cached* copies of the
> {{i - 1}}-th plan.
> {code}
> (0 until 6).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) =>
>   val start = System.currentTimeMillis()
>   val result = plan.join(plan, "value").join(plan, "value").join(plan, "value").join(plan, "value")
>   result.cache()
>   System.out.println(s"Iteration $iteration takes time ${System.currentTimeMillis() - start} ms")
>   result.as[Int]
> }
> {code}
> We can see that although all plans are cached, the query planning time still
> grows exponentially and quickly becomes unbearable.
> {noformat}
> Iteration 0 takes time 9 ms
> Iteration 1 takes time 19 ms
> Iteration 2 takes time 61 ms
> Iteration 3 takes time 219 ms
> Iteration 4 takes time 830 ms
> Iteration 5 takes time 4080 ms
> {noformat}
> Similar scenarios can be found in iterative ML code and significantly affect
> usability.
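
For a rough sense of why the timings above escalate, the growth of the plan tree can be sketched without Spark. The snippet in the issue joins the plan with itself four times per iteration, so each new plan refers to the previous one five times. The model below assumes the planner walks the fully expanded tree and that each plan consists only of join nodes over a single leaf; both are simplifying assumptions for illustration, and `PlanGrowth` and `nodeCount` are hypothetical names, not Spark APIs.

```scala
// Hypothetical model, not Spark code: counts the nodes of a plan tree that is
// rebuilt each round by joining the previous plan with itself `joins` times.
object PlanGrowth {
  // Node count after `iterations` rounds, starting from a single leaf.
  // Each round adds `joins` join nodes on top of `joins + 1` copies of the
  // previous tree.
  def nodeCount(iterations: Int, joins: Int = 4): BigInt =
    (0 until iterations).foldLeft(BigInt(1)) { (nodes, _) =>
      nodes * (joins + 1) + joins
    }

  def main(args: Array[String]): Unit =
    (0 until 6).foreach { i =>
      // Iteration i of the issue's loop builds the (i + 1)-th generation.
      println(s"Iteration $i: ${nodeCount(i + 1)} nodes")
    }
}
```

Under these assumptions the node count grows by a factor approaching 5 per iteration, which is of the same order as the ratio between the last two reported timings (830 ms to 4080 ms). The earlier iterations do not fit the model as closely, so it should be read as an illustration of the trend rather than an explanation of the measurements.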