[ https://issues.apache.org/jira/browse/SPARK-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907518#comment-14907518 ]
Matei Zaharia commented on SPARK-9850:
--------------------------------------

Hey Imran, this could make sense, but note that the problem will only happen if you have 2000 map *output* partitions, which would've been 2000 reduce tasks normally. Otherwise, you can have as many map *tasks* as needed with fewer partitions. In most jobs, I'd expect data to get significantly smaller after the maps, so we'd catch that. In particular, for choosing between broadcast and shuffle joins this should be fine. We can do something different if we suspect that there is going to be tons of map output *and* we think there's nontrivial planning to be done once we see it.

> Adaptive execution in Spark
> ---------------------------
>
>                 Key: SPARK-9850
>                 URL: https://issues.apache.org/jira/browse/SPARK-9850
>             Project: Spark
>          Issue Type: Epic
>          Components: Spark Core, SQL
>            Reporter: Matei Zaharia
>            Assignee: Yin Huai
>         Attachments: AdaptiveExecutionInSpark.pdf
>
>
> Query planning is one of the main factors in high performance, but the current Spark engine requires the execution DAG for a job to be set in advance. Even with cost-based optimization, it is hard to know the behavior of data and user-defined functions well enough to always get great execution plans. This JIRA proposes to add adaptive query execution, so that the engine can change the plan for each query as it sees what data earlier stages produced.
> We propose adding this to Spark SQL / DataFrames first, using a new API in the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, the functionality could be extended to other libraries or the RDD API, but that is more difficult than adding it in SQL.
> I've attached a design doc by Yin Huai and myself explaining how it would work in more detail.
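The broadcast-vs-shuffle decision discussed above can be sketched in a few lines. This is an illustrative sketch only, not Spark's actual internals: the function name and the way sizes are passed in are hypothetical, and the 10 MB cutoff merely mirrors the default value of Spark SQL's `spark.sql.autoBroadcastJoinThreshold`. The point is that the decision uses *observed* map output sizes from a completed stage rather than pre-execution estimates.

```python
# Hypothetical sketch of the adaptive join choice: once a map stage has run,
# the engine can inspect the real per-partition output sizes and pick the
# cheaper strategy, instead of committing to a plan before execution.

# Assumption: 10 MB threshold, mirroring the default of Spark SQL's
# spark.sql.autoBroadcastJoinThreshold. All names here are illustrative.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join_strategy(map_output_sizes):
    """Pick a join strategy from observed per-partition map output sizes.

    If the total observed output of one side is small enough to broadcast,
    a broadcast join avoids shuffling the larger side entirely; otherwise
    fall back to a shuffle (e.g. sort-merge) join.
    """
    total_bytes = sum(map_output_sizes)
    return "broadcast" if total_bytes <= BROADCAST_THRESHOLD_BYTES else "shuffle"


# Example: 100 partitions of 1 KB each is far under the threshold,
# while three 10 MB partitions exceed it.
print(choose_join_strategy([1024] * 100))            # small side -> "broadcast"
print(choose_join_strategy([10 * 1024 * 1024] * 3))  # large side -> "shuffle"
```

Note this also connects to Matei's point about the 2000-partition cutoff: above 2000 map output partitions, Spark stores only approximate per-partition sizes, so the observed sizes feeding a decision like this become less precise.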
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org