[
https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229407#comment-14229407
]
Patrick Wendell edited comment on SPARK-4630 at 12/1/14 4:41 AM:
-----------------------------------------------------------------
Thanks Sandy - that's useful context. I think the big kicker is whether to take
into account the size of the materialized data or not - that one will have much
more complexity than all of the other ideas proposed here, but it's also the
most likely to lead to effective results. There are assumptions that go deep
into Spark that we know the number of partitions before evaluation of the DAG
happens. You can check out the Shark paper for some earlier work on relaxing
this assumption:
http://arxiv.org/pdf/1211.6176.pdf
Another question is: should we punt this type of optimization up to Spark SQL,
where we have a richer data model and might be able to relax assumptions
around internal execution more easily?
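One way to relax the know-partitions-up-front assumption is to decide the reduce-side partitioning only after the map-side output sizes are known. A minimal sketch of that idea, in Python rather than Spark's own code (the function name, grouping strategy, and target size are all assumptions, not anything from Spark itself):

```python
# Hypothetical sketch (not Spark code): given the byte sizes of the
# map-side shuffle output destined for each reduce partition, merge
# consecutive small partitions until each group approaches a target
# size. This is the kind of post-hoc coalescing being discussed here.
def coalesce_partitions(partition_sizes, target_bytes):
    groups, current, current_bytes = [], [], 0
    for i, size in enumerate(partition_sizes):
        current.append(i)
        current_bytes += size
        if current_bytes >= target_bytes:
            groups.append(current)
            current, current_bytes = [], 0
    if current:
        groups.append(current)
    return groups

# Example: six 20 MB partitions coalesced toward 64 MB groups.
mb = 1024 * 1024
print(coalesce_partitions([20 * mb] * 6, 64 * mb))
# -> [[0, 1, 2, 3], [4, 5]]
```

The point is that the grouping decision happens after the producing stage completes, which is exactly what the current DAG assumptions make awkward.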
> Dynamically determine optimal number of partitions
> --------------------------------------------------
>
> Key: SPARK-4630
> URL: https://issues.apache.org/jira/browse/SPARK-4630
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Kostas Sakellis
> Assignee: Kostas Sakellis
>
> Partition sizes play a big part in how fast stages execute during a Spark
> job. There is a direct relationship between the size of partitions and the
> number of tasks - larger partitions mean fewer tasks. For better performance,
> there is a sweet spot for how large the partitions executed by each task
> should be. If partitions are too small, the user pays a disproportionate
> cost in scheduling overhead. If partitions are too large, task execution
> slows down due to GC pressure and spilling to disk.
> To increase performance of jobs, users often hand-optimize the number (size)
> of partitions that the next stage gets. Factors that come into play are:
> - incoming partition sizes from the previous stage
> - number of available executors
> - available memory per executor (taking into account
> spark.shuffle.memoryFraction)
> Spark has access to this data and so should be able to automatically do the
> partition sizing for the user. This feature can be turned off/on with a
> configuration option.
> To make this happen, we propose modifying the DAGScheduler to take into
> account partition sizes upon stage completion. Before scheduling the next
> stage, the scheduler can examine the sizes of the partitions and determine
> the appropriate number of tasks to create. Since this change requires
> non-trivial modifications to the DAGScheduler, a detailed design doc will be
> attached before proceeding with the work.
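The sizing decision the description proposes can be sketched as a small heuristic. This is a hypothetical illustration only - the function name, the 64 MB target, and the lower bound on parallelism are assumptions, not the actual proposed design:

```python
# Hypothetical sketch of the heuristic described above: after a stage
# completes, pick the next stage's task count from the total output
# size, but never below the cluster's available parallelism so that
# executors are not left idle on small data.
import math

def pick_num_tasks(total_output_bytes, num_executors, cores_per_executor,
                   target_partition_bytes=64 * 1024 * 1024):
    by_size = math.ceil(total_output_bytes / target_partition_bytes)
    return max(by_size, num_executors * cores_per_executor)

# 10 GB of shuffle output on 4 executors x 4 cores -> 160 tasks.
print(pick_num_tasks(10 * 1024**3, 4, 4))
```

A real implementation would also fold in per-executor memory (including spark.shuffle.memoryFraction), as the factor list above notes.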
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)