[
https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229001#comment-14229001
]
Patrick Wendell edited comment on SPARK-4630 at 11/30/14 3:29 AM:
------------------------------------------------------------------
Hey Kos - before starting to work on the design for this feature, could you try
to quantify how important this actually is for performance? I.e. give some
examples at scale in some benchmarks or user workloads?
Spark in general is much less sensitive to the number of partitions than other
frameworks since the overhead of launching individual tasks is very small. For
this reason in the past we specifically decided not to introduce too much
complexity into Spark for this, but we did add some hueristics over time. It
seems like the proposal here is to extend the heuristics a bit, so some simpler
extensions might make sense.
was (Author: pwendell):
Hey Kos - before starting to work on the design for this feature, could you try
to quantify how important this actually is for performance? I.e. give some
examples at scale in some benchmarks or user workloads?
Spark in general is much less sensitive to the number of partitions than other
frameworks since the overhead of launching individual tasks is very small. For
this reason in the past we specifically decided not to introduce this
complexity into Spark.
> Dynamically determine optimal number of partitions
> --------------------------------------------------
>
> Key: SPARK-4630
> URL: https://issues.apache.org/jira/browse/SPARK-4630
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Kostas Sakellis
> Assignee: Kostas Sakellis
>
> Partition sizes play a big part in how fast stages execute during a Spark
> job. There is a direct relationship between the size of partitions to the
> number of tasks - larger partitions, fewer tasks. For better performance,
> Spark has a sweet spot for how large partitions should be that get executed
> by a task. If partitions are too small, then the user pays a disproportionate
> cost in scheduling overhead. If the partitions are too large, then task
> execution slows down due to gc pressure and spilling to disk.
> To increase performance of jobs, users often hand optimize the number(size)
> of partitions that the next stage gets. Factors that come into play are:
> Incoming partition sizes from previous stage
> number of available executors
> available memory per executor (taking into account
> spark.shuffle.memoryFraction)
> Spark has access to this data and so should be able to automatically do the
> partition sizing for the user. This feature can be turned off/on with a
> configuration option.
> To make this happen, we propose modifying the DAGScheduler to take into
> account partition sizes upon stage completion. Before scheduling the next
> stage, the scheduler can examine the sizes of the partitions and determine
> the appropriate number tasks to create. Since this change requires
> non-trivial modifications to the DAGScheduler, a detailed design doc will be
> attached before proceeding with the work.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]