[
https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229407#comment-14229407
]
Patrick Wendell edited comment on SPARK-4630 at 12/1/14 4:41 AM:
-----------------------------------------------------------------
Thanks Sandy - that's useful context. I think the big kicker is whether to take
into account the size of the materialized data or not - that one will have much
more complexity than all of the other ideas proposed here, but it's also the
most likely to lead to effective results. There are assumptions that go deep
into Spark that we know the number of partitions before evaluation of the DAG
happens. You can check out the Shark paper for some earlier work on relaxing
this assumption:
http://arxiv.org/pdf/1211.6176.pdf
Another question is: should we punt this type of optimization up to Spark SQL,
where we have a richer data model and might be able to relax assumptions
around internal execution more easily?
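One way to relax the know-partitions-up-front assumption is to decide the reduce-side partitioning only after the map-side output sizes are known. A minimal sketch of that idea, in Python rather than Spark's own code (the function name, grouping strategy, and target size are all assumptions, not anything from Spark itself):

```python
# Hypothetical sketch (not Spark code): given the byte sizes of the
# map-side shuffle output destined for each reduce partition, merge
# consecutive small partitions until each group approaches a target
# size. This is the kind of post-hoc coalescing being discussed here.
def coalesce_partitions(partition_sizes, target_bytes):
    groups, current, current_bytes = [], [], 0
    for i, size in enumerate(partition_sizes):
        current.append(i)
        current_bytes += size
        if current_bytes >= target_bytes:
            groups.append(current)
            current, current_bytes = [], 0
    if current:
        groups.append(current)
    return groups

# Example: six 20 MB partitions coalesced toward 64 MB groups.
mb = 1024 * 1024
print(coalesce_partitions([20 * mb] * 6, 64 * mb))
# -> [[0, 1, 2, 3], [4, 5]]
```

The point is that the grouping decision happens after the producing stage completes, which is exactly what the current DAG assumptions make awkward.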
> Dynamically determine optimal number of partitions
> --------------------------------------------------
>
> Key: SPARK-4630
> URL: https://issues.apache.org/jira/browse/SPARK-4630
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Kostas Sakellis
> Assignee: Kostas Sakellis
>
> Partition sizes play a big part in how fast stages execute during a Spark
> job. There is a direct relationship between the size of partitions and the
> number of tasks - larger partitions mean fewer tasks. For better performance,
> there is a sweet spot for how large the partitions executed by each task
> should be. If partitions are too small, the user pays a disproportionate
> cost in scheduling overhead. If partitions are too large, task execution
> slows down due to GC pressure and spilling to disk.
> To increase performance of jobs, users often hand-optimize the number (size)
> of partitions that the next stage gets. Factors that come into play are:
> - incoming partition sizes from the previous stage
> - number of available executors
> - available memory per executor (taking into account
> spark.shuffle.memoryFraction)
> Spark has access to this data and so should be able to automatically do the
> partition sizing for the user. This feature can be turned off/on with a
> configuration option.
> To make this happen, we propose modifying the DAGScheduler to take into
> account partition sizes upon stage completion. Before scheduling the next
> stage, the scheduler can examine the sizes of the partitions and determine
> the appropriate number of tasks to create. Since this change requires
> non-trivial modifications to the DAGScheduler, a detailed design doc will be
> attached before proceeding with the work.
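The sizing decision the description proposes can be sketched as a small heuristic. This is a hypothetical illustration only - the function name, the 64 MB target, and the lower bound on parallelism are assumptions, not the actual proposed design:

```python
# Hypothetical sketch of the heuristic described above: after a stage
# completes, pick the next stage's task count from the total output
# size, but never below the cluster's available parallelism so that
# executors are not left idle on small data.
import math

def pick_num_tasks(total_output_bytes, num_executors, cores_per_executor,
                   target_partition_bytes=64 * 1024 * 1024):
    by_size = math.ceil(total_output_bytes / target_partition_bytes)
    return max(by_size, num_executors * cores_per_executor)

# 10 GB of shuffle output on 4 executors x 4 cores -> 160 tasks.
print(pick_num_tasks(10 * 1024**3, 4, 4))
```

A real implementation would also fold in per-executor memory (including spark.shuffle.memoryFraction), as the factor list above notes.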
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)