[
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15756179#comment-15756179
]
Mridul Muralidharan commented on SPARK-18886:
---------------------------------------------
- Delay 'using up' all resources - another task/taskset with a better locality
preference might become available for that executor (also, consider the impact on
speculative execution).
- A delay can allow a better locality preference to become available for the
task (the sketch below shows roughly how the per-level delay is applied). A
suboptimal schedule has a cascading effect on the rest of the executors, the
application, and the cluster.
- Note that not all executors are available at the same time in resourceOffer:
there are periodic bulk reschedules, sporadic reschedules when tasks finish, and
periodic bulk speculative scheduling updates.
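For reference, delay scheduling in the TaskSetManager behaves roughly like the
sketch below. This is a simplified illustration, not the actual implementation;
the level names and wait lookups are assumptions standing in for the
spark.locality.wait.* settings.

{code:scala}
// Simplified sketch of per-level delay scheduling; not Spark's actual TaskSetManager code.
// The caller is assumed to update lastLaunchTimeMs on every task launch at an allowed
// level, so if tasks keep launching locally within the wait, the level never relaxes.
object DelaySchedulingSketch {
  sealed trait Level
  case object ProcessLocal extends Level
  case object NodeLocal extends Level
  case object RackLocal extends Level
  case object AnyLevel extends Level

  // Assumed waits, standing in for spark.locality.wait.process/node/rack (3s defaults).
  val waitMs: Map[Level, Long] =
    Map(ProcessLocal -> 3000L, NodeLocal -> 3000L, RackLocal -> 3000L)

  val levels: Vector[Level] = Vector(ProcessLocal, NodeLocal, RackLocal, AnyLevel)

  /** Most local level we are currently allowed to schedule at. */
  def allowedLevel(lastLaunchTimeMs: Long, nowMs: Long): Level = {
    var elapsed = nowMs - lastLaunchTimeMs
    var i = 0
    // Relax to a less local level only after the wait for the current level has elapsed.
    while (i < levels.length - 1 && elapsed >= waitMs(levels(i))) {
      elapsed -= waitMs(levels(i))
      i += 1
    }
    levels(i)
  }
}
{code}

In the scenario described in the issue below, each quick local launch effectively
resets the clock, which is why the wait beyond the most local level is never
actually paid.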
> Delay scheduling should not delay some executors indefinitely if one task is
> scheduled before delay timeout
> -----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-18886
> URL: https://issues.apache.org/jira/browse/SPARK-18886
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 2.1.0
> Reporter: Imran Rashid
>
> Delay scheduling can introduce an unbounded delay and underutilization of
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for only a subset of the available resources.
> 2. Tasks finish in less time than the delay scheduling wait.
> Instead of having *one* delay to wait for resources with better locality,
> Spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500
> tasks. Say all tasks have a preference for one executor, which is by itself
> on one host. Given the default locality wait of 3s per level, we end up with
> a 6s delay till we schedule on other hosts (process wait + host wait).
> If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks
> get scheduled on _only one_ executor. This means you're only using 1% of
> your cluster, and you get a ~100x slowdown. You'd actually be better off if
> tasks took 7 seconds.
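> As a rough back-of-the-envelope calculation (assuming one core per executor,
> which is not stated above):
> {code:scala}
> // Back-of-the-envelope numbers for the scenario above (one core per executor assumed).
> val tasks = 500
> val taskTimeSec = 5.0
> val executors = 100
>
> val oneExecutorSec  = tasks * taskTimeSec                                  // 2500s, all tasks serialized
> val allExecutorsSec = math.ceil(tasks.toDouble / executors) * taskTimeSec  // 25s, spread over the cluster
> val slowdown        = oneExecutorSec / allExecutorsSec                     // ~100x
> {code}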
> *WORKAROUNDS*:
> (1) You can change the locality wait times so that they are shorter than the
> task execution time. You need to take into account the sum of all wait times
> to use all the resources on your cluster. For example, if you have resources
> on different racks, this will include the sum of
> "spark.locality.wait.process" + "spark.locality.wait.node" +
> "spark.locality.wait.rack". Those each default to "3s". The simplest way to
> be to set "spark.locality.wait.process" to your desired wait interval, and
> set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".
> For example, if your tasks take ~3 seconds on average, you might set
> "spark.locality.wait.process" to "1s".
> Note that this workaround isn't perfect -- with less delay scheduling, you may
> not get as good resource locality. After this issue is fixed, you'd most
> likely want to undo these configuration changes.
> (2) The worst case here will only happen if your tasks have extreme skew in
> their locality preferences. Users may be able to modify their job to control
> the distribution of the original input data.
> (2a) A shuffle may end up with very skewed locality preferences, especially
> if you do a repartition starting from a small number of partitions. (Shuffle
> locality preference is assigned if any node has more than 20% of the shuffle
> input data -- by chance, you may have one node just above that threshold, and
> all other nodes just below it.) In this case, you can turn off locality
> preference for shuffle data by setting
> {{spark.shuffle.reduceLocality.enabled=false}}
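> For instance (set before the SparkContext is created):
> {code:scala}
> // Turn off locality preferences for shuffle (reduce-side) data.
> import org.apache.spark.SparkConf
>
> val conf = new SparkConf()
>   .set("spark.shuffle.reduceLocality.enabled", "false")
> {code}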