[
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931893#comment-15931893
]
Imran Rashid commented on SPARK-18886:
--------------------------------------
Thanks Kay for the full description (and finding the old jira, sorry I didn't
notice the duplicate). Your explanation and alternative make sense. One
detail from v1:
bq. flag passed by the TSM indicates that there are no other unused slots in
the cluster
neither the TSM nor TaskSchedulerImpl currently tracks this -- they know about
executors, but not individual slots. With bulk-scheduling calls to
{{resourceOffer()}} that include the entire set of slots, that isn't a problem,
but it is for single offers. Anyway, it's still solvable, just with more
bookkeeping and a more complex change.
bq. But often for Spark, you have one job running alone, in which case delay
scheduling should arguably be turned off altogether, as you suggested earlier
Imran. But let's separate that discussion from this one, of how to make it work
better.
yeah, you can see that earlier in the thread I was trying to figure out what
the purpose of this was anyway ... I am going to recommend that folks turn it
off more often. But even when you have just one job running at a time, this
still matters for jobs with parallel stages in the DAG, e.g. a join. Fairness
doesn't matter at all between the stages, but overall efficiency does. If you
turn delay scheduling off entirely, then whichever taskset comes first will get
all the resources, rather than giving both a shot at local resources. So I
feel like the right recommendation is {{1ms}}. There is probably something
else to fix, and another jira, here, though I don't have a clear idea of it
yet.
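To make the parallel-stages point concrete, here is a toy model (plain Python, not Spark's actual scheduler; the executor names, preferences, and offer order are all made up):

```python
# Toy model: two tasksets ("A" and "B"), each preferring half the executors.
# With delay scheduling off, whichever taskset is offered a slot first takes
# it regardless of locality; with even a tiny wait, each taskset declines
# non-local offers and both end up on their preferred executors.
executors = ["e1", "e2", "e3", "e4"]
prefs = {"A": {"e1", "e2"}, "B": {"e3", "e4"}}

def assign(wait_enabled):
    assigned = {"A": [], "B": []}
    for e in executors:
        for ts in ("A", "B"):  # taskset A is always offered each slot first
            if not wait_enabled or e in prefs[ts]:
                assigned[ts].append(e)
                break          # slot taken; move on to the next executor
    return assigned

print(assign(wait_enabled=False))  # A grabs every slot: {'A': ['e1', 'e2', 'e3', 'e4'], 'B': []}
print(assign(wait_enabled=True))   # both get local slots: {'A': ['e1', 'e2'], 'B': ['e3', 'e4']}
```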
I will keep thinking about your v2. What you are proposing makes sense, but I
worry that we keep band-aiding these situations where things are really bad,
while we're still stuck with a system where the delay really needs to be tuned
closely to the task length, otherwise there is a lot of inefficiency. This
wasted time isn't even tracked anywhere (it's not included in "scheduler
delay"), so users have no idea they're hitting this.
> Delay scheduling should not delay some executors indefinitely if one task is
> scheduled before delay timeout
> -----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-18886
> URL: https://issues.apache.org/jira/browse/SPARK-18886
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 2.1.0
> Reporter: Imran Rashid
>
> Delay scheduling can introduce an unbounded delay and underutilization of
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for a subset of available resources
> 2. Tasks finish in less time than the delay scheduling wait.
> Instead of having *one* delay to wait for resources with better locality,
> spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500
> tasks. Say all tasks have a preference for one executor, which is by itself
> on one host. Given the default locality wait of 3s per level, we end up with
> a 6s delay until we schedule on other hosts (process wait + node wait).
> If each task takes 5 seconds (under the 6-second delay), then _all 500_ tasks
> get scheduled on _only one_ executor. This means you're only using 1% of
> your cluster, and you get a ~100x slowdown. You'd actually be better off if
> tasks took 7 seconds.
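The arithmetic above, as a toy calculation (plain Python, not Spark code):

```python
import math

# Numbers from the example above: 500 tasks, 100 executors, 5 s tasks,
# 3 s process wait + 3 s node wait = 6 s before falling back to other hosts.
num_tasks = 500
num_executors = 100
task_time_s = 5.0
delay_s = 3.0 + 3.0

# Each task finishes before the 6 s delay expires, resetting the locality
# timer, so every task runs serially on the single preferred executor.
time_one_executor = num_tasks * task_time_s                               # 2500 s

# If tasks instead spread across all executors (what you'd get once the
# delay expired), the job would take only a handful of waves.
time_all_executors = math.ceil(num_tasks / num_executors) * task_time_s   # 25 s

slowdown = time_one_executor / time_all_executors
print(slowdown)  # 100.0 -- the ~100x slowdown described above
```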
> *WORKAROUNDS*:
> (1) You can change the locality wait times so that it is shorter than the
> task execution time. You need to take into account the sum of all wait times
> to use all the resources on your cluster. For example, if you have resources
> on different racks, this will include the sum of
> "spark.locality.wait.process" + "spark.locality.wait.node" +
> "spark.locality.wait.rack". Those each default to "3s". The simplest approach
> is to set "spark.locality.wait.process" to your desired wait interval, and
> set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".
> For example, if your tasks take ~3 seconds on average, you might set
> "spark.locality.wait.process" to "1s". *NOTE*: due to SPARK-18967, avoid
> setting {{spark.locality.wait=0}} -- instead, use
> {{spark.locality.wait=1ms}}.
> Note that this workaround isn't perfect -- with less delay scheduling, you may
> not get as good resource locality. After this issue is fixed, you'd most
> likely want to undo these configuration changes.
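Workaround (1) as a sketch -- the keys are real Spark settings, just collected here in a plain dict that you could feed to a SparkConf or to {{spark-submit --conf}}; the "1s" value assumes tasks averaging ~3 seconds, as in the example above:

```python
# Sketch of workaround (1): shorten the process-local wait below the average
# task time, and zero out the node and rack waits so they don't add up.
locality_conf = {
    "spark.locality.wait.process": "1s",  # below the ~3 s average task time
    "spark.locality.wait.node": "0",
    "spark.locality.wait.rack": "0",
}

# In a real job this would be applied via SparkConf, e.g.:
#   conf = SparkConf()
#   for k, v in locality_conf.items():
#       conf.set(k, v)
for key, value in locality_conf.items():
    print(f"--conf {key}={value}")
```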
> (2) The worst case here will only happen if your tasks have extreme skew in
> their locality preferences. Users may be able to modify their job to
> control the distribution of the original input data.
> (2a) A shuffle may end up with very skewed locality preferences, especially
> if you do a repartition starting from a small number of partitions. (Shuffle
> locality preference is assigned if any node has more than 20% of the shuffle
> input data -- by chance, you may have one node just above that threshold, and
> all other nodes just below it.) In this case, you can turn off locality
> preference for shuffle data by setting
> {{spark.shuffle.reduceLocality.enabled=false}}.
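A simplified model of the 20% threshold described above (the function and variable names are illustrative, not Spark's actual implementation):

```python
# Simplified model of the shuffle reducer-locality rule described above:
# a node gets a locality preference only if it holds more than 20% of the
# shuffle input. With skew, one node just above the threshold can end up
# as the *only* preferred node, and delay scheduling piles reducers onto it.
FRACTION_THRESHOLD = 0.2

def preferred_nodes(bytes_by_node):
    total = sum(bytes_by_node.values())
    return [node for node, b in bytes_by_node.items()
            if b > FRACTION_THRESHOLD * total]

# nodeA holds 21% of 100 bytes of shuffle data, the rest hold 19-20%:
# only nodeA crosses the threshold, so only nodeA is preferred.
skewed = {"nodeA": 21, "nodeB": 19, "nodeC": 20, "nodeD": 20, "nodeE": 20}
print(preferred_nodes(skewed))  # ['nodeA']
```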