[ https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752579#comment-15752579 ]

Mridul Muralidharan edited comment on SPARK-18886 at 12/15/16 9:35 PM:
-----------------------------------------------------------------------

[~imranr] For almost all cases, delay scheduling dramatically increases 
performance. The difference even between PROCESS and NODE is significant 
(between NODE and 'lower' levels, it depends on your network config).
It has a non-trivial impact both for tasks with short duration and for tasks 
processing large amounts of data; for long tasks processing small data it is 
not as useful in comparison iirc, and the same goes for degenerate cases where 
the locality preference is suboptimal to begin with. [As an aside, not being 
able to specify a PROCESS level locality preference is actually a drawback in 
our api]

The job(s) I mentioned where we set it to 0 were special cases where we knew 
the costs well enough to make the decision to lower it, but I would not 
recommend it unless users are very sure of what they are doing. While analysing 
the cost, it should also be kept in mind that transferring data across nodes 
impacts not just the spark job but every other job in the cluster.
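
For illustration only (not a recommendation), lowering the wait amounts to 
something like the following; the single parent setting shown is the default 
for the per-level process/node/rack waits:

{code}
// Illustrative sketch: disable delay scheduling entirely by zeroing the locality wait.
// Only sensible when the cross-node transfer cost is known to be small.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("no-delay-scheduling-example")   // hypothetical app name
  .config("spark.locality.wait", "0")       // parent default for the per-level waits
  .getOrCreate()
{code}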



> Delay scheduling should not delay some executors indefinitely if one task is 
> scheduled before delay timeout
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18886
>                 URL: https://issues.apache.org/jira/browse/SPARK-18886
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.1.0
>            Reporter: Imran Rashid
>
> Delay scheduling can introduce an unbounded delay and underutilization of 
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for a subset of available resources
> 2. Tasks finish in less time than the delay scheduling wait.
> Instead of having *one* delay to wait for resources with better locality, 
> spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500 
> tasks.  Say all tasks have a preference for one executor, which is by itself 
> on one host.  Given the default locality wait of 3s per level, we end up with 
> a 6s delay till we schedule on other hosts (process wait + host wait).
> If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks 
> get scheduled on _only one_ executor.  This means you're only using 1% of 
> your cluster, and you get a ~100x slowdown (500 tasks at 5 seconds each run 
> serially in ~2500s on that one executor, versus ~25s if they were spread 
> across all 100 executors).  You'd actually be better off if tasks took 7 
> seconds.
> *WORKAROUNDS*: 
> (1) You can change the locality wait times so that it is shorter than the 
> task execution time.  You need to take into account the sum of all wait times 
> to use all the resources on your cluster.  For example, if you have resources 
> on different racks, this will include the sum of 
> "spark.locality.wait.process" + "spark.locality.wait.node" + 
> "spark.locality.wait.rack".  Those each default to "3s".  The simplest way to 
> be to set "spark.locality.wait.process" to your desired wait interval, and 
> set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".  
> For example, if your tasks take ~3 seconds on average, you might set 
> "spark.locality.wait.process" to "1s".
> Note that this workaround isn't perfect -- with less delay scheduling, you may 
> not get as good resource locality.  After this issue is fixed, you'd most 
> likely want to undo these configuration changes.
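> As a minimal sketch of the settings described above (values are illustrative; 
> pick a process wait shorter than your typical task time):
> {code}
> // Workaround (1) sketch: keep a short process-local wait, drop the node/rack waits.
> val spark = org.apache.spark.sql.SparkSession.builder()
>   .config("spark.locality.wait.process", "1s")
>   .config("spark.locality.wait.node", "0")
>   .config("spark.locality.wait.rack", "0")
>   .getOrCreate()
> {code}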
> (2) The worst case here will only happen if your tasks have extreme skew in 
> their locality preferences.  Users may be able to modify their job to 
> control the distribution of the original input data.
> (2a) A shuffle may end up with very skewed locality preferences, especially 
> if you do a repartition starting from a small number of partitions.  (Shuffle 
> locality preference is assigned if any node has more than 20% of the shuffle 
> input data -- by chance, you may have one node just above that threshold, and 
> all other nodes just below it.)  In this case, you can turn off locality 
> preference for shuffle data by setting 
> {{spark.shuffle.reduceLocality.enabled=false}}
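> A minimal sketch of applying that setting at application startup (it can also 
> be passed with --conf on spark-submit):
> {code}
> // Workaround (2a) sketch: ignore locality preferences derived from shuffle data.
> val spark = org.apache.spark.sql.SparkSession.builder()
>   .config("spark.shuffle.reduceLocality.enabled", "false")
>   .getOrCreate()
> {code}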


