Hi everyone,

I'm currently porting a MapReduce application to Spark (on a YARN cluster), and 
I'd like your insight regarding the tuning of the number of executors.

This application is in fact a template that users can use to launch a variety 
of jobs, whose input data ranges from tens to thousands of splits (partitions) 
and whose typical wall clock time is 400 to 9000 seconds; these jobs are 
experiments that are usually run only once or a few times, so they are not 
easily tunable for resource consumption.

As such, and as is the case with the MR jobs, I'd like users not to have to 
specify a number of executors themselves, which is why I explored the 
possibilities of Dynamic Allocation in Spark.
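
For reference, enabling dynamic allocation itself only needs the standard, 
job-independent settings (plus the external shuffle service on YARN):

    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.shuffle.service.enabled=true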

The current implementation of dynamic allocation targets a total number of 
executors such that each executor core executes exactly one task. This seems to 
contradict the Spark guideline that tasks should remain small and that each 
executor core should process several tasks. It does give the best overall 
latency, but at the cost of a huge waste of resources, because some executors 
are not fully used, or even not used at all after having been allocated: in a 
representative set of experiments from our users, performed on an idle queue as 
well as on a busy queue, the latency in Spark decreases on average by 43%, but 
at the cost of a 114% increase in vcore-hours usage w.r.t. the legacy MR job.
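
To make the current behavior concrete, here is a simplified sketch of the 
target computation (my reading of the allocation manager, assuming one CPU per 
task; not the actual Spark source):

    val executorCores          = 4      // spark.executor.cores
    val runningAndPendingTasks = 1000   // tasks of the current stages
    // one task per executor core => 250 executors requested for this stage
    val targetExecutors =
      math.ceil(runningAndPendingTasks.toDouble / executorCores).toInt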

Up to now I can't migrate these jobs to Spark because of this doubling of 
resource usage.

I made a proposal to allow tuning the target number of tasks that each 
executor core (aka taskSlot) should process, which gives a way to tune the 
tradeoff between latency and vcore-hours consumption: 
https://issues.apache.org/jira/browse/SPARK-22683
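
Continuing the sketch above with the proposed knob (the parameter name below is 
illustrative; see the JIRA for the exact proposal), the target is simply 
divided by the number of tasks each slot should process:

    val tasksPerSlot = 3   // illustrative value of the proposed parameter
    val targetExecutors =
      math.ceil(1000.0 / (4 * tasksPerSlot)).toInt   // 84 executors instead of 250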

As detailed in the proposal, I've been able to reach a 37% reduction in latency 
at iso-consumption (2 tasks per taskSlot), or a 30% reduction in resource usage 
at iso-latency (6 tasks per taskSlot), or a sweet spot of a 20% reduction in 
resource consumption and a 28% reduction in latency at 3 tasks per taskSlot. 
These figures are again averages over a representative set of jobs our users 
currently launch, and are to be compared with the doubling of resource usage of 
the current Spark dynamic allocation behavior w.r.t. MR.

As mentioned by Sean Owen in our discussion of the proposal, we currently have 
two options for tuning the behavior of the dynamic allocation of executors, 
maxExecutors and schedulerBacklogTimeout, but these parameters would need to be 
tuned on a per-job basis, which is not compatible with the one-shot nature of 
the jobs I'm talking about.
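
For reference, this is the kind of per-job tuning that would be needed today 
with these existing knobs (values purely illustrative):

    --conf spark.dynamicAllocation.maxExecutors=50 \
    --conf spark.dynamicAllocation.schedulerBacklogTimeout=30s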

I nevertheless tried to tune one series of jobs with schedulerBacklogTimeout, 
and managed to get a similar vcore-hours consumption at the expense of a 20% 
increase in latency:
- the resulting value of schedulerBacklogTimeout is only valid for other jobs 
that have a similar wall clock time, so it will not transpose to all the jobs 
our users launch;
- even with a manually-tuned job, I don't get the same efficiency as with the 
more global default I can set using my proposal.

Thanks for any feedback on the pros and cons of adding a third parameter to 
dynamic allocation to optimize the latency / consumption tradeoff over a family 
of jobs, or for any suggestion on how to reduce resource usage without per-job 
tuning using the existing Dynamic Allocation policy.

Julien
