Hi everyone,

I'm currently porting a MapReduce application to Spark (on a YARN cluster), and I'd like to have your insight regarding the tuning of the number of executors.
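For context, these jobs run with dynamic allocation enabled and without any per-job executor count, along the lines of the snippet below. This is only a minimal sketch: the property names are the standard spark.dynamicAllocation.* ones, but the resource values are purely illustrative, not our production settings.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Illustrative settings only: dynamic allocation enabled, no fixed executor count.
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")  // required by dynamic allocation on YARN
      .set("spark.executor.cores", "2")
      .set("spark.dynamicAllocation.minExecutors", "0")
      .set("spark.dynamicAllocation.maxExecutors", "500")
      .set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")  // the default
    val spark = SparkSession.builder().config(conf).getOrCreate()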
This application is in fact a template that users can use to launch a variety of jobs, whose input data ranges from tens to thousands of splits and whose typical wall-clock time goes from 400 to 9000 seconds. These jobs are experiments that are usually run only once or a few times, so they cannot realistically be tuned for resource consumption. As is already the case with the MR jobs, I'd like users not to have to specify a number of executors themselves, which is why I explored Spark's dynamic allocation.

The current implementation of dynamic allocation targets a total number of executors such that each executor core executes exactly one task. This seems to contradict the Spark guidelines that tasks should remain small and that each executor core should process several tasks. It does give the best overall latency, but at the cost of a huge waste of resources, because some executors are not fully used, or even never used at all after having been allocated: in a representative set of experiments from our users, run in an idle queue as well as in a busy queue, the Spark latency decreases on average by 43%, but at the cost of a 114% increase in vcore-hours usage w.r.t. the legacy MR jobs. Up to now I cannot migrate these jobs to Spark because of this doubling of resource usage.

I made a proposal to allow tuning the target number of tasks that each executor core (a.k.a. task slot) should process, which gives a way to tune the trade-off between latency and vcore-hours consumption: https://issues.apache.org/jira/browse/SPARK-22683 (a rough sketch of the idea is in the PS below). As detailed in the proposal, I have been able to reach a 37% reduction in latency at iso-consumption (2 tasks per task slot), a 30% reduction in resource usage at iso-latency (6 tasks per task slot), or a sweet spot of a 20% reduction in resource consumption together with a 28% reduction in latency at 3 tasks per slot. These figures are again averages over a representative set of the jobs our users currently launch, and are to be compared with the doubling of resource usage of the current Spark dynamic allocation behavior w.r.t. MR.

As Sean Owen mentioned in our discussion of the proposal, we currently have two options for tuning the behavior of the dynamic allocation of executors, maxExecutors and schedulerBacklogTimeout, but these parameters would need to be tuned on a per-job basis, which is not compatible with the one-shot nature of the jobs I'm describing. I nevertheless tried to tune one series of jobs with schedulerBacklogTimeout, and managed to get a similar vcore-hours consumption at the expense of 20% added latency, but:
- the resulting value of schedulerBacklogTimeout is only valid for other jobs with a similar wall-clock time, so it will not transpose to all the jobs our users launch;
- even with a manually tuned job, I don't get the same efficiency as with a more global default set using my proposal.

Thanks for any feedback on the pros and cons of adding a third parameter to dynamic allocation to optimize the latency/consumption trade-off over a family of jobs, or for any suggestion on how to reduce resource usage without per-job tuning using the existing dynamic allocation policy.

Julien
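PS: to make the proposal concrete, here is a rough, self-contained sketch of the idea. This is not the actual ExecutorAllocationManager code, and the parameter name tasksPerSlot is just a placeholder; it only illustrates that today the target is effectively one task per slot, while the proposal divides it by the number of tasks each slot is expected to process.

    object DynamicAllocationSketch {
      // Current behavior: target = ceil(tasks / coresPerExecutor), i.e. 1 task per slot.
      // Proposal: divide further by tasksPerSlot so each slot processes several tasks.
      def targetExecutors(runningAndPendingTasks: Int,
                          coresPerExecutor: Int,
                          tasksPerSlot: Int = 1,
                          maxExecutors: Int = Int.MaxValue): Int = {
        val wanted = math.ceil(
          runningAndPendingTasks.toDouble / (coresPerExecutor * tasksPerSlot)).toInt
        math.min(wanted, maxExecutors)  // spark.dynamicAllocation.maxExecutors still caps the target
      }

      def main(args: Array[String]): Unit = {
        // Illustrative stage: 1200 tasks, 2 cores per executor.
        println(targetExecutors(1200, 2))                    // current behavior -> 600 executors
        println(targetExecutors(1200, 2, tasksPerSlot = 3))  // the "sweet spot" above -> 200 executors
      }
    }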