[ https://issues.apache.org/jira/browse/SPARK-21122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-21122. ---------------------------------- Resolution: Incomplete > Address starvation issues when dynamic allocation is enabled > ------------------------------------------------------------ > > Key: SPARK-21122 > URL: https://issues.apache.org/jira/browse/SPARK-21122 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN > Affects Versions: 2.2.0, 2.3.0 > Reporter: Craig Ingram > Priority: Major > Labels: bulk-closed > Attachments: Preventing Starvation with Dynamic Allocation Enabled.pdf > > > When dynamic resource allocation is enabled on a cluster, it’s currently > possible for one application to consume all the cluster’s resources, > effectively starving any other application trying to start. This is > particularly painful in a notebook environment where notebooks may be idle > for tens of minutes while the user is figuring out what to do next (or eating > their lunch). Ideally the application should give resources back to the > cluster when monitoring indicates other applications are pending. > Before delving into the specifics of the solution. There are some workarounds > to this problem that are worth mentioning: > * Set spark.dynamicAllocation.maxExecutors to a small value, so that users > are unlikely to use the entire cluster even when many of them are doing work. > This approach will hurt cluster utilization. > * If using YARN, enable preemption and have each application (or > organization) run in a separate queue. The downside of this is that when YARN > preempts, it doesn't know anything about which executor it's killing. It > would just as likely kill a long running executor with cached data as one > that just spun up. Moreover, given a feature like > https://issues.apache.org/jira/browse/SPARK-21097 (preserving cached data on > executor decommission), YARN may not wait long enough between trying to > gracefully and forcefully shut down the executor. This would mean the blocks > that belonged to that executor would be lost and have to be recomputed. > * Configure YARN to use the capacity scheduler with multiple scheduler > queues. Put high-priority notebook users into a high-priority queue. Prevents > high-priority users from being starved out by low-priority notebook users. > Does not prevent users in the same priority class from starving each other. > Obviously any solution to this problem that depends on YARN would leave other > resource managers out in the cold. The solution proposed in this ticket will > afford spark clusters the flexibly to hook in different resource allocation > policies to fulfill their user's needs regardless of resource manager choice. > Initially the focus will be on users in a notebook environment. When > operating in a notebook environment with many users, the goal is fair > resource allocation. Given that all users will be using the same memory > configuration, this solution will focus primarily on fair sharing of cores. > The fair resource allocation policy should pick executors to remove based on > three factors initially: idleness, presence of cached data, and uptime. The > policy will favor removing executors that are idle, short-lived, and have no > cached data. The policy will only preemptively remove executors if there are > pending applications or cores (otherwise the default dynamic allocation > timeout/removal process is followed). The policy will also allow an > application's resource consumption to expand based on cluster utilization. > For example if there are 3 applications running but 2 of them are idle, the > policy will allow a busy application with pending tasks to consume more than > 1/3rd of the the cluster's resources. > More complexity could be added to take advantage of task/stage metrics, > histograms, and heuristics (i.e. favor removing executors running tasks that > are quick). The important thing here is to benchmark effectively before > adding complexity so we can measure the impact of the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org