[jira] [Updated] (MESOS-8209) mesos master should revoke offers when executor state changes

Jack Crawford (JIRA) Sun, 12 Nov 2017 10:51:31 -0800

     [ 
https://issues.apache.org/jira/browse/MESOS-8209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jack Crawford updated MESOS-8209:
---------------------------------
    Description: 
Currently, the Mesos master does not revoke offers when the number of executors 
on an agent decreases. This is a problem under certain conditions, such as when 
running a workflow that starts lots of small tasks on agents, with a 
one-executor-per-task model, a master that does not revoke resources after a 
set amount of time, and a scheduler that does not reject resources.

I run a mono-scheduler framework (which you might want to do to easily enforce 
authentication requirements, have a full view of all scheduled tasks, etc.). To 
respond instantly when new tasks come in, I have the scheduler simply hang on 
to all resource offers it receives, and the master is set to never revoke 
offers. This way the scheduler always has a pool of resources to quickly 
service new requests as they come in.

However, if you start tasks fast enough, the agents can fill up with executors, 
making it appear as though there are no resources available for the scheduler 
to use. I've seen this on r4.4xlarge machines on AWS with executors that 
consume 0.1 cpus and 32 MB of memory, where the entire machine appears to be 
filled with executors according to the master's resource offers. The executors 
are exiting (just after the task finishes), but the resources are not reclaimed 
because the master does not revoke the outstanding resource offers to reflect 
the change.
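
As a rough back-of-the-envelope check of what "filled with executors" means 
here, the sketch below estimates how many such executors fit on one agent. It 
assumes the AWS-published r4.4xlarge size (16 vCPUs, 122 GiB of memory), that 
the agent advertises all of it, and that per-task resources are ignored; none 
of these assumptions come from the report, and the function name is 
illustrative only.

```python
# Assumed agent capacity (AWS-published r4.4xlarge size, not from the report).
AGENT_MILLICPUS = 16 * 1000      # 16 vCPUs, expressed in millicpus
AGENT_MEM_MB = 122 * 1024        # 122 GiB, expressed in MiB

# Per-executor footprint quoted in the report: 0.1 cpus, 32 MB.
EXECUTOR_MILLICPUS = 100
EXECUTOR_MEM_MB = 32


def max_executors(agent_millicpus: int = AGENT_MILLICPUS,
                  agent_mem_mb: int = AGENT_MEM_MB,
                  executor_millicpus: int = EXECUTOR_MILLICPUS,
                  executor_mem_mb: int = EXECUTOR_MEM_MB) -> int:
    """Return how many executors fit before CPU or memory runs out."""
    by_cpu = agent_millicpus // executor_millicpus
    by_mem = agent_mem_mb // executor_mem_mb
    # The scarcer resource decides when the agent looks "full".
    return min(by_cpu, by_mem)


if __name__ == "__main__":
    print(max_executors())
```

Under these assumptions CPU is the limiting resource, so the agent would look 
full once roughly 160 executors' worth of cpus is still accounted as in use.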

You can replicate this pretty easily if you schedule tasks that finish 
instantly with a 1-1 executor to task ratio. I find that if I schedule ~1000 
tasks this way on a single r4.4xlarge machine, usually 600-700 will finish 
before all the resource offers to the scheduler fill up and the agent appears 
to be "full" of executors.

Changing the scheduler/master to periodically reject/revoke resources fixes the 
problem.
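
The scheduler-side half of that workaround could look roughly like the sketch 
below: keep the pool of held offers, but hand back any offer older than a 
chosen age so the master can reissue it with current resources. This is a 
minimal sketch, not the scheduler from this report. `OfferPool`, its methods, 
and the `decline` callback are hypothetical names; the callback stands in for 
whatever call your scheduler library uses to decline an offer.

```python
import time
from typing import Callable, Dict, List, Tuple


class OfferPool:
    """Holds offers for fast task placement and recycles the stale ones."""

    def __init__(self,
                 max_age_seconds: float,
                 decline: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_age_seconds = max_age_seconds
        self.decline = decline  # hypothetical hook: declines one offer by id
        self.clock = clock      # injectable so the pool can be tested
        # offer id -> (agent id, time the offer was received)
        self._held: Dict[str, Tuple[str, float]] = {}

    def add(self, offer_id: str, agent_id: str) -> None:
        """Record a newly received offer."""
        self._held[offer_id] = (agent_id, self.clock())

    def remove(self, offer_id: str) -> None:
        """Forget an offer that was used for a launch or rescinded."""
        self._held.pop(offer_id, None)

    def held_ids(self) -> List[str]:
        """Return the ids of offers currently held, oldest first."""
        return list(self._held)

    def recycle_stale(self) -> List[str]:
        """Decline every offer held for at least max_age_seconds."""
        now = self.clock()
        stale = [offer_id
                 for offer_id, (_, received_at) in self._held.items()
                 if now - received_at >= self.max_age_seconds]
        for offer_id in stale:
            self.decline(offer_id)
            del self._held[offer_id]
        return stale
```

Calling `recycle_stale()` from a periodic timer keeps most of the pool 
available for instant placement while bounding how out of date any held offer 
can be.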

My suggestion is for the master to revoke and reissue resource offers when the 
executor count changes on an agent.



  was:
Currently, the Mesos master does not revoke offers when the number of executors 
on an agent decreases. This is a problem under certain conditions, such as when 
running a workflow that starts lots of small tasks on agents, with a 
one-executor-per-task model, a master that does not revoke resources after a 
set amount of time, and a scheduler that does not reject resources.

I run a mono-scheduler framework (which you might want to do to easily enforce 
authentication requirements, have a full view of all scheduled tasks, etc.). To 
respond instantly when new tasks come in, I have the scheduler simply hang on 
to all resource offers it receives, and the master is set to never revoke 
offers. This way the scheduler always has a pool of resources to quickly 
service new requests as they come in.

However, if you start tasks fast enough, the agents can fill up with executors, 
making it appear as though there are no resources available for the scheduler 
to use. I've seen this on r4.4xlarge machines on AWS with executors that 
consume 0.1 cpus and 32 MB of memory, where the entire machine appears to be 
filled with executors according to the master's resource offers. The executors 
are exiting (just after the task finishes), but the resources are not reclaimed 
because the master does not revoke the outstanding resource offers to reflect 
the change.

You can replicate this pretty easily if you schedule tasks that finish 
instantly with a 1-1 executor to task ratio. I find that if I schedule ~1000 
tasks this way on a single machine, usually 600-700 will finish before all the 
resource offers to the scheduler fill up and the agent appears to be "full" of 
executors.

Changing the scheduler/master to periodically reject/revoke resources fixes the 
problem.

My suggestion is for the master to revoke and reissue resource offers when the 
executor count changes on an agent.




> mesos master should revoke offers when executor state changes
> -------------------------------------------------------------
>
>                 Key: MESOS-8209
>                 URL: https://issues.apache.org/jira/browse/MESOS-8209
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Jack Crawford
>
> Currently, the Mesos master does not revoke offers when the number of 
> executors on an agent decreases. This is a problem under certain conditions, 
> such as when running a workflow that starts lots of small tasks on agents, 
> with a one-executor-per-task model, a master that does not revoke resources 
> after a set amount of time, and a scheduler that does not reject resources.
> I run a mono-scheduler framework (which you might want to do to easily 
> enforce authentication requirements, have a full view of all scheduled tasks, 
> etc.). To respond instantly when new tasks come in, I have the scheduler 
> simply hang on to all resource offers it receives, and the master is set to 
> never revoke offers. This way the scheduler always has a pool of resources to 
> quickly service new requests as they come in.
> However, if you start tasks fast enough, the agents can fill up with 
> executors, making it appear as though there are no resources available for 
> the scheduler to use. I've seen this on r4.4xlarge machines on AWS with 
> executors that consume 0.1 cpus and 32 MB of memory, where the entire machine 
> appears to be filled with executors according to the master's resource 
> offers. The executors are exiting (just after the task finishes), but the 
> resources are not reclaimed because the master does not revoke the 
> outstanding resource offers to reflect the change.
> You can replicate this pretty easily if you schedule tasks that finish 
> instantly with a 1-1 executor to task ratio. I find that if I schedule ~1000 
> tasks this way on a single r4.4xlarge machine, usually 600-700 will finish 
> before all the resource offers to the scheduler fill up and the agent appears 
> to be "full" of executors.
> Changing the scheduler/master to periodically reject/revoke resources fixes 
> the problem.
> My suggestion is for the master to revoke and reissue resource offers when 
> the executor count changes on an agent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
