[ 
https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14083138#comment-14083138
 ] 

Benjamin Mahler commented on MESOS-1466:
----------------------------------------

Linking this as a potential blocker for MESOS-1654, since this overcommit bug 
will be problematic in the presence of ephemeral port allocations. 

A possible fix:

(1) Add an overcommit check in the slave when launching a task. If an 
overcommit is detected, the slave rejects the task and triggers a 
re-registration with the master.

(2) When the master reconciles a re-registering slave, add a check for 
executors present on the slave but missing on the master (we currently don't 
look for this).
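
To make the two steps concrete, here is a minimal sketch in Python rather 
than Mesos's C++. Everything in it is hypothetical: the Slave class, the 
launch() and unknown_executors() names, and the cpus-only accounting are 
assumptions made for illustration and are not the actual slave or master code.

```python
from dataclasses import dataclass, field


@dataclass
class Slave:
    """Hypothetical model of a slave; not Mesos's actual data structures."""
    total_cpus: float
    # executor id -> cpus held by that executor and its tasks
    executors: dict = field(default_factory=dict)

    def used_cpus(self):
        return sum(self.executors.values())

    def launch(self, executor_id, cpus):
        """Step (1): reject the launch if it would overcommit the slave.

        Returns True if the launch was accepted. On False, the caller is
        expected to re-register with the master so both sides reconcile.
        """
        if self.used_cpus() + cpus > self.total_cpus:
            return False
        self.executors[executor_id] = self.executors.get(executor_id, 0.0) + cpus
        return True


def unknown_executors(slave_executors, master_executors):
    """Step (2): executors the slave reports that the master does not know."""
    return sorted(set(slave_executors) - set(master_executors))
```

With a 4-cpu slave, launch("e1", 3.0) is accepted and a following 
launch("e2", 2.0) is rejected. If the master only knows about some other 
executor "e0", unknown_executors(slave.executors, {"e0"}) reports "e1" as 
present on the slave but missing on the master.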

> Race between executor exited event and launch task can cause overcommit of 
> resources
> ------------------------------------------------------------------------------------
>
>                 Key: MESOS-1466
>                 URL: https://issues.apache.org/jira/browse/MESOS-1466
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>            Reporter: Vinod Kone
>              Labels: reliability
>
> The following sequence of events can cause an overcommit
> --> Launch task is called for a task whose executor is already running
> --> Executor's resources are not accounted for on the master
> --> Executor exits and the event is enqueued behind launch tasks on the master
> --> Master sends the task to the slave, which needs to commit resources for 
> the task and the (new) executor.
> --> Master processes the executor exited event and re-offers the executor's 
> resources causing an overcommit of resources.
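
The sequence above can be traced with a toy simulation. This is only a 
sketch under invented assumptions: a 4-cpu slave, an executor that costs 
1 cpu, a task that costs 3 cpus, and a master that processes events strictly 
in queue order. None of the names or numbers come from Mesos itself.

```python
from collections import deque

# Invented numbers for illustration only.
TOTAL_CPUS = 4.0
EXECUTOR_CPUS = 1.0
TASK_CPUS = 3.0

# Master's view: the executor is running and already accounted for.
master_used = EXECUTOR_CPUS
# Slave's view: the executor has just exited, so nothing is in use.
slave_used = 0.0

# The exited event is enqueued behind the launch on the master.
master_queue = deque(["launch_task", "executor_exited"])

while master_queue:
    event = master_queue.popleft()
    if event == "launch_task":
        # Master charges only the task; it believes the executor still runs.
        master_used += TASK_CPUS
        # Slave must start a new executor in addition to the task.
        slave_used += EXECUTOR_CPUS + TASK_CPUS
    elif event == "executor_exited":
        # Master recovers the old executor's resources and re-offers them.
        master_used -= EXECUTOR_CPUS

master_offerable = TOTAL_CPUS - master_used  # what the master will re-offer
slave_free = TOTAL_CPUS - slave_used         # what the slave really has free
```

At the end of the run the master believes it can offer the executor's 1 cpu 
again while the slave has nothing free, which is the overcommit described in 
the issue.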



--
This message was sent by Atlassian JIRA
(v6.2#6252)
