[ https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14083138#comment-14083138 ]
Benjamin Mahler commented on MESOS-1466:
----------------------------------------
Linking this as a potential blocker for MESOS-1654, because this overcommit
bug will be problematic in the presence of ephemeral port allocations.
A possible fix:
(1) Add an overcommit check in the slave when launching a task. If an
overcommit occurs, the slave rejects the task and triggers a
re-registration with the master.
(2) When the master reconciles a re-registering slave, add a check for
executors present on the slave but missing on the master (we currently don't
look for this).
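A rough sketch of the two checks above, in Python rather than the actual Mesos
C++ code. Every name here (Slave, launch_task, executors_unknown_to_master) is
hypothetical, and resources are reduced to a single CPU count for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class Slave:
    """Toy slave tracking CPUs only; all names are hypothetical."""
    total_cpus: float
    allocated_cpus: float = 0.0
    executors: set = field(default_factory=set)
    needs_reregistration: bool = False

    def launch_task(self, executor_id, task_cpus, executor_cpus):
        # A new executor costs resources in addition to the task itself.
        needed = task_cpus
        if executor_id not in self.executors:
            needed += executor_cpus
        # Step (1): reject the task on overcommit and flag re-registration.
        if self.allocated_cpus + needed > self.total_cpus:
            self.needs_reregistration = True
            return False
        self.executors.add(executor_id)
        self.allocated_cpus += needed
        return True


def executors_unknown_to_master(master_view, slave_view):
    # Step (2): executors the slave reports that the master has no record of.
    return set(slave_view) - set(master_view)
```

With a 4-CPU slave, launching a 2-CPU task with a new 1-CPU executor succeeds,
while a second launch needing 2 more CPUs is rejected and sets
needs_reregistration, so the master gets a chance to reconcile.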
> Race between executor exited event and launch task can cause overcommit of
> resources
> ------------------------------------------------------------------------------------
>
> Key: MESOS-1466
> URL: https://issues.apache.org/jira/browse/MESOS-1466
> Project: Mesos
> Issue Type: Bug
> Components: allocation, master
> Reporter: Vinod Kone
> Labels: reliability
>
> The following sequence of events can cause an overcommit:
> --> Launch task is called for a task whose executor is already running
> --> Executor's resources are not accounted for on the master
> --> Executor exits and the event is enqueued behind launch tasks on the master
> --> Master sends the task to the slave, which needs to commit resources
> for the task and the (new) executor.
> --> Master processes the executor exited event and re-offers the executor's
> resources, causing an overcommit of resources.
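The ordering described above can be replayed with a toy model. This is only an
illustration in Python, not Mesos code; the CPU figures and the function name
simulate_race are made up, and resources are reduced to a single CPU count:

```python
from collections import deque

TOTAL_CPUS = 4.0     # hypothetical slave capacity
EXECUTOR_CPUS = 1.0  # hypothetical executor footprint
TASK_CPUS = 3.0      # hypothetical task footprint


def simulate_race():
    # One executor is running and accounted for on both sides.
    master_used = EXECUTOR_CPUS
    slave_used = EXECUTOR_CPUS

    # The executor exits on the slave before the task arrives there.
    slave_used -= EXECUTOR_CPUS
    # The master's queue holds the launch ahead of the exit event.
    events = deque(["launch_task", "executor_exited"])

    while events:
        event = events.popleft()
        if event == "launch_task":
            # Master believes the executor is running: charge the task only.
            master_used += TASK_CPUS
            # Slave must start a new executor: charge task plus executor.
            slave_used += TASK_CPUS + EXECUTOR_CPUS
        elif event == "executor_exited":
            # Master frees the old executor's resources and re-offers them.
            master_used -= EXECUTOR_CPUS

    master_free = TOTAL_CPUS - master_used
    slave_free = TOTAL_CPUS - slave_used
    return master_free, slave_free
```

In this model the master ends up believing one CPU is free while the slave has
none left, which is the overcommit: the master re-offers resources that the new
executor is already using.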
--
This message was sent by Atlassian JIRA
(v6.2#6252)