[ https://issues.apache.org/jira/browse/MESOS-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13783531#comment-13783531 ]
Benjamin Mahler commented on MESOS-711:
---------------------------------------
https://reviews.apache.org/r/14435/
https://reviews.apache.org/r/14436/
> Master::reconcile incorrectly recovers resources from reconciled tasks.
> -----------------------------------------------------------------------
>
> Key: MESOS-711
> URL: https://issues.apache.org/jira/browse/MESOS-711
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.14.0
> Reporter: Benjamin Mahler
> Assignee: Benjamin Mahler
> Priority: Critical
> Fix For: 0.15.0, 0.14.1
>
>
> The following sequence of events will over-subscribe a slave in the allocator:
> --> Slave re-registers with the same master due to a slave restart. Tasks
> were running on the slave, but are lost in the process of the slave
> restarting.
> --> As a result, the slave includes no task / executor information in its
> re-registration message.
> --> The slave is added back to the allocator with its full resources, in
> Master::reregisterSlave():
>   // If this is a disconnected slave, add it back to the allocator.
>   if (slave->disconnected) {
>     slave->disconnected = false; // Reset the flag.
>     hashmap<FrameworkID, Resources> resources;
>     foreach (const ExecutorInfo& executorInfo, executorInfos) {
>       resources[executorInfo.framework_id()] += executorInfo.resources();
>     }
>     foreach (const Task& task, tasks) {
>       // Ignore tasks that have reached terminal state.
>       if (!protobuf::isTerminalState(task.state())) {
>         resources[task.framework_id()] += task.resources();
>       }
>     }
>     allocator->slaveAdded(slaveId, slaveInfo, resources);
>   }
> --> Now reconciliation occurs, and the master sends a TASK_LOST update for
> each task missing from the slave through Master::statusUpdate, which results
> in a call to Allocator::resourcesRecovered!
> --> Reconciliation also calls Allocator::resourcesRecovered for the unknown
> executors.
> --> These two bugs result in the allocator offering more resources than the
> slave contains.
> We can either change the re-registration code, or change the reconciliation
> code. The easiest fix here is to add the slave back taking into account the
> used resources from the slave *and the master's* information.
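
The accounting problem and the proposed fix can be illustrated with a toy
model. This is only a sketch under stated assumptions, not the Mesos code:
ToyAllocator, TaskCpus, usedCpus and replay are hypothetical names, resources
are reduced to a single CPU count, and the numbers (an 8 CPU slave, two lost
tasks using 2 and 3 CPUs) are made up for the example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the allocator; tracks scalar CPUs for one slave.
struct ToyAllocator {
  double total = 0;      // CPUs the slave actually has.
  double available = 0;  // CPUs the allocator believes it may offer.

  // Plays the role of Allocator::slaveAdded: 'used' is what is already in use.
  void slaveAdded(double slaveTotal, double used) {
    assert(used <= slaveTotal);
    total = slaveTotal;
    available = slaveTotal - used;
  }

  // Plays the role of Allocator::resourcesRecovered.
  void resourcesRecovered(double cpus) { available += cpus; }
};

// CPUs in use, keyed by task ID (assumed representation).
using TaskCpus = std::map<std::string, double>;

// Used CPUs according to the slave *and* the master. Keying by task ID
// means a task known to both sides is counted only once.
double usedCpus(const TaskCpus& fromSlave, const TaskCpus& fromMaster) {
  TaskCpus merged = fromMaster;
  merged.insert(fromSlave.begin(), fromSlave.end());  // Existing keys are kept.
  double sum = 0;
  for (const auto& entry : merged) {
    sum += entry.second;
  }
  return sum;
}

// Replays re-registration followed by reconciliation and returns the CPUs
// the allocator ends up believing are available.
double replay(bool includeMasterView) {
  const double slaveTotal = 8;
  const TaskCpus fromSlave = {};  // The slave lost its tasks on restart.
  const TaskCpus fromMaster = {{"task-1", 2}, {"task-2", 3}};

  ToyAllocator allocator;
  allocator.slaveAdded(
      slaveTotal,
      includeMasterView ? usedCpus(fromSlave, fromMaster)
                        : usedCpus(fromSlave, {}));

  // Reconciliation: every task the master knows about but the slave did not
  // report is marked lost, and its resources are recovered.
  for (const auto& entry : fromMaster) {
    if (fromSlave.count(entry.first) == 0) {
      allocator.resourcesRecovered(entry.second);
    }
  }
  return allocator.available;
}
```

With these assumed numbers, replay(false) leaves the allocator believing 13
CPUs are available on an 8 CPU slave, while replay(true) ends at exactly 8,
because the recovered resources were first counted as used.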