[ 
https://issues.apache.org/jira/browse/MESOS-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-2919:
--------------------------
    Story Points: 3

> Framework can overcommit oversubscribable resources during master failover.
> ---------------------------------------------------------------------------
>
>                 Key: MESOS-2919
>                 URL: https://issues.apache.org/jira/browse/MESOS-2919
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Jie Yu
>            Assignee: Jie Yu
>            Priority: Critical
>              Labels: twitter
>
> This is due to a bug in the hierarchical allocator. Here is the sequence of
> events:
> 1) The slave uses a fixed resource estimator, which advertises 4 revocable cpus.
> 2) Framework A launches a task that uses all 4 revocable cpus.
> 3) The master fails over.
> 4) The slave re-registers with the new master and sends an UpdateSlaveMessage
> with 4 revocable cpus as oversubscribed resources.
> 5) Framework A hasn't registered with the new master yet, so the allocator
> computes the slave's available resources as 4 revocable cpus.
> 6) Framework A registers and receives an additional 4 revocable cpus, so
> it can launch another task with 4 revocable cpus (8 in total, on a slave
> that advertises only 4).
> The problem is the way the allocator calculates the 'allocated' resources in
> 'updateSlave'. If the framework is not registered, the 'allocation' computed
> below is not accurate, because 'addSlave' records allocations only for
> registered frameworks (see the 'if (frameworks.contains(frameworkId))' block
> in 'addSlave').
> {code}
> template <class RoleSorter, class FrameworkSorter>
> void
> HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::updateSlave(
>     const SlaveID& slaveId,
>     const Resources& oversubscribed)
> {
>   CHECK(initialized);
>   CHECK(slaves.contains(slaveId));
>   // Check that all the oversubscribed resources are revocable.
>   CHECK_EQ(oversubscribed, oversubscribed.revocable());
>   // Update the total resources.
>   // First remove the old oversubscribed resources from the total.
>   slaves[slaveId].total -= slaves[slaveId].total.revocable();
>   // Now add the new estimate of oversubscribed resources.
>   slaves[slaveId].total += oversubscribed;
>   // Now, update the total resources in the role sorter.
>   roleSorter->update(
>       slaveId,
>       slaves[slaveId].total.unreserved());
>   // Calculate the current allocation of oversubscribed resources.
>   Resources allocation;
>   foreachkey (const std::string& role, roles) {
>     allocation += roleSorter->allocation(role, slaveId).revocable();
>   }
>   // Update the available resources.
>   // First remove the old oversubscribed resources from available.
>   slaves[slaveId].available -= slaves[slaveId].available.revocable();
>   // Now add the new estimate of available oversubscribed resources.
>   slaves[slaveId].available += oversubscribed - allocation;
>   LOG(INFO) << "Slave " << slaveId << " (" << slaves[slaveId].hostname
>             << ") updated with oversubscribed resources " << oversubscribed
>             << " (total: " << slaves[slaveId].total
>             << ", available: " << slaves[slaveId].available << ")";
>   allocate(slaveId);
> }
> template <class RoleSorter, class FrameworkSorter>
> void
> HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::addSlave(
>     const SlaveID& slaveId,
>     const SlaveInfo& slaveInfo,
>     const Resources& total,
>     const hashmap<FrameworkID, Resources>& used)
> {
>   CHECK(initialized);
>   CHECK(!slaves.contains(slaveId));
>   roleSorter->add(slaveId, total.unreserved());
>   foreachpair (const FrameworkID& frameworkId,
>                const Resources& allocated,
>                used) {
>     if (frameworks.contains(frameworkId)) {
>       const std::string& role = frameworks[frameworkId].role;
>       // TODO(bmahler): Validate that the reserved resources have the
>       // framework's role.
>       roleSorter->allocated(role, slaveId, allocated.unreserved());
>       frameworkSorters[role]->add(slaveId, allocated);
>       frameworkSorters[role]->allocated(
>           frameworkId.value(), slaveId, allocated);
>     }
>   }
>   ...
> }
> {code}
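
The arithmetic behind steps 3-6 can be replayed with a toy model. This is only a sketch under stated assumptions: `ToyAllocator` and `replayFailover` are hypothetical names, not Mesos classes, and resources are reduced to a single count of revocable cpus on one slave. It mirrors just two details of the quoted code: the `frameworks.contains(frameworkId)` guard in `addSlave` and the `oversubscribed - allocation` computation in `updateSlave`.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical toy model (not Mesos code): tracks only revocable cpus
// on a single slave.
struct ToyAllocator {
  std::set<std::string> registered;               // frameworks known to the allocator
  std::map<std::string, double> sorterAllocation; // allocation recorded per framework
  double available = 0.0;                         // revocable cpus still on offer

  // Mirrors the guard in addSlave: usage reported by the slave is
  // recorded only for frameworks that are already registered.
  void addSlave(const std::map<std::string, double>& used) {
    for (const auto& entry : used) {
      if (registered.count(entry.first) > 0) {
        sorterAllocation[entry.first] += entry.second;
      }
    }
  }

  // Mirrors updateSlave: available = oversubscribed - recorded allocation.
  void updateSlave(double oversubscribed) {
    assert(oversubscribed >= 0.0);
    double allocation = 0.0;
    for (const auto& entry : sorterAllocation) {
      allocation += entry.second;
    }
    available = oversubscribed - allocation;
  }
};

// Replays the failover: framework "A" already uses 4 revocable cpus when
// the slave re-registers and reports 4 oversubscribed cpus. Returns the
// revocable cpus the allocator believes it can still offer.
inline double replayFailover(bool frameworkRegisteredFirst) {
  ToyAllocator allocator;
  if (frameworkRegisteredFirst) {
    allocator.registered.insert("A");
  }
  allocator.addSlave({{"A", 4.0}});
  allocator.updateSlave(4.0);
  return allocator.available;
}
```

In this model, `replayFailover(false)` leaves 4 revocable cpus on offer even though framework A is already using 4, which is the overcommit described above. `replayFailover(true)` leaves 0, because the usage was recorded before `updateSlave` ran.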



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
