[
https://issues.apache.org/jira/browse/MESOS-6317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benjamin Mahler updated MESOS-6317:
-----------------------------------
Summary: Race in master/allocator when updating oversubscribed resources of
an agent. (was: Race in master update slave.)
> Race in master/allocator when updating oversubscribed resources of an agent.
> ----------------------------------------------------------------------------
>
> Key: MESOS-6317
> URL: https://issues.apache.org/jira/browse/MESOS-6317
> Project: Mesos
> Issue Type: Bug
> Reporter: Guangya Liu
> Assignee: Guangya Liu
> Fix For: 1.1.0
>
>
> Currently, {{updateSlave}} in the master first rescinds offers and then
> calls updateSlave in the allocator. This is racy: a batch allocation may be
> inserted between the two steps. In that case, the order will be
> rescind offer -> batch allocation -> update slave. This order causes
> problems when the oversubscribed resources are decreased.
> Suppose the oversubscribed resources are decreased from 2 to 1. After
> rescind offer finishes, the batch allocation allocates the old 2
> oversubscribed resources again, and only then does update slave set the
> total oversubscribed resources to 1. The agent host is then overcommitted
> for some time, because the tasks can still use 2 oversubscribed resources
> rather than 1. Once the tasks using the 2 oversubscribed resources finish,
> everything returns to normal.
> So we should swap the order of rescind offer and updateSlave in the master
> to avoid resource overcommit.
> If we update the slave first and then rescind offers, the order will be
> update slave -> batch allocation -> rescind offer. This order causes no
> problem when decreasing resources. Suppose the oversubscribed resources are
> decreased from 2 to 1. Update slave sets the total oversubscribed resources
> to 1 directly, the batch allocation then allocates no oversubscribed
> resources because more are already allocated than the new total, and
> finally rescind offer rescinds all offers using oversubscribed resources.
> This does not leave the agent host overcommitted.
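
The two orderings described above can be replayed with a small toy model. This is only an illustrative sketch, not Mesos code: the class ToyAllocator, its methods, and the run helper are hypothetical names, and the model assumes a single agent, tracks only one "offered" counter for oversubscribed resources, and treats each step as atomic.

```python
from dataclasses import dataclass


@dataclass
class ToyAllocator:
    """Hypothetical stand-in for the allocator's view of one agent."""
    total: int        # total oversubscribed resources known to the allocator
    offered: int = 0  # oversubscribed resources currently handed out

    def update_slave(self, new_total: int) -> None:
        # Record the agent's new oversubscribed total.
        self.total = new_total

    def rescind_offers(self) -> None:
        # Take back everything that is currently handed out.
        self.offered = 0

    def batch_allocate(self) -> None:
        # Hand out whatever is still free; nothing if already at or over total.
        self.offered += max(self.total - self.offered, 0)


def run(order, old_total=2, new_total=1):
    """Replay the three steps in the given order; return (offered, total)."""
    alloc = ToyAllocator(total=old_total, offered=old_total)
    steps = {
        "rescind": alloc.rescind_offers,
        "allocate": alloc.batch_allocate,
        "update": lambda: alloc.update_slave(new_total),
    }
    for step in order:
        steps[step]()
    return alloc.offered, alloc.total


if __name__ == "__main__":
    # Current order: the batch allocation still sees the old total of 2.
    print(run(["rescind", "allocate", "update"]))
    # Proposed order: the batch allocation sees the new total of 1.
    print(run(["update", "allocate", "rescind"]))
```

In this model the current order ends with 2 resources handed out against a total of 1, while the proposed order ends with nothing handed out against a total of 1.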
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)