[
https://issues.apache.org/jira/browse/MESOS-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benjamin Bannier updated MESOS-8524:
------------------------------------
Comment: was deleted
(was: Review: https://reviews.apache.org/r/65506/)
> When `UPDATE_SLAVE` messages are received, offers might not be rescinded due
> to a race
> ---------------------------------------------------------------------------------------
>
> Key: MESOS-8524
> URL: https://issues.apache.org/jira/browse/MESOS-8524
> Project: Mesos
> Issue Type: Bug
> Components: allocation, master
> Affects Versions: 1.5.0
> Environment: Master + Agent running with enabled
> {{RESOURCE_PROVIDER}} capability
> Reporter: Jan Schlicht
> Assignee: Benjamin Bannier
> Priority: Major
> Labels: mesosphere
>
> When an agent with the {{RESOURCE_PROVIDER}} capability enabled (re-)registers
> with the master, it sends an {{UPDATE_SLAVE}} message after being
> (re-)registered. In the master, the agent is added (back) to the allocator as
> soon as it's (re-)registered, i.e., before {{UPDATE_SLAVE}} is sent. This
> triggers an allocation, and offers might get sent out to frameworks. When
> {{UPDATE_SLAVE}} is handled in the master, these offers have to be rescinded,
> as they're based on an outdated agent state.
> Internally, the allocator defers an offer callback in the master
> ({{Master::offer}}). In rare cases, an {{UPDATE_SLAVE}} message might arrive at
> the same time, and its handler in the master is called before the offer
> callback (but after the actual allocation took place). In this case the
> (outdated) offer is still sent to frameworks and never rescinded.
> Here are the relevant log lines; this was discovered while working on
> https://reviews.apache.org/r/65045/:
> {noformat}
> I0201 14:17:47.041093 242208768 hierarchical.cpp:1517] Performed allocation
> for 1 agents in 704915ns
> I0201 14:17:47.041738 242745344 master.cpp:7235] Received update of agent
> 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 at slave(540)@172.18.8.20:60469
> (172.18.8.20) with total oversubscribed resources {}
> I0201 14:17:47.042778 242745344 master.cpp:8808] Sending 1 offers to
> framework 53c557e7-3161-449b-bacc-a4f8c02e78e7-0000 (default) at
> [email protected]:60469
> I0201 14:17:47.043102 243281920 sched.cpp:921] Scheduler::resourceOffers took
> 40444ns
> I0201 14:17:47.043427 243818496 hierarchical.cpp:712] Grew agent
> 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 by disk[MOUNT]:200 (total), { }
> (used)
> I0201 14:17:47.043643 243818496 hierarchical.cpp:669] Agent
> 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 (172.18.8.20) updated with total
> resources disk[MOUNT]:200; cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> {noformat}
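The ordering problem described in the issue can be illustrated with a small, deterministic toy model. This is only a sketch under stated assumptions: `ToyMaster`, `Offer`, `run`, and the `agent_version` counter are hypothetical names invented for illustration and are not Mesos or libprocess APIs. The model assumes the allocation has already been performed against the old agent state, and that the master handles the deferred offer callback and the `UPDATE_SLAVE` message one at a time in queue order.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Offer:
    agent_id: str
    agent_version: int  # agent state the allocation was based on (hypothetical)


@dataclass
class ToyMaster:
    """Hypothetical stand-in for the master; not the real Mesos class."""
    agent_version: int = 0
    outstanding: list = field(default_factory=list)
    rescinded: list = field(default_factory=list)

    def offer(self, offer):
        # Models the deferred offer callback: the offer goes out to frameworks.
        self.outstanding.append(offer)

    def update_slave(self):
        # Models the UPDATE_SLAVE handler: rescind whatever is outstanding
        # right now, then record the new agent state.
        self.rescinded.extend(self.outstanding)
        self.outstanding.clear()
        self.agent_version += 1

    def stale_offers(self):
        # Outstanding offers computed against an older agent state.
        return [o for o in self.outstanding
                if o.agent_version != self.agent_version]


def run(order):
    """Process the two master events in the given order."""
    master = ToyMaster()
    # The allocation already happened against agent state version 0.
    pending = Offer("S0", master.agent_version)
    handlers = {
        "offer": lambda: master.offer(pending),
        "update_slave": master.update_slave,
    }
    queue = deque(order)
    while queue:
        handlers[queue.popleft()]()
    return master


if __name__ == "__main__":
    for order in (["offer", "update_slave"], ["update_slave", "offer"]):
        m = run(order)
        print(order, "stale:", len(m.stale_offers()),
              "rescinded:", len(m.rescinded))
```

In the first ordering the offer is outstanding when the update is handled, so it gets rescinded. In the second ordering, which corresponds to the log excerpt above, the update handler finds nothing to rescind and the offer based on the old agent state is sent afterwards and stays outstanding.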
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)