[jira] [Commented] (MESOS-7639) Oversubscription could crash the master due to CHECK failure in the allocator

Dmitriy Shirchenko (JIRA) Fri, 21 Jul 2017 18:08:28 -0700

    [ https://issues.apache.org/jira/browse/MESOS-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16097047#comment-16097047 ]


Dmitriy Shirchenko commented on MESOS-7639:
-------------------------------------------

A small update: we saw another instance of this crash. Since we run a 
patched version, I am including the log and the relevant code below.

{code}
F0721 21:43:29.141577  7454 master.cpp:9218] CHECK_SOME(resources): Invalid 
RESERVE Operation: cpus(*):24; mem(*):122880; ports(*):[31000-32000]; 
disk(*):849596; cpus(*)(allocated: aurora){REV}:12 does not contain 
ports(aurora, aurora, {instance_key: foo/foo/foo.foo/0})(allocated: 
aurora):[31139-31139, 31773-31773, 31827-31827]
{code}

The crash happened on the CHECK_SOME line:

{code}
void Slave::apply(const Offer::Operation& operation)
{
  Try<Resources> resources = totalResources.apply(operation);
  CHECK_SOME(resources);

  totalResources = resources.get();
  checkpointedResources = totalResources.filter(needCheckpointing);
}
{code}
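
To illustrate why the CHECK_SOME above aborts the process: the apply call returns an error when the current total does not contain the resources that the operation wants to reserve, and the CHECK turns that error into a crash. The following is a minimal, self-contained sketch of that pattern, not Mesos code. `Pool`, `PoolResult`, `applyReserve` and `AgentView` are hypothetical names, resources are reduced to named scalar quantities, and returning false to the caller is only one possible way to handle the error, not a fix confirmed in this ticket.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a resource collection: name -> scalar quantity.
typedef std::map<std::string, double> Pool;

// Minimal Try-like result: either an updated pool or an error message.
struct PoolResult
{
  bool ok;
  Pool value;
  std::string error;
};

// Returns true if `pool` holds at least the requested amount of every
// entry in `wanted`.
bool contains(const Pool& pool, const Pool& wanted)
{
  for (const auto& entry : wanted) {
    Pool::const_iterator found = pool.find(entry.first);
    if (found == pool.end() || found->second < entry.second) {
      return false;
    }
  }
  return true;
}

// Hypothetical analogue of applying a RESERVE operation: moves `source`
// from the unreserved entries to entries labelled with `role`, or returns
// an error if `pool` does not contain `source`.
PoolResult applyReserve(
    const Pool& pool, const Pool& source, const std::string& role)
{
  PoolResult result;
  if (!contains(pool, source)) {
    result.ok = false;
    result.error = "pool does not contain the resources to reserve";
    return result;
  }

  result.ok = true;
  result.value = pool;
  for (const auto& entry : source) {
    result.value[entry.first] -= entry.second;
    result.value[entry.first + "(" + role + ")"] += entry.second;
  }
  return result;
}

// Hypothetical analogue of the agent bookkeeping: instead of aborting on
// an error, report it and leave the total untouched.
struct AgentView
{
  Pool total;

  bool tryApply(const Pool& source, const std::string& role)
  {
    PoolResult result = applyReserve(total, source, role);
    if (!result.ok) {
      return false;
    }
    total = result.value;
    return true;
  }
};
```

In this sketch a reservation of resources that the total does not hold leaves `total` unchanged and returns false, whereas the CHECK_SOME in the snippet above terminates the process in the same situation.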

For context, a large job was being updated with RESERVE resources at the time. 
[~bmahler], please let me know what else I can provide. Sorry, this may not be 
enough for you to go on.

> Oversubscription could crash the master due to CHECK failure in the allocator
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7639
>                 URL: https://issues.apache.org/jira/browse/MESOS-7639
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Yan Xu
>
> As I described in MESOS-7566, the following scenario is possible when the 
> agent sends updated oversubscribed resources to the master:
> - The agent's {{UpdateSlaveMessage}} reduces the oversubscribed resources.
> - {{Master::updateSlave}} upon receiving the update would first call 
> {{HierarchicalAllocatorProcess::updateSlave}}, followed by 
> {{allocator->recoverResources}}.
> - {{HierarchicalAllocatorProcess::updateSlave}} would update 
> {{roleSorter.total_}} to reduce the total, so the total could go below the 
> allocation.
> - In the subsequent {{allocator->recoverResources}} call the attempt to 
> remove outstanding allocation may fail to reduce it to below the total 
> because some allocation may not be in outstanding offers. It could be in 
> offered resources pending between {{Master::accept}} and {{Master::_accept}}. 
> So the end result could still be {{total < allocation}}.
> - When {{Master::_accept}} is later executed, it will call 
> {{allocator->updateAllocation}}, in which the {{total < allocation}} 
> condition could trigger such a crash.
> The gist is that there are resources that are neither in the master's 
> {{offers}} nor tracked in the allocator when {{Master::updateSlave}} is called.
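
The sequence in the description can be sketched with a scalar model. This is only an illustration under simplifying assumptions, not the Mesos allocator: `AllocatorModel` and `MasterModel` are hypothetical names, resources are a single number, and "pending" stands for resources that have been accepted but are still in flight between {{Master::accept}} and {{Master::_accept}}.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical scalar model of one agent's oversubscribed resources
// as seen by the allocator.
struct AllocatorModel
{
  double total;      // what the agent currently reports
  double allocated;  // what the allocator believes is handed out

  // Analogue of the allocator's updateSlave: shrink (or grow) the total.
  void updateTotal(double newTotal) { total = newTotal; }

  // Analogue of recoverResources: give back part of the allocation.
  void recover(double amount) { allocated -= std::min(amount, allocated); }

  // The condition that the CHECK in the allocator is assumed to protect.
  bool invariantHolds() const { return allocated <= total; }
};

// Hypothetical model of the master's view of the same agent.
struct MasterModel
{
  AllocatorModel allocator;
  double outstandingOffers;  // offered and still visible in `offers`
  double pendingAccept;      // accepted, but `_accept` has not run yet

  // Analogue of Master::updateSlave: update the total first, then recover
  // only what is still in outstanding offers. Resources in `pendingAccept`
  // are invisible at this point and stay allocated.
  void updateSlave(double newTotal)
  {
    allocator.updateTotal(newTotal);
    allocator.recover(outstandingOffers);
    outstandingOffers = 0.0;
  }
};
```

With a total of 8, of which 4 are in outstanding offers and 4 are pending, shrinking the total to 2 recovers only the 4 offered units, so the model ends with `allocated == 4` and `total == 2`, which is the {{total < allocation}} state described above. With nothing pending, the same update leaves the invariant intact.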



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
