[
https://issues.apache.org/jira/browse/MESOS-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16097047#comment-16097047
]
Dmitriy Shirchenko commented on MESOS-7639:
-------------------------------------------
A small update: we saw another instance of this crash. Since we are running a
patched version, I am including the relevant code below along with the log.
{code}
F0721 21:43:29.141577 7454 master.cpp:9218] CHECK_SOME(resources): Invalid
RESERVE Operation: cpus(*):24; mem(*):122880; ports(*):[31000-32000];
disk(*):849596; cpus(*)(allocated: aurora){REV}:12 does not contain
ports(aurora, aurora, {instance_key: foo/foo/foo.foo/0})(allocated:
aurora):[31139-31139, 31773-31773, 31827-31827]
{code}
The crash happened on the CHECK_SOME line.
{code}
void Slave::apply(const Offer::Operation& operation)
{
  Try<Resources> resources = totalResources.apply(operation);
  CHECK_SOME(resources);
  totalResources = resources.get();
  checkpointedResources = totalResources.filter(needCheckpointing);
}
{code}
For context, a large job was being updated with RESERVE resources at the time.
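To make the failure mode in the snippet above concrete, here is a minimal sketch of a RESERVE applied to a pool that does not contain the requested resources. It is not the Mesos implementation: {{Pool}}, {{Result}}, {{contains}} and {{applyReserve}} are hypothetical names, and resources are reduced to plain name/amount pairs. The point is only that the failed containment check comes back as an error value, which the caller could handle instead of aborting on it.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified resource pool: resource name -> scalar amount.
typedef std::map<std::string, double> Pool;

// Hypothetical stand-in for a Try-like result: either a pool or an error.
struct Result {
  bool ok;
  Pool value;
  std::string error;
};

// True if `pool` holds at least `amount` of the resource `name`.
bool contains(const Pool& pool, const std::string& name, double amount) {
  Pool::const_iterator it = pool.find(name);
  return it != pool.end() && it->second >= amount;
}

// Models a RESERVE: moves `amount` of `name` from the unreserved entry to an
// entry reserved for `role`. Returns an error result instead of aborting when
// the pool does not contain the requested resources.
Result applyReserve(const Pool& pool, const std::string& name,
                    const std::string& role, double amount) {
  assert(amount >= 0.0);

  Result result;
  if (!contains(pool, name, amount)) {
    result.ok = false;
    result.error = "Invalid RESERVE: pool does not contain " + name;
    return result;
  }

  result.ok = true;
  result.value = pool;
  result.value[name] -= amount;
  result.value[name + "(" + role + ")"] += amount;
  return result;
}
```

In this model, reserving 12 of 24 {{cpus}} succeeds, while reserving {{ports}} from a pool that has no matching {{ports}} entry yields {{ok == false}} with an error message. Whether the master should reject such an operation or treat it as a fatal invariant violation is a separate design question.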
[~bmahler] please let me know what else I can provide. Sorry, this may not be
enough for you to go on.
> Oversubscription could crash the master due to CHECK failure in the allocator
> -----------------------------------------------------------------------------
>
> Key: MESOS-7639
> URL: https://issues.apache.org/jira/browse/MESOS-7639
> Project: Mesos
> Issue Type: Bug
> Reporter: Yan Xu
>
> As I described in MESOS-7566, the following scenario is possible when the
> agent sends updated oversubscribed resources to the master:
> - The agent's {{UpdateSlaveMessage}} reduces the oversubscribed resources.
> - {{Master::updateSlave}} upon receiving the update would first call
> {{HierarchicalAllocatorProcess::updateSlave}}, followed by
> {{allocator->recoverResources}}.
> - {{HierarchicalAllocatorProcess::updateSlave}} would update
> {{roleSorter.total_}} to reduce the total, so the total could go below the
> allocation.
> - In the subsequent {{allocator->recoverResources}} call the attempt to
> remove outstanding allocation may fail to reduce it to below the total
> because some allocation may not be in outstanding offers. It could be in
> offered resources pending between {{Master::accept}} and {{Master::_accept}}.
> So the end result could still be {{total < allocation}}.
> - Then, when {{Master::_accept}} is executed, it calls
> {{allocator->updateAllocation}}, in which the {{total < allocation}}
> condition could trigger the crash.
> The gist is that there are resources that are neither in the master's
> {{offers}} nor tracked in the allocator when {{Master::updateSlave}} is called.
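
The sequence in the quoted description can be sketched with plain numbers. This is a toy model under stated assumptions, not the allocator's code: {{AgentAccounting}}, {{updateTotal}}, {{recover}} and {{invariantHolds}} are hypothetical names, and resources are collapsed into a single scalar. It shows how recovering only what sits in outstanding offers can leave the allocation above a reduced total.

```cpp
#include <cassert>

// Hypothetical, simplified bookkeeping for a single agent.
struct AgentAccounting {
  double total;      // what the allocator believes the agent has
  double allocated;  // what is currently charged to frameworks
};

// Models the update step: the agent reports a smaller total after its
// oversubscribed resources were reduced.
void updateTotal(AgentAccounting& a, double newTotal) {
  a.total = newTotal;
}

// Models the recovery step: only resources still sitting in outstanding
// offers can be returned; resources in an accept that is already in flight
// are not visible here.
void recover(AgentAccounting& a, double outstandingOffers) {
  assert(outstandingOffers <= a.allocated);
  a.allocated -= outstandingOffers;
}

// The invariant that a later allocation update relies on.
bool invariantHolds(const AgentAccounting& a) {
  return a.allocated <= a.total;
}
```

For example, start with a total of 10 and an allocation of 10, of which 4 are in outstanding offers and 6 belong to an accept that is still in flight. Calling {{updateTotal(a, 5)}} and then {{recover(a, 4)}} leaves an allocation of 6 against a total of 5, so {{invariantHolds(a)}} is false by the time the in-flight accept is processed.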
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)