[ 
https://issues.apache.org/jira/browse/MESOS-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032442#comment-16032442
 ] 

Yan Xu commented on MESOS-7566:
-------------------------------

Certain scenarios do seem problematic to me, e.g.,

- The agent's {{UpdateSlaveMessage}} reduces the oversubscribed resources.
- {{Master::updateSlave}} upon receiving the update would first call 
{{HierarchicalAllocatorProcess::updateSlave}}, followed by 
{{allocator->recoverResources}}.
- {{HierarchicalAllocatorProcess::updateSlave}} would update 
{{roleSorter.total_}} to reduce the total, so the total could go below the 
allocation.
- In the subsequent {{allocator->recoverResources}} call, the attempt to remove 
the outstanding allocation may fail to bring it below the total, because some 
of the allocation may not be in outstanding offers. It could be in offered 
resources pending between {{Master::accept}} and {{Master::_accept}}. So the 
end result could still be {{total < allocation}}.
- Then, when {{Master::_accept}} is executed, it calls 
{{allocator->updateAllocation}}, in which the {{total < allocation}} condition 
could trigger this crash.
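
The sequence above can be sketched with a toy model. This is not Mesos code: {{SorterModel}}, {{updateTotal}}, {{recover}}, {{invariantHolds}} and {{replayScenario}} are hypothetical names, and a single scalar stands in for the real {{Resources}} type. It only illustrates how {{total < allocation}} can survive the recovery step when part of the allocation is in flight rather than in outstanding offers.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical, simplified model: one scalar (e.g. revocable CPUs)
// instead of Mesos' Resources type.
struct SorterModel {
  double total;       // stands in for roleSorter.total_
  double allocation;  // everything the sorter still counts as allocated
};

// Models the allocator's updateSlave step: the total shrinks immediately.
inline void updateTotal(SorterModel& s, double newTotal) {
  s.total = newTotal;
}

// Models recoverResources: only resources that are still in outstanding
// offers can be recovered here.
inline void recover(SorterModel& s, double outstandingOffers) {
  s.allocation -= std::min(outstandingOffers, s.allocation);
}

// The condition the failing check is assumed to rely on.
inline bool invariantHolds(const SorterModel& s) {
  return s.total >= s.allocation;
}

// Replays the scenario: `inFlight` is the part of the allocation that is
// pending between accept and _accept, so it is not recoverable.
inline bool replayScenario(double oldTotal, double newTotal,
                           double outstanding, double inFlight) {
  assert(oldTotal >= outstanding + inFlight);  // consistent starting state
  SorterModel s{oldTotal, outstanding + inFlight};
  updateTotal(s, newTotal);
  recover(s, outstanding);
  return invariantHolds(s);
}
```

With the numbers from the report used purely as an illustration, {{replayScenario(26, 25, 0, 26)}} returns false (the whole allocation is in flight, so nothing is recovered), while {{replayScenario(26, 25, 26, 0)}} returns true (everything was still in outstanding offers).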

The root issue indeed looks to be MESOS-4553.

/cc [~bmahler] [~mcypark]

> Master crash due to failed check in DRFSorter::remove
> -----------------------------------------------------
>
>                 Key: MESOS-7566
>                 URL: https://issues.apache.org/jira/browse/MESOS-7566
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.1.1, 1.1.2
>            Reporter: Zhitao Li
>            Priority: Critical
>
> A check in [sorter.cpp#L355 in 1.1.2 | 
> https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355]
>  is triggered occasionally in our cluster and crashes the master leader.
> I manually modified that check to print out the related variables, and the 
> following is a master log.
> https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt
> From the log, it seems like the check was using a stale value for revocable 
> CPU, {{26}}, while the value had already been updated to 25, so the check failed.
> So far, the two verified occurrences of this bug were both observed near an 
> {{UNRESERVE}} operation (see lines above in the log).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
