[jira] [Commented] (MESOS-7639) Oversubscription could crash the master due to CHECK failure in the allocator

Zhitao Li (JIRA) Tue, 13 Jun 2017 11:36:10 -0700

    [ 
https://issues.apache.org/jira/browse/MESOS-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048218#comment-16048218
 ]


Zhitao Li commented on MESOS-7639:
----------------------------------

I've created a test in 
https://github.com/zhitaoli/mesos/tree/zhitao/1.1.2/drf_sorter_crash_test that 
reliably crashes the master with the matching condition. 

However, when I apply the same test to the current master branch (with minimal 
modification), it no longer crashes the master (the branch is at 
https://github.com/zhitaoli/mesos/tree/zhitao/public/revocable_drf_crash_test).

After some more log analysis, it seems that the change to 
{{HierarchicalAllocatorProcess::updateAllocation()}} in [r/55359 | 
https://reviews.apache.org/r/55359/diff/6#index_header] might have removed 
the crashing scenario: it now updates the {{frameworkSorter}} by 
{{offeredResources}} rather than {{frameworkAllocation}}. As a result, the 
{{frameworkSorter}} stays over-allocated during the race condition until 
{{Master::_accept}} calls {{allocator->recoverResources}} for the offer, at 
which point the over-allocation gets corrected.

[~bmahler] [~xujyan], can you please comment on whether my reading above is 
correct? If so, I suspect we no longer have a verified way to trigger a master 
crash due to over-allocation after the 1.2.0 release.

> Oversubscription could crash the master due to CHECK failure in the allocator
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7639
>                 URL: https://issues.apache.org/jira/browse/MESOS-7639
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Yan Xu
>
> As I described in MESOS-7566, the following scenario is possible when the 
> agent sends updated oversubscribed resources to the master:
> - The agent's {{UpdateSlaveMessage}} reduces the oversubscribed resources.
> - {{Master::updateSlave}} upon receiving the update would first call 
> {{HierarchicalAllocatorProcess::updateSlave}}, followed by 
> {{allocator->recoverResources}}.
> - {{HierarchicalAllocatorProcess::updateSlave}} would update 
> {{roleSorter.total_}} to reduce the total, so the total could go below the 
> allocation.
> - In the subsequent {{allocator->recoverResources}} call the attempt to 
> remove outstanding allocation may fail to reduce it to below the total 
> because some allocation may not be in outstanding offers. It could be in 
> offered resources pending between {{Master::accept}} and {{Master::_accept}}. 
> So the end result could still be {{total < allocation}}.
> - Then when {{Master::_accept}} is executed, it calls 
> {{allocator->updateAllocation}}, in which the {{total < allocation}} 
> condition could trigger the crash.
> The gist is that there are resources that are neither in the master's 
> {{offers}} nor tracked in the allocator when {{Master::updateSlave}} is called.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
