[
https://issues.apache.org/jira/browse/MESOS-8850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609692#comment-16609692
]
Meng Zhu commented on MESOS-8850:
---------------------------------
While the sequence of events in the description does indicate that something is
wrong, how it could lead to the check failure mentioned in MESOS-8778 warrants
further investigation. One strange thing about MESOS-8778 is how the check could
fail at
https://github.com/apache/mesos/blob/16f70dbee7008cbc06455d901ffbba3e95591b48/src/master/allocator/mesos/hierarchical.cpp#L1125
but not at
https://github.com/apache/mesos/blob/16f70dbee7008cbc06455d901ffbba3e95591b48/src/master/allocator/mesos/hierarchical.cpp#L2057
> Race between master and allocator when destroying shared volume could lead to
> sorter check failure.
> ---------------------------------------------------------------------------------------------------
>
> Key: MESOS-8850
> URL: https://issues.apache.org/jira/browse/MESOS-8850
> Project: Mesos
> Issue Type: Bug
> Components: allocation, master
> Reporter: Meng Zhu
> Priority: Major
>
> When destroying a shared volume, the master first rescinds offers that contain
> the shared volume and then applies the destroy operation. This process involves
> interaction between the master and allocator actors. The following race could
> arise:
> 1. Framework1 and framework2 are each offered a shared disk;
> 2. Framework2 asks the master to destroy the shared disk;
> 3. Master rescinds framework1's offer that contains the shared disk;
> 4. `allocator->recoverResources` is called to recover framework1’s offered
> resources in the allocator;
> 5. [Race] Allocator shortly afterwards allocates resources to framework1. The
> allocation contains the shared disk that was just recovered and has not yet
> been destroyed. Allocator invokes `offerCallback`, which dispatches to the
> master;
> 6. Master continues the destroy operation and calls
> `allocator->updateAllocation` to notify the allocator to transform the shared
> disk into a regular reserved disk;
> 7. Master processes the `offerCallback` dispatched in step 5 and offers the
> shared disk to framework1.
> At this point, the same disk resource appears in two different places: one
> shared, offered to framework1, and one not shared, currently held by framework2
> (soon to be recovered).
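
The interleaving in steps 1-7 can be replayed deterministically with a toy
model. The sketch below is not Mesos code: `Disk`, `Outcome`, `simulateRace`
and the two maps are hypothetical names, the master actor is reduced to a FIFO
queue of callbacks, and the allocator to a map of allocations. It only shows
how a callback queued before `updateAllocation` can deliver a stale shared disk
afterwards.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <map>
#include <string>

// Hypothetical, simplified stand-in for a disk resource.
struct Disk {
  std::string id;
  bool shared;
};

struct Outcome {
  bool framework1OfferedShared;   // master's view of framework1's offer
  bool framework2HoldsNonShared;  // allocator's view of framework2's disk
};

// Replays steps 1-7 of the description in a single thread.
Outcome simulateRace() {
  // Messages dispatched to the (modelled) master, handled in FIFO order.
  std::deque<std::function<void()>> masterQueue;

  std::map<std::string, Disk> allocated;  // allocator: framework -> disk
  std::map<std::string, Disk> offered;    // master: framework -> offer

  const Disk sharedDisk{"disk1", true};

  // Step 1: both frameworks are offered the shared disk.
  allocated["framework1"] = sharedDisk;
  allocated["framework2"] = sharedDisk;
  offered["framework1"] = sharedDisk;
  offered["framework2"] = sharedDisk;

  // Step 2: framework2 asks to destroy the disk; in this model the request
  // uses up framework2's offer while the allocation stays with framework2.
  offered.erase("framework2");

  // Step 3: master rescinds framework1's offer.
  offered.erase("framework1");

  // Step 4: the allocator recovers framework1's resources.
  allocated.erase("framework1");

  // Step 5 [race]: the allocator re-allocates the still-shared disk to
  // framework1 and dispatches an offer callback to the master.
  allocated["framework1"] = sharedDisk;
  masterQueue.push_back([&offered, sharedDisk]() {
    offered["framework1"] = sharedDisk;
  });

  // Step 6: master continues the destroy; the allocator is told to turn
  // framework2's shared disk into a regular (non-shared) one.
  allocated["framework2"].shared = false;

  // Step 7: master finally handles the callback queued in step 5.
  while (!masterQueue.empty()) {
    masterQueue.front()();
    masterQueue.pop_front();
  }
  assert(masterQueue.empty());

  return Outcome{offered.at("framework1").shared,
                 !allocated.at("framework2").shared};
}
```

Under these assumptions both fields of the returned `Outcome` are true: the
same disk is simultaneously offered as shared to framework1 and held as
non-shared by framework2, which is the inconsistent state described above.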