[jira] [Commented] (MESOS-8850) Race between master and allocator when destroying shared volume could lead to sorter check failure.

Meng Zhu (JIRA) Mon, 10 Sep 2018 12:20:09 -0700


    [ 
https://issues.apache.org/jira/browse/MESOS-8850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609692#comment-16609692
 ]


Meng Zhu commented on MESOS-8850:
---------------------------------

While the sequence of events in the description does indicate that something is 
wrong, how it could lead to the check failure mentioned in MESOS-8778 warrants 
further investigation. One strange thing about MESOS-8778 is that the check 
failed at 
https://github.com/apache/mesos/blob/16f70dbee7008cbc06455d901ffbba3e95591b48/src/master/allocator/mesos/hierarchical.cpp#L1125
 but not at 
https://github.com/apache/mesos/blob/16f70dbee7008cbc06455d901ffbba3e95591b48/src/master/allocator/mesos/hierarchical.cpp#L2057

> Race between master and allocator when destroying shared volume could lead to 
> sorter check failure.
> ---------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-8850
>                 URL: https://issues.apache.org/jira/browse/MESOS-8850
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>            Reporter: Meng Zhu
>            Priority: Major
>
> When destroying a shared volume, the master first rescinds offers that contain 
> the shared volume and then applies the destroy operation. This process involves 
> interaction between the master and allocator actors. The following race could 
> arise:
> 1. Framework1 and framework2 are each offered a shared disk;
> 2. Framework2 asks the master to destroy the shared disk;
> 3. Master rescinds framework1's offer that contains the shared disk;
> 4. `allocator->recoverResources` is called to recover framework1’s offered 
> resources in the allocator;
> 5. [Race] Shortly afterwards, the allocator allocates resources to framework1. 
> The allocation contains the shared disk that was just recovered and has not 
> yet been destroyed. The allocator invokes `offerCallback`, which dispatches to 
> the master;
> 6. Master continues the destroy operation and calls 
> `allocator->updateAllocation` to notify the allocator to transform the shared 
> disk to regular reserved disk;
> 7. Master processes the `offerCallback` dispatched in step 5 and offers the 
> shared disk to framework1.
> At this point, the same disk resource appears in two different places: one 
> shared and offered to framework1, one not shared and currently held by 
> framework2 (soon to be recovered).
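
The ordering in the steps above can be replayed with a small toy model. This is only a sketch under stated assumptions, not Mesos code: `Disk`, `simulate_race`, the `offered` and `allocated` maps, and the `master_queue` are hypothetical names, the two actors are reduced to plain dictionaries, and step 2 is modeled as framework2's offer being consumed while the disk stays allocated to it.

```python
from collections import deque
from typing import NamedTuple


class Disk(NamedTuple):
    """Hypothetical stand-in for a disk resource; not a Mesos type."""
    volume_id: str
    shared: bool


def simulate_race():
    shared = Disk("vol-1", shared=True)
    regular = Disk("vol-1", shared=False)

    # Dispatches waiting to run on the (toy) master actor.
    master_queue = deque()
    # Master's view: outstanding offers per framework.
    offered = {"framework1": set(), "framework2": set()}
    # Allocator's view: resources allocated per framework.
    allocated = {"framework1": set(), "framework2": set()}

    # Step 1: both frameworks are offered the shared disk.
    for framework in ("framework1", "framework2"):
        allocated[framework].add(shared)
        offered[framework].add(shared)

    # Step 2: framework2 asks to destroy the disk. Modeling assumption:
    # its offer is consumed, but the disk stays allocated to it for now.
    offered["framework2"].discard(shared)

    # Step 3: the master rescinds framework1's offer.
    offered["framework1"].discard(shared)

    # Step 4: the allocator recovers framework1's offered resources.
    allocated["framework1"].discard(shared)

    # Step 5 (race): the allocator re-allocates the still-shared disk to
    # framework1 and dispatches an offer callback to the master.
    allocated["framework1"].add(shared)
    master_queue.append(("framework1", shared))

    # Step 6: the master continues the destroy and tells the allocator to
    # turn framework2's shared disk into a regular reserved disk.
    allocated["framework2"] = {regular}

    # Step 7: the master only now processes the callback queued in step 5.
    while master_queue:
        framework, disk = master_queue.popleft()
        offered[framework].add(disk)

    return offered, allocated


if __name__ == "__main__":
    offered, allocated = simulate_race()
    print("offered to framework1:", offered["framework1"])
    print("held by framework2:  ", allocated["framework2"])
```

In this model the same `volume_id` ends up as a shared disk in framework1's offer and as a non-shared disk held by framework2, which mirrors the inconsistent state described above. It does not explain which allocator check fires in MESOS-8778.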



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
