[ https://issues.apache.org/jira/browse/MESOS-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249989#comment-15249989 ]

Neil Conway commented on MESOS-5210:
------------------------------------

I agree, this is an issue. In the past, we have talked about addressing the 
problem via some combination of (a) reservation IDs and (b) a "reconciliation" 
operation for reservations; see MESOS-3746 and MESOS-3826.

> Reliably unreserving dynamically reserved resources is unattainable.
> --------------------------------------------------------------------
>
>                 Key: MESOS-5210
>                 URL: https://issues.apache.org/jira/browse/MESOS-5210
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Yan Xu
>
> To unreserve a resource, the currently prescribed workflow is:
> 1. Wait for an offer containing the reserved resource after the 
> scheduler/role is done using it.
> 2. Call unreserve on this resource/offer.
> 3. Done.
> However, this is not reliable if:
> 1. The master fails to receive the call: the reserved resources will be 
> offered to the role again, so at least there is some signal in this case.
> 2. The master processes the call but the slave fails to receive the 
> {{CheckpointResourcesMessage}}: if the master then fails over, the slave 
> will reregister with the resource still reserved, which is an inconsistency.
> 3. The master and slave have both processed the call and the resource is 
> unreserved, but there is no guarantee that the role that unreserved it will 
> receive the offer back. Even if it receives an offer, if the reserved 
> resource is fungible, it cannot distinguish between a resource that is newly 
> unreserved and an additional resource that was just freed up.
> If the framework doesn't go away, it can facilitate the reconciliation, but 
> if it wants to terminate, the question is: when can it?
> The best strategy right now seems to be for the stopping framework to wait 
> (with a timeout) for the offer to come back after the unreserve call, as a 
> form of verification that is not bulletproof, and to leave the rest to the 
> operator.
> We should improve the reliability of unreserve operations.
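
For illustration only, the "wait with a timeout" strategy described in the issue could be sketched roughly as below. This is not Mesos API code: `Resource`, `poll_offers`, and `verify_unreserved` are hypothetical names invented for this sketch, and it assumes the scheduler can somehow obtain the resources from its latest offers as simple records, with role "*" standing for unreserved resources. The three return values mirror the three cases above: a clear signal that the reservation is still there, an ambiguous "probably" for fungible resources, and no signal at all.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Resource:
    """Hypothetical, simplified view of one offered resource."""
    agent_id: str
    name: str      # e.g. "cpus"
    amount: float
    role: str      # "*" is assumed to mean "unreserved"


def verify_unreserved(
    poll_offers: Callable[[], Iterable[Resource]],  # hypothetical hook
    agent_id: str,
    name: str,
    amount: float,
    role: str,
    timeout_secs: float,
    interval_secs: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Best-effort check after an unreserve call; not bulletproof.

    Returns one of:
      "still_reserved"      - the resource was offered to the role again
                              (e.g. the master never received the call).
      "probably_unreserved" - enough unreserved resource showed up on the
                              agent, but it may just be freshly freed
                              capacity rather than ours (fungible).
      "unknown"             - no relevant offer before the timeout; the
                              rest is left to the operator.
    """
    deadline = time.monotonic() + timeout_secs
    while True:
        for r in poll_offers():
            if r.agent_id != agent_id or r.name != name:
                continue
            if r.role == role:
                # Without reservation IDs this could also be a different
                # reservation for the same role on the same agent.
                return "still_reserved"
            if r.role == "*" and r.amount >= amount:
                return "probably_unreserved"
        if time.monotonic() >= deadline:
            return "unknown"
        sleep(interval_secs)


if __name__ == "__main__":
    # Simulated offers: nothing at first, then unreserved cpus on agent "a1".
    rounds = iter([[], [Resource("a1", "cpus", 2.0, "*")]])
    outcome = verify_unreserved(
        lambda: next(rounds), "a1", "cpus", 2.0, "web",
        timeout_secs=5.0, interval_secs=0.1,
    )
    print(outcome)
```

Even in this toy form, the best positive outcome is only "probably_unreserved", which is the gap that reservation IDs or a reconciliation operation would be meant to close.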



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
