[ 
https://issues.apache.org/jira/browse/MESOS-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-5210:
--------------------------
    Description: 
To unreserve the resource, the current prescribed workflow is to 

1. Wait for offer with the reserved resource after the scheduler/role is done 
using it.
2. Call unreserve on this resource/offer.
3. Done
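
As a rough illustration of step 2, here is a minimal sketch of the request body a scheduler might send. It assumes the v1 HTTP scheduler API shape (an ACCEPT call carrying an UNRESERVE operation); the helper name, the IDs, the role and the principal are all hypothetical, and the resource fields are a simplified assumption, not copied from a real offer.

```python
import json


def build_unreserve_call(framework_id, offer_id, reserved_resources):
    """Build an ACCEPT call body carrying a single UNRESERVE operation.

    Assumption: field names follow the v1 scheduler API as I understand it.
    """
    return {
        "framework_id": {"value": framework_id},
        "type": "ACCEPT",
        "accept": {
            "offer_ids": [{"value": offer_id}],
            "operations": [
                {
                    "type": "UNRESERVE",
                    "unreserve": {"resources": reserved_resources},
                }
            ],
        },
    }


# Hypothetical IDs and resource, for illustration only.
reserved_cpus = {
    "name": "cpus",
    "type": "SCALAR",
    "scalar": {"value": 2.0},
    "role": "example-role",
    "reservation": {"principal": "example-principal"},
}

call = build_unreserve_call("framework-1", "offer-1", [reserved_cpus])
print(json.dumps(call, indent=2))
```

Nothing in this body lets the scheduler learn whether the operation was applied, which is the gap described next.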

However this is not reliable if:
1. Master fails to receive the call: the reserved resources will be offered to 
the role again, so at least there is some signal in this case.
2. Master processes the call but the slave fails to receive the 
{{CheckpointResourcesMessage}} -> then if the master fails over, the slave will 
reregister with the resource still reserved -> inconsistency here.
3. Master and slave have both processed the call and the resource is 
unreserved, but there is no guarantee that the role that unreserved it will 
receive the offer back. Even if it does receive an offer, if the reserved 
resource is fungible it cannot distinguish between a resource that is newly 
unreserved and an additional resource that was just freed up.

If the framework doesn't go away, it can facilitate the reconciliation but if 
it wants to terminate, the question is when can it?

The best strategy right now seems to be for the stopping framework to wait 
(with a timeout) for the offer to come back after the unreserve call as a form 
of verification, which is not bulletproof, and to leave the rest to the 
operator.
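
The wait-with-timeout strategy can be sketched as follows. This is a simplified model under stated assumptions, not Mesos code: offers are plain dicts arriving on a queue, and the field names (`slave_id`, `resources`, `role`, `reservation`) and the function names are hypothetical.

```python
import queue
import time


def still_reserved(offer, role):
    """True if any resource in the offer is still reserved for `role`."""
    return any(
        r.get("role") == role and "reservation" in r
        for r in offer.get("resources", [])
    )


def wait_for_unreserve_signal(offers, role, slave_id, timeout_s):
    """Wait up to timeout_s for an offer from slave_id after unreserving.

    Returns "reserved" if an offer still carries the reservation (the call
    was probably lost), "no_reservation_seen" if an offer from that slave
    arrives without it (weak evidence only), and "timeout" otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return "timeout"
        try:
            offer = offers.get(timeout=remaining)
        except queue.Empty:
            return "timeout"
        if offer.get("slave_id") != slave_id:
            continue  # Offer from another slave: tells us nothing.
        if still_reserved(offer, role):
            return "reserved"
        return "no_reservation_seen"


# Hypothetical offer that no longer carries the reservation.
offers = queue.Queue()
offers.put({"slave_id": "slave-1", "resources": [{"name": "cpus", "role": "*"}]})
result = wait_for_unreserve_signal(offers, "example-role", "slave-1", timeout_s=0.1)
print(result)
```

Only the "reserved" outcome is a clear signal. Both "no_reservation_seen" and "timeout" leave cases 2 and 3 above undetected, which is why the rest falls to the operator.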

We should improve the reliability of unreserve operations.

  was:
To unreserve the resource, the current prescribed workflow is to 

1. Wait for offer with the reserved resource after the scheduler/role is done 
using it.
2. Call unreserve on this resource/offer.
3. Done

However this is not reliable if:
1. Master fails to receive the call: this will result in the reserved resources 
to be offered to the role again, at least there is some signal in this case.
2. Master processes the call but the slave fails to receive the 
{{CheckpointResourcesMessage}} -> then if the slave restarts or the master 
fails over, the slave will reregister with the resource still reserved -> 
inconsistency here.
3. Master and slave both have processed the call and the resource is 
unreserved, there is no guarantee that the role unreserving it would receive 
the offer back. Even if it receives an offer and if the reserved resource is  
fungible, it cannot distinguish between a resource that is newly unreserved or 
that is additional resource which is just freed up. 

If the framework doesn't go away, it can facilitate the reconciliation but if 
it wants to terminate, the question is when can it?

The best strategy right now seems to be for the stopping framework to wait 
(with a timeout) for the offer to come back after the unreserve call for some 
verification that's not bulletproof and leave the rest to the operator.

We should improve the reliability of unreserve operations.


> Reliably unreserving dynamically reserved resources is unattainable.
> --------------------------------------------------------------------
>
>                 Key: MESOS-5210
>                 URL: https://issues.apache.org/jira/browse/MESOS-5210
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Yan Xu
>
> To unreserve the resource, the current prescribed workflow is to 
> 1. Wait for offer with the reserved resource after the scheduler/role is done 
> using it.
> 2. Call unreserve on this resource/offer.
> 3. Done
> However this is not reliable if:
> 1. Master fails to receive the call: the reserved resources will be offered 
> to the role again, so at least there is some signal in this case.
> 2. Master processes the call but the slave fails to receive the 
> {{CheckpointResourcesMessage}} -> then if the master fails over, the slave 
> will reregister with the resource still reserved -> inconsistency here.
> 3. Master and slave have both processed the call and the resource is 
> unreserved, but there is no guarantee that the role that unreserved it will 
> receive the offer back. Even if it does receive an offer, if the reserved 
> resource is fungible it cannot distinguish between a resource that is newly 
> unreserved and an additional resource that was just freed up.
> If the framework doesn't go away, it can facilitate the reconciliation but if 
> it wants to terminate, the question is when can it?
> The best strategy right now seems to be for the stopping framework to wait 
> (with a timeout) for the offer to come back after the unreserve call as a 
> form of verification, which is not bulletproof, and to leave the rest to the 
> operator.
> We should improve the reliability of unreserve operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
