[
https://issues.apache.org/jira/browse/MESOS-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249989#comment-15249989
]
Neil Conway commented on MESOS-5210:
------------------------------------
I agree, this is an issue. In the past, we have talked about addressing the
problem via some combination of (a) reservation IDs and (b) a "reconciliation"
operation for reservations -- see MESOS-3746 and MESOS-3826.
> Reliably unreserving dynamically reserved resources is unattainable.
> --------------------------------------------------------------------
>
> Key: MESOS-5210
> URL: https://issues.apache.org/jira/browse/MESOS-5210
> Project: Mesos
> Issue Type: Bug
> Reporter: Yan Xu
>
> To unreserve a resource, the currently prescribed workflow is:
> 1. Wait for an offer containing the reserved resource after the
> scheduler/role is done using it.
> 2. Call unreserve on this resource/offer.
> 3. Done.
> However, this is not reliable if:
> 1. The master fails to receive the call: the reserved resources will be
> offered to the role again, so at least there is some signal in this case.
> 2. The master processes the call but the slave fails to receive the
> {{CheckpointResourcesMessage}}: if the master then fails over, the slave
> will reregister with the resource still reserved, which is an inconsistency.
> 3. The master and slave have both processed the call and the resource is
> unreserved: there is no guarantee that the role that unreserved it will
> receive the offer back. Even if it receives an offer, if the reserved
> resource is fungible, it cannot distinguish a newly unreserved resource from
> an additional resource that has just been freed up.
> If the framework doesn't go away, it can facilitate the reconciliation, but
> if it wants to terminate, the question is when it can do so.
> The best strategy right now seems to be for the stopping framework to wait
> (with a timeout) for the offer to come back after the unreserve call, as some
> verification that is not bulletproof, and to leave the rest to the operator.
> We should improve the reliability of unreserve operations.
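The "wait with a timeout" strategy described above could look roughly like the following. This is only a sketch and does not use the Mesos scheduler API: `poll_offers`, the dict shape of an offer, and the three result strings are all hypothetical names introduced for illustration.

```python
import time

# Hypothetical result values; these are not part of any Mesos API.
STILL_RESERVED = "still_reserved"        # cases 1 and 2: the reservation is visible again
LIKELY_UNRESERVED = "likely_unreserved"  # case 3: a hint only, not proof
UNKNOWN = "unknown"                      # timed out: leave it to the operator


def check_unreserve(poll_offers, role, agent_id, timeout_s, poll_interval_s=0.1):
    """Wait up to timeout_s for an offer from agent_id and inspect it.

    poll_offers is an assumed callable returning a list of offers, each a
    dict like {"agent_id": str, "resources": [{"name", "value", "role"}]}.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        for offer in poll_offers():
            if offer["agent_id"] != agent_id:
                continue
            roles = {res["role"] for res in offer["resources"]}
            if role in roles:
                # The resource came back still reserved for our role.
                return STILL_RESERVED
            # An offer without the reservation is only weak evidence:
            # fungible resources cannot be told apart from freshly freed ones.
            return LIKELY_UNRESERVED
        time.sleep(poll_interval_s)
    return UNKNOWN
```

A stopping framework might retry the unreserve call on `still_reserved` and exit on the other two results. `unknown` and `likely_unreserved` both still leave the final check to the operator, which is exactly the gap this issue is about.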