[
https://issues.apache.org/jira/browse/MESOS-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990141#comment-13990141
]
Till Toenshoff commented on MESOS-1243:
---------------------------------------
Recovery:
Right now {{recover}} is not specific to a single container or executor, hence
it shouldn't fail just because one of them could not be recovered.
Let me sketch this from the ExternalContainerizer's (EC) point of view in a
failure scenario:
The slave invokes {{launch}} and the EC tries to pass this on to the ECP. Now
assume the slave dies before the ECP is actually able to launch anything.
After a {{recover}} the slave assumes that the ECP will be able to {{wait}}
on that container. The ECP however never {{launch}}ed that container, hence it
is unable to {{wait}} and thus unable to return a {{Termination}}.
So the problem has to be looked at with one thing in mind: the ECP and the
slave may disagree about the state of a container.
The quick way out of this is to allow that {{Termination}} to be optional.
Another way may be to make sure that the container is only checkpointed after
its launch has fully completed.
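To make the first option concrete, here is a minimal sketch, not the actual
Mesos code. It assumes C++17 and uses std::optional as a stand-in for the
Mesos Option type; ToyContainerizer, its synchronous wait, the simplified
ContainerID and Termination structs and the free function recover are all
hypothetical and only illustrate the idea. In the sketch, wait on a container
that was never launched yields "none" instead of an error, so a recovery pass
over all checkpointed containers can skip it and carry on.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for the Mesos types of the same name.
struct ContainerID { std::string value; };
struct Termination { int status; };

// Toy model of the ECP side: it only knows containers it actually launched.
class ToyContainerizer {
public:
  // Records a launched container together with the status it exits with.
  void launch(const ContainerID& id, int exitStatus) {
    terminations_[id.value] = Termination{exitStatus};
  }

  // Returns the Termination for a known container, or std::nullopt when
  // this containerizer has never heard of the given ContainerID.
  std::optional<Termination> wait(const ContainerID& id) const {
    auto it = terminations_.find(id.value);
    if (it == terminations_.end()) {
      return std::nullopt;
    }
    return it->second;
  }

private:
  std::map<std::string, Termination> terminations_;
};

// Slave-side recovery over all checkpointed containers: an unknown
// container is skipped rather than failing the whole recovery.
std::vector<std::string> recover(
    const ToyContainerizer& containerizer,
    const std::vector<ContainerID>& checkpointed) {
  std::vector<std::string> recovered;
  for (const ContainerID& id : checkpointed) {
    if (containerizer.wait(id).has_value()) {
      recovered.push_back(id.value);
    }
  }
  return recovered;
}
```

With a container "a" that was launched and a container "b" that was only
checkpointed by the slave, recover returns just "a"; the missing
{{Termination}} for "b" no longer aborts the recovery.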
> Containerizer::wait return type should be Option<Termination>
> -------------------------------------------------------------
>
> Key: MESOS-1243
> URL: https://issues.apache.org/jira/browse/MESOS-1243
> Project: Mesos
> Issue Type: Improvement
> Reporter: Till Toenshoff
> Priority: Minor
> Labels: containerizer, external-containerizer, isolation, mesos,
> mesos-containerizer
>
> The containerizer {{wait}} should return an {{Option<Termination>}} to
> distinguish the case when it doesn't know about a {{ContainerID}}.