[ https://issues.apache.org/jira/browse/MESOS-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Niemitz updated MESOS-2215:
---------------------------------
Description:
Once the slave restarts and recovers the task, I see this error in the log
every second or so for every recovered task. Note, these were NOT docker
tasks:
W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage for
container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor
thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd
of framework 20150109-161713-715350282-5050-290797-0000: Failed to 'docker
inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited with
status 1 stderr = Error: No such image or container:
mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21
However, the tasks themselves are still healthy and running.
The slave was launched with --containerizers=mesos,docker
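For context, here is a minimal standalone sketch (illustrative only, not Mesos source) of what that per-second probe amounts to: the docker path runs `docker inspect mesos-<containerId>` and treats a non-zero exit as a failure, which is exactly the "No such image or container" error in the warning above for a container the Docker daemon never launched. The `mesos-` name prefix and the containerId are taken from the log; the function name and structure are assumptions.

// Illustrative only -- not Mesos code. Shows why the usage probe fails for a
// container the Docker daemon never launched.
#include <array>
#include <cstdio>
#include <iostream>
#include <string>

// Run `docker inspect mesos-<containerId>` and report whether it succeeded.
bool inspectContainer(const std::string& containerId, std::string* output)
{
  const std::string command = "docker inspect mesos-" + containerId + " 2>&1";

  FILE* pipe = popen(command.c_str(), "r");
  if (pipe == nullptr) {
    return false;
  }

  std::array<char, 256> buffer;
  while (fgets(buffer.data(), buffer.size(), pipe) != nullptr) {
    output->append(buffer.data());
  }

  // A non-zero exit ("No such image or container") means we are probing a
  // container that was never launched through Docker.
  return pclose(pipe) == 0;
}

int main()
{
  std::string output;
  if (!inspectContainer("7b729b89-dc7e-4d08-af97-8cd1af560a21", &output)) {
    std::cerr << "Failed to 'docker inspect': " << output;
  }
}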
-----
More info: it looks like the docker containerizer is a little too ambitious
about recovering containers; again, this was not a docker task:
I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container
'7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor
'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd'
of framework 20150109-161713-715350282-5050-290797-0000
Looking into the source, it looks like the problem is that the
ComposingContainerizer runs recover() on all containerizers in parallel, but
neither the docker containerizer nor the mesos containerizer checks whether it
should recover the task (i.e., whether it was the one that launched it).
Perhaps this needs to be written into the checkpoint somewhere?
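To make the suggestion concrete, a minimal sketch of the idea (hypothetical; the field name, struct, and helper below are not Mesos APIs): if the name of the launching containerizer were checkpointed alongside the rest of the executor state, each containerizer's recovery path could skip containers it did not launch.

// Hypothetical sketch: checkpoint the launching containerizer's name and
// filter on it during recovery. Names and layout are illustrative only.
#include <iostream>
#include <string>
#include <vector>

struct CheckpointedContainer
{
  std::string containerId;
  std::string containerizer; // "mesos" or "docker", written at launch time.
};

// What each containerizer's recover() could do: only pick up containers that
// were checkpointed with its own name, instead of attempting all of them.
std::vector<CheckpointedContainer> recoverable(
    const std::vector<CheckpointedContainer>& checkpointed,
    const std::string& self)
{
  std::vector<CheckpointedContainer> result;
  for (const CheckpointedContainer& container : checkpointed) {
    if (container.containerizer == self) {
      result.push_back(container);
    }
  }
  return result;
}

int main()
{
  const std::vector<CheckpointedContainer> checkpointed = {
    {"7b729b89-dc7e-4d08-af97-8cd1af560a21", "mesos"},
    {"1c2d3e4f-0000-4000-8000-000000000000", "docker"}, // hypothetical id
  };

  // The docker containerizer would leave the mesos-launched container alone
  // instead of running `docker inspect` on it every second.
  for (const CheckpointedContainer& container :
       recoverable(checkpointed, "docker")) {
    std::cout << "docker containerizer recovers " << container.containerId
              << std::endl;
  }
}

Checkpointing the containerizer name at launch time would keep the decision with the slave's existing recovery data rather than leaving each containerizer to guess which containers belong to it.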
> The Docker containerizer attempts to recover any task when checkpointing is
> enabled, not just docker tasks.
> -----------------------------------------------------------------------------------------------------------
>
> Key: MESOS-2215
> URL: https://issues.apache.org/jira/browse/MESOS-2215
> Project: Mesos
> Issue Type: Bug
> Components: docker
> Affects Versions: 0.21.0
> Reporter: Steve Niemitz
> Assignee: Timothy Chen
>
> Once the slave restarts and recovers the task, I see this error in the log
> every second or so for every recovered task. Note, these were NOT docker
> tasks:
> W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage
> for container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor
> thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd
> of framework 20150109-161713-715350282-5050-290797-0000: Failed to 'docker
> inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited
> with status 1 stderr = Error: No such image or container:
> mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21
> However, the tasks themselves are still healthy and running.
> The slave was launched with --containerizers=mesos,docker
> -----
> More info: it looks like the docker containerizer is a little too ambitious
> about recovering containers; again, this was not a docker task:
> I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container
> '7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor
> 'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd'
> of framework 20150109-161713-715350282-5050-290797-0000
> Looking into the source, it looks like the problem is that the
> ComposingContainerizer runs recover() on all containerizers in parallel, but
> neither the docker containerizer nor the mesos containerizer checks whether
> it should recover the task (i.e., whether it was the one that launched it).
> Perhaps this needs to be written into the checkpoint somewhere?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)