[ https://issues.apache.org/jira/browse/MESOS-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Chen updated MESOS-2601:
--------------------------------
    Description: 
We've seen in our test cluster that tasks launched with the Mesos 
containerizer are recovered after a slave restart, but the actual command 
process is no longer running and the checkpointed executor is not marked as 
completed.

The Mesos containerizer recovers, but none of the isolators could recover the 
task (the cgroups are already gone). The container itself is somehow never 
removed, and the monitor keeps calling usage() on the containerizer for it.
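
To make the expected behaviour concrete, here is a minimal, self-contained 
sketch in plain C++ (this is not the actual Mesos code; Monitor, 
watch()/unwatch() and the other names are illustrative assumptions) of the 
cleanup chain we would expect when recovery finds the cgroup missing. The 
comment in destroy() marks the step that, judging by the log below, 
effectively does not happen:

#include <iostream>
#include <set>
#include <string>

class Monitor {
 public:
  void watch(const std::string& containerId) { watched_.insert(containerId); }
  void unwatch(const std::string& containerId) { watched_.erase(containerId); }

  // Poll resource usage for every container still being watched.
  void poll(const std::set<std::string>& knownToContainerizer) const {
    for (const std::string& id : watched_) {
      if (knownToContainerizer.count(id) == 0) {
        // The repeated warning seen in the log below.
        std::cout << "Skipping resource statistic for container " << id
                  << " because: Unknown container\n";
      }
    }
  }

 private:
  std::set<std::string> watched_;
};

class Containerizer {
 public:
  explicit Containerizer(Monitor* monitor) : monitor_(monitor) {}

  // Recovery of one checkpointed container: if its cgroup is gone, the
  // command process is dead and the container is destroyed immediately.
  void recover(const std::string& containerId, bool cgroupExists) {
    containers_.insert(containerId);
    monitor_->watch(containerId);

    if (!cgroupExists) {
      std::cout << "Couldn't find freezer cgroup for container " << containerId
                << ", assuming already destroyed\n";
      destroy(containerId);
    }
  }

  void destroy(const std::string& containerId) {
    std::cout << "Destroying container '" << containerId << "'\n";
    containers_.erase(containerId);

    // Expected: the termination is surfaced so the monitor stops watching and
    // the executor gets marked completed. The behaviour described above
    // suggests this step (or its equivalent) never happens for a
    // recovery-time destroy.
    monitor_->unwatch(containerId);
  }

  const std::set<std::string>& containers() const { return containers_; }

 private:
  Monitor* monitor_;
  std::set<std::string> containers_;
};

int main() {
  Monitor monitor;
  Containerizer containerizer(&monitor);

  // The freezer cgroup was not found on the host after the restart.
  containerizer.recover("990741ed-909e-49cc-83f8-be63298872da", false);

  // If unwatch() were skipped, every poll would log the "Unknown container"
  // warning over and over, which matches what we observe.
  monitor.poll(containerizer.containers());
}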

Relevant log lines from the beginning of slave recovery:

I0408 18:06:33.261379 32504 slave.cpp:577] Successfully attached file 
'/hdd/mesos/slave/slaves/20150401-160104-251662508-5050-2197-S1/frameworks/20141222-194154-218108076-5050-4125-0004/executors/ct:1427921848104:0:EM
 DataDog Uploader:/runs/990741ed-909e-49cc-83f8-be63298872da'
...
I0408 18:06:36.583277 32511 containerizer.cpp:350] Recovering container 
'990741ed-909e-49cc-83f8-be63298872da' for executor 'ct:1427921848104:0:EM 
DataDog Uploader:' of framework 20141222-194154-218108076-5050-4125-0004
....
I0408 18:06:37.017122 32511 linux_launcher.cpp:162] Couldn't find freezer 
cgroup for container 990741ed-909e-49cc-83f8-be63298872da, assuming already 
destroyed
W0408 18:06:37.074916 32496 cpushare.cpp:199] Couldn't find cgroup for 
container 990741ed-909e-49cc-83f8-be63298872da
I0408 18:06:37.075173 32486 mem.cpp:158] Couldn't find cgroup for container 
990741ed-909e-49cc-83f8-be63298872da
E0408 18:06:37.092279 32496 containerizer.cpp:1136] Error in a resource 
limitation for container 990741ed-909e-49cc-83f8-be63298872da: Unknown container
I0408 18:06:37.092643 32496 containerizer.cpp:906] Destroying container 
'990741ed-909e-49cc-83f8-be63298872da'
W0408 18:06:37.229626 32501 containerizer.cpp:807] Ignoring update for 
currently being destroyed container: 990741ed-909e-49cc-83f8-be63298872da
W0408 18:06:38.129873 32484 containerizer.cpp:844] Skipping resource statistic 
for container 990741ed-909e-49cc-83f8-be63298872da because: Unknown container
W0408 18:06:38.129909 32484 containerizer.cpp:844] Skipping resource statistic 
for container 990741ed-909e-49cc-83f8-be63298872da because: Unknown container

  was:
We've seen in our test cluster that tasks launched with the Mesos 
containerizer are recovered after a slave restart, but the actual command 
process is no longer running and the checkpointed executor is not marked as 
completed.

The Mesos containerizer recovers, but none of the isolators could recover the 
task (the cgroups are already gone). The container itself is somehow never 
removed, and the monitor keeps calling usage() on the containerizer for it.


> Tasks are not removed after recovery from slave and mesos containerizer
> -----------------------------------------------------------------------
>
>                 Key: MESOS-2601
>                 URL: https://issues.apache.org/jira/browse/MESOS-2601
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, slave
>    Affects Versions: 0.22.1
>            Reporter: Timothy Chen
>
> We've seen in our test cluster that tasks launched with the Mesos 
> containerizer are recovered after a slave restart, but the actual command 
> process is no longer running and the checkpointed executor is not marked as 
> completed.
> The Mesos containerizer recovers, but none of the isolators could recover 
> the task (the cgroups are already gone). The container itself is somehow 
> never removed, and the monitor keeps calling usage() on the containerizer 
> for it.
> Relevant log lines from the beginning of slave recovery:
> I0408 18:06:33.261379 32504 slave.cpp:577] Successfully attached file 
> '/hdd/mesos/slave/slaves/20150401-160104-251662508-5050-2197-S1/frameworks/20141222-194154-218108076-5050-4125-0004/executors/ct:1427921848104:0:EM
>  DataDog Uploader:/runs/990741ed-909e-49cc-83f8-be63298872da'
> ...
> I0408 18:06:36.583277 32511 containerizer.cpp:350] Recovering container 
> '990741ed-909e-49cc-83f8-be63298872da' for executor 'ct:1427921848104:0:EM 
> DataDog Uploader:' of framework 20141222-194154-218108076-5050-4125-0004
> ....
> I0408 18:06:37.017122 32511 linux_launcher.cpp:162] Couldn't find freezer 
> cgroup for container 990741ed-909e-49cc-83f8-be63298872da, assuming already 
> destroyed
> W0408 18:06:37.074916 32496 cpushare.cpp:199] Couldn't find cgroup for 
> container 990741ed-909e-49cc-83f8-be63298872da
> I0408 18:06:37.075173 32486 mem.cpp:158] Couldn't find cgroup for container 
> 990741ed-909e-49cc-83f8-be63298872da
> E0408 18:06:37.092279 32496 containerizer.cpp:1136] Error in a resource 
> limitation for container 990741ed-909e-49cc-83f8-be63298872da: Unknown 
> container
> I0408 18:06:37.092643 32496 containerizer.cpp:906] Destroying container 
> '990741ed-909e-49cc-83f8-be63298872da'
> W0408 18:06:37.229626 32501 containerizer.cpp:807] Ignoring update for 
> currently being destroyed container: 990741ed-909e-49cc-83f8-be63298872da
> W0408 18:06:38.129873 32484 containerizer.cpp:844] Skipping resource 
> statistic for container 990741ed-909e-49cc-83f8-be63298872da because: Unknown 
> container
> W0408 18:06:38.129909 32484 containerizer.cpp:844] Skipping resource 
> statistic for container 990741ed-909e-49cc-83f8-be63298872da because: Unknown 
> container



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
