Geoffroy Jabouley created MESOS-2508:
----------------------------------------

             Summary: Slave recovering a docker container results in "Unknown 
container" error
                 Key: MESOS-2508
                 URL: https://issues.apache.org/jira/browse/MESOS-2508
             Project: Mesos
          Issue Type: Bug
          Components: containerization, docker, slave
    Affects Versions: 0.21.1
         Environment: Ubuntu 14.04.2 LTS
Docker 1.5.0 (same error with 1.4.1)
Mesos 0.21.1 installed from mesosphere ubuntu repo
Marathon 0.8.0 installed from mesosphere ubuntu repo
            Reporter: Geoffroy Jabouley


I'm seeing some error logs occurring during slave recovery of a Mesos task 
running in a docker container.

It does not impede the slave recovery process, as the Mesos task is still 
active and running on the slave after the recovery.

But something is not working properly when the slave recovers my docker 
container: the slave reports my container as an "Unknown container".


Cluster status:
- 1 mesos-master, 1 mesos-slave, 1 marathon framework running on the host.
- checkpointing is activated on both slave and framework
- use native docker containerizer
- 1 mesos task, started using marathon, is running inside a docker container 
and is monitored by the mesos-slave
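For reference, the task above corresponds to a Marathon app definition along these lines. This is a reconstruction from the logs below (image, command, and resources match the `docker ps` output and the cgroup updates), not the exact JSON I posted to Marathon:

{code}
{
  "id": "/test-app",
  "cmd": "env && ls -Ral /mnt && sleep 3600s",
  "cpus": 0.2,
  "mem": 96,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "docker": { "image": "ecpff/ecp_base:latest" }
  }
}
{code}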

Action:
- restart the mesos-slave process (sudo restart mesos-slave)

Expected:
- docker container still running
- mesos task still running
- no error in the mesos slave log regarding recovery process

Seen:
- docker container still running
- mesos task still running
- {color:red}Several *Unknown container* errors in the mesos slave log during 
the recovery process{color}


-----------------------------------

For what it's worth, here are my investigations:

1) The mesos task starts fine in docker container 
*e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330*. The docker 
container name is set to *mesos-adb71dc4-c07d-42a9-8fed-264c241668ad* by the 
Mesos docker containerizer, _I guess_...

{code}
I0317 09:56:14.300439  2784 slave.cpp:1083] Got assigned task 
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for framework 
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:14.380702  2784 slave.cpp:1193] Launching task 
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for framework 
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:14.384466  2784 slave.cpp:3997] Launching executor 
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 
20150311-150951-3982541578-5050-50860-0000 in work directory 
'/tmp/mesos/slaves/20150312-145235-3982541578-5050-1421-S0/frameworks/20150311-150951-3982541578-5050-50860-0000/executors/test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799/runs/adb71dc4-c07d-42a9-8fed-264c241668ad'
I0317 09:56:14.390207  2784 slave.cpp:1316] Queuing task 
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' for executor 
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 
'20150311-150951-3982541578-5050-50860-0000
I0317 09:56:14.421787  2782 docker.cpp:927] Starting container 
'adb71dc4-c07d-42a9-8fed-264c241668ad' for task 
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' (and executor 
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799') of framework 
'20150311-150951-3982541578-5050-50860-0000'
I0317 09:56:15.784143  2781 docker.cpp:633] Checkpointing pid 27080 to 
'/tmp/mesos/meta/slaves/20150312-145235-3982541578-5050-1421-S0/frameworks/20150311-150951-3982541578-5050-50860-0000/executors/test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799/runs/adb71dc4-c07d-42a9-8fed-264c241668ad/pids/forked.pid'
I0317 09:56:15.789443  2784 slave.cpp:2840] Monitoring executor 
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 
'20150311-150951-3982541578-5050-50860-0000' in container 
'adb71dc4-c07d-42a9-8fed-264c241668ad'
I0317 09:56:15.862642  2784 slave.cpp:1860] Got registration for executor 
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 
20150311-150951-3982541578-5050-50860-0000 from executor(1)@10.195.96.237:36021
I0317 09:56:15.865319  2784 slave.cpp:1979] Flushing queued task 
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for executor 
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:15.885414  2787 slave.cpp:2215] Handling status update TASK_RUNNING 
(UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task 
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 
20150311-150951-3982541578-5050-50860-0000 from executor(1)@10.195.96.237:36021
I0317 09:56:15.885902  2787 status_update_manager.cpp:317] Received status 
update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task 
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:15.886327  2787 status_update_manager.hpp:346] Checkpointing UPDATE 
for status update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for 
task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:15.904325  2785 docker.cpp:1298] Updated 'cpu.shares' to 204 at 
/sys/fs/cgroup/cpu/docker/e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330
 for container adb71dc4-c07d-42a9-8fed-264c241668ad
I0317 09:56:15.904831  2785 docker.cpp:1333] Updated 
'memory.soft_limit_in_bytes' to 96MB for container 
adb71dc4-c07d-42a9-8fed-264c241668ad
I0317 09:56:15.905591  2785 docker.cpp:1359] Updated 'memory.limit_in_bytes' to 
96MB at 
/sys/fs/cgroup/memory/docker/e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330
 for container adb71dc4-c07d-42a9-8fed-264c241668ad
I0317 09:56:16.073200  2787 slave.cpp:2458] Forwarding the update TASK_RUNNING 
(UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task 
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 
20150311-150951-3982541578-5050-50860-0000 to [email protected]:5050
I0317 09:56:16.073386  2787 slave.cpp:2391] Sending acknowledgement for status 
update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task 
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 
20150311-150951-3982541578-5050-50860-0000 to executor(1)@10.195.96.237:36021
I0317 09:56:16.226022  2788 status_update_manager.cpp:389] Received status 
update acknowledgement (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task 
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:16.226093  2788 status_update_manager.hpp:346] Checkpointing ACK 
for status update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for 
task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 
20150311-150951-3982541578-5050-50860-0000
{code}

{code}
$ docker ps --no-trunc
CONTAINER ID                                                       IMAGE                   COMMAND                                             CREATED          STATUS          PORTS   NAMES
e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330   ecpff/ecp_base:latest   "/bin/sh -c 'env && ls -Ral /mnt && sleep 3600s'"   11 minutes ago   Up 11 minutes           mesos-adb71dc4-c07d-42a9-8fed-264c241668ad
{code}
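The docker name reported above appears to be just the Mesos container (run) ID with a `mesos-` prefix. A small sketch checking that convention against the two IDs from this report (the prefix convention is inferred from the logs, not from the Mesos source):

{code}
# Mesos container (run) ID, from the "runs/<id>" path in the slave log
MESOS_CONTAINER_ID="adb71dc4-c07d-42a9-8fed-264c241668ad"
# Docker container name, from the `docker ps` output above
DOCKER_NAME="mesos-adb71dc4-c07d-42a9-8fed-264c241668ad"

# The name appears to follow a "mesos-<mesos container id>" convention:
if [ "${DOCKER_NAME}" = "mesos-${MESOS_CONTAINER_ID}" ]; then
    echo "name matches mesos-<container id> convention"
fi
{code}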


2) I restart the mesos-slave process:
{code}
$ sudo restart mesos-slave
{code}

{code}
W0317 09:56:21.672792  2754 logging.cpp:81] RAW: Received signal SIGTERM from 
process 1 of user 0; exiting
[DEBUG] /usr/sbin/mesos-slave --master=zk://10.195.96.237:2181/mesos 
--checkpoint --containerizers=docker,mesos 
--executor_registration_timeout=5mins --hostname=10.195.96.237 
--ip=10.195.96.237 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect 
--recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]
I0317 09:56:48.411228 27147 main.cpp:142] Build: 2015-01-09 02:25:56 by root
I0317 09:56:48.411418 27147 main.cpp:144] Version: 0.21.1
I0317 09:56:48.411427 27147 main.cpp:147] Git tag: 0.21.1
I0317 09:56:48.411434 27147 main.cpp:151] Git SHA: 
2ae1ba91e64f92ec71d327e10e6ba9e8ad5477e8
I0317 09:56:48.521886 27147 containerizer.cpp:100] Using isolation: 
cgroups/cpu,cgroups/mem
I0317 09:56:48.570104 27147 linux_launcher.cpp:94] Using /sys/fs/cgroup/freezer 
as the freezer hierarchy for the Linux launcher
I0317 09:56:48.577716 27147 main.cpp:165] Starting Mesos slave
2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@712: Client 
environment:zookeeper.version=zookeeper C client 3.4.5
2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@716: Client 
environment:host.name=server-ubuntu-updatecluster
2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@723: Client 
environment:os.name=Linux
2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@724: Client 
environment:os.arch=3.13.0-46-generic
2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@725: Client 
environment:os.version=#75-Ubuntu SMP Tue Feb 10 15:24:04 UTC 2015
I0317 09:56:48.580657 27147 slave.cpp:169] Slave started on 
1)@10.195.96.237:5051
2015-03-17 09:56:48,581:27147(0x7f45031d5700):ZOO_INFO@log_env@733: Client 
environment:user.name=(null)
2015-03-17 09:56:48,581:27147(0x7f45031d5700):ZOO_INFO@log_env@741: Client 
environment:user.home=/root
2015-03-17 09:56:48,581:27147(0x7f45031d5700):ZOO_INFO@log_env@753: Client 
environment:user.dir=/
2015-03-17 09:56:48,581:27147(0x7f45031d5700):ZOO_INFO@zookeeper_init@786: 
Initiating client connection, host=10.195.96.237:2181 sessionTimeout=10000 
watcher=0x7f450939a6a0 sessionId=0 sessionPasswd=<null> context=0x7f44f0000b70 
flags=0
I0317 09:56:48.581478 27147 slave.cpp:289] Slave resources: 
ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):1979; disk(*):22974
I0317 09:56:48.581533 27147 slave.cpp:318] Slave hostname: 10.195.96.237
I0317 09:56:48.581548 27147 slave.cpp:319] Slave checkpoint: true
I0317 09:56:48.583925 27175 state.cpp:33] Recovering state from 
'/tmp/mesos/meta'
2015-03-17 09:56:48,586:27147(0x7f4501196700):ZOO_INFO@check_events@1703: 
initiated connection to server [10.195.96.237:2181]
2015-03-17 09:56:48,589:27147(0x7f4501196700):ZOO_INFO@check_events@1750: 
session establishment complete on server [10.195.96.237:2181], 
sessionId=0x14c0e422f7f0009, negotiated timeout=10000
I0317 09:56:48.589994 27181 group.cpp:313] Group process 
(group(1)@10.195.96.237:5051) connected to ZooKeeper
I0317 09:56:48.590104 27181 group.cpp:790] Syncing group operations: queue size 
(joins, cancels, datas) = (0, 0, 0)
I0317 09:56:48.590137 27181 group.cpp:385] Trying to create path '/mesos' in 
ZooKeeper
I0317 09:56:48.592377 27181 detector.cpp:138] Detected a new leader: (id='2')
I0317 09:56:48.592711 27181 group.cpp:659] Trying to get 
'/mesos/info_0000000002' in ZooKeeper
I0317 09:56:48.593904 27181 detector.cpp:433] A new leading master 
([email protected]:5050) is detected
{code}


3) The Mesos slave tries to recover the docker container 
*adb71dc4-c07d-42a9-8fed-264c241668ad*. Shouldn't it be container 
*e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330*?

{code}
I0317 09:56:48.592711 27181 group.cpp:659] Trying to get 
'/mesos/info_0000000002' in ZooKeeper
I0317 09:56:48.593904 27181 detector.cpp:433] A new leading master 
([email protected]:5050) is detected
I0317 09:56:48.613559 27175 slave.cpp:3548] Recovering framework 
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:48.613596 27175 slave.cpp:4116] Recovering executor 
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:48.613919 27181 status_update_manager.cpp:197] Recovering status 
update manager
I0317 09:56:48.613945 27181 status_update_manager.cpp:205] Recovering executor 
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:48.614933 27181 docker.cpp:767] Recovering Docker containers
I0317 09:56:48.614982 27181 docker.cpp:814] Recovering container 
'adb71dc4-c07d-42a9-8fed-264c241668ad' for executor 
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 
20150311-150951-3982541578-5050-50860-0000
I0317 09:56:48.616395 27174 containerizer.cpp:281] Recovering containerizer
I0317 09:56:48.616457 27174 containerizer.cpp:323] Recovering container 
'adb71dc4-c07d-42a9-8fed-264c241668ad' for executor 
'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 
20150311-150951-3982541578-5050-50860-0000
{code}


4) It fails because it *does not find any cgroup for container 
adb71dc4-c07d-42a9-8fed-264c241668ad*.
Again, is the "adb71dc4-c07d-42a9-8fed-264c241668ad" ID something internal to 
Mesos?
The docker container has an ID of 
*e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330*, and for 
example 
_/sys/fs/cgroup/memory/docker/e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330_
 and 
_/sys/fs/cgroup/freezer/docker/e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330_
 are present on the file system.
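One way to see the mismatch is that the two cgroup hierarchies are keyed by different IDs. With the IDs from this report, the path Docker actually created and the path the recovery seems to probe look like this (a sketch; the exact location the isolators check is an assumption on my part):

{code}
MESOS_ID="adb71dc4-c07d-42a9-8fed-264c241668ad"
DOCKER_ID="e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330"

# Cgroup that exists on the host: created by Docker, keyed by the Docker ID
echo "/sys/fs/cgroup/freezer/docker/${DOCKER_ID}"

# Cgroup the recovery appears to look for, keyed by the Mesos container ID
# (assumed location, consistent with "Couldn't find freezer cgroup" below)
echo "/sys/fs/cgroup/freezer/mesos/${MESOS_ID}"
{code}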

{code}
I0317 09:56:48.616981 27174 linux_launcher.cpp:169] Couldn't find freezer 
cgroup for container adb71dc4-c07d-42a9-8fed-264c241668ad, assuming already 
destroyed
W0317 09:56:48.634775 27175 cpushare.cpp:203] Couldn't find cgroup for 
container adb71dc4-c07d-42a9-8fed-264c241668ad
E0317 09:56:48.636764 27180 containerizer.cpp:1140] Error in a resource 
limitation for container adb71dc4-c07d-42a9-8fed-264c241668ad: Unknown container
I0317 09:56:48.636808 27180 containerizer.cpp:946] Destroying container 
'adb71dc4-c07d-42a9-8fed-264c241668ad'
E0317 09:56:48.640025 27175 containerizer.cpp:1140] Error in a resource 
limitation for container adb71dc4-c07d-42a9-8fed-264c241668ad: Unknown container
I0317 09:56:48.823640 27177 slave.cpp:3407] Sending reconnect request to 
executor test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 
20150311-150951-3982541578-5050-50860-0000 at executor(1)@10.195.96.237:36021
I0317 09:56:48.826930 27178 slave.cpp:2039] Re-registering executor 
test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 
20150311-150951-3982541578-5050-50860-0000
W0317 09:56:49.829905 27181 containerizer.cpp:888] Skipping resource statistic 
for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
W0317 09:56:49.830080 27181 containerizer.cpp:888] Skipping resource statistic 
for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
I0317 09:56:50.824409 27178 slave.cpp:2164] Cleaning up un-reregistered 
executors
I0317 09:56:50.824524 27178 slave.cpp:3466] Finished recovery
{code}


5) The same error message loops in the slave log, spamming the log file:

{code}
W0317 09:56:50.830857 27175 containerizer.cpp:888] Skipping resource statistic 
for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
W0317 09:56:50.830904 27175 containerizer.cpp:888] Skipping resource statistic 
for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
W0317 09:56:51.832145 27177 containerizer.cpp:888] Skipping resource statistic 
for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
