[ https://issues.apache.org/jira/browse/MESOS-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370870#comment-14370870 ]
Timothy Chen commented on MESOS-2508:
-------------------------------------

Hi there, I think this is a known issue. Basically, every containerizer attempts to recover all checkpointed tasks, even the ones it did not launch itself, so in this case the Docker containerizer is trying to recover Mesos containerizer tasks.
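To illustrate the point, here is a minimal C++ sketch of that fan-out recovery pattern; all types and names below are made up for illustration and are not the actual Mesos source:
{code}
// Illustrative sketch only: hypothetical types, not the actual Mesos source.
#include <iostream>
#include <string>
#include <vector>

// A checkpointed container as read back from the slave's meta directory.
struct ContainerState {
  std::string containerId;
  std::string launchedBy;  // "docker" or "mesos"
};

struct Containerizer {
  std::string name;

  // Naive recovery: each containerizer is handed every checkpointed
  // container, including ones it never launched.
  void recover(const std::vector<ContainerState>& states) const {
    for (const ContainerState& state : states) {
      if (state.launchedBy != name) {
        // This is the source of the "Unknown container" noise: the
        // container's cgroups and metadata belong to the other containerizer.
        std::cerr << name << ": couldn't find cgroup for container "
                  << state.containerId << "\n";
      } else {
        std::cout << name << ": recovered container "
                  << state.containerId << "\n";
      }
    }
  }
};

int main() {
  const std::vector<ContainerState> checkpointed = {
      {"adb71dc4-c07d-42a9-8fed-264c241668ad", "docker"}};

  // With --containerizers=docker,mesos, BOTH containerizers are asked to
  // recover, so each one also sees the other's containers.
  for (const Containerizer& c :
       {Containerizer{"docker"}, Containerizer{"mesos"}}) {
    c.recover(checkpointed);
  }
}
{code}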
> Slave recovering a docker container results in "Unknown container" error
> -------------------------------------------------------------------------
>
>                 Key: MESOS-2508
>                 URL: https://issues.apache.org/jira/browse/MESOS-2508
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, docker, slave
>    Affects Versions: 0.21.1
>         Environment: Ubuntu 14.04.2 LTS
> Docker 1.5.0 (same error with 1.4.1)
> Mesos 0.21.1 installed from mesosphere ubuntu repo
> Marathon 0.8.0 installed from mesosphere ubuntu repo
>            Reporter: Geoffroy Jabouley
>
> I'm seeing error logs occurring during slave recovery of a Mesos task running inside a docker container.
> It does not impede the slave recovery process, as the mesos task is still active and running on the slave after the recovery.
> But something is not working properly when the slave recovers my docker container: the slave reports it as an "Unknown container".
> Cluster status:
> - 1 mesos-master, 1 mesos-slave, 1 marathon framework running on the host
> - checkpointing is activated on both slave and framework
> - the native docker containerizer is used
> - 1 mesos task, started using marathon, is running inside a docker container and is monitored by the mesos-slave
> Action:
> - restart the mesos-slave process (sudo restart mesos-slave)
> Expected:
> - docker container still running
> - mesos task still running
> - no error in the mesos slave log regarding the recovery process
> Seen:
> - docker container still running
> - mesos task still running
> - {color:red}Several *Unknown container* errors in the mesos slave log during the recovery process{color}
> -----------------------------------
> For what it's worth, here are my investigations:
> 1) The mesos task starts fine in the docker container *e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330*. The docker container name is set to *mesos-adb71dc4-c07d-42a9-8fed-264c241668ad*, by the Mesos docker containerizer _I guess_...
> {code}
> I0317 09:56:14.300439 2784 slave.cpp:1083] Got assigned task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:14.380702 2784 slave.cpp:1193] Launching task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:14.384466 2784 slave.cpp:3997] Launching executor test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000 in work directory '/tmp/mesos/slaves/20150312-145235-3982541578-5050-1421-S0/frameworks/20150311-150951-3982541578-5050-50860-0000/executors/test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799/runs/adb71dc4-c07d-42a9-8fed-264c241668ad'
> I0317 09:56:14.390207 2784 slave.cpp:1316] Queuing task 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' for executor test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework '20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:14.421787 2782 docker.cpp:927] Starting container 'adb71dc4-c07d-42a9-8fed-264c241668ad' for task 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' (and executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799') of framework '20150311-150951-3982541578-5050-50860-0000'
> I0317 09:56:15.784143 2781 docker.cpp:633] Checkpointing pid 27080 to '/tmp/mesos/meta/slaves/20150312-145235-3982541578-5050-1421-S0/frameworks/20150311-150951-3982541578-5050-50860-0000/executors/test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799/runs/adb71dc4-c07d-42a9-8fed-264c241668ad/pids/forked.pid'
> I0317 09:56:15.789443 2784 slave.cpp:2840] Monitoring executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework '20150311-150951-3982541578-5050-50860-0000' in container 'adb71dc4-c07d-42a9-8fed-264c241668ad'
> I0317 09:56:15.862642 2784 slave.cpp:1860] Got registration for executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 20150311-150951-3982541578-5050-50860-0000 from executor(1)@10.195.96.237:36021
> I0317 09:56:15.865319 2784 slave.cpp:1979] Flushing queued task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:15.885414 2787 slave.cpp:2215] Handling status update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000 from executor(1)@10.195.96.237:36021
> I0317 09:56:15.885902 2787 status_update_manager.cpp:317] Received status update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:15.886327 2787 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:15.904325 2785 docker.cpp:1298] Updated 'cpu.shares' to 204 at /sys/fs/cgroup/cpu/docker/e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330 for container adb71dc4-c07d-42a9-8fed-264c241668ad
> I0317 09:56:15.904831 2785 docker.cpp:1333] Updated 'memory.soft_limit_in_bytes' to 96MB for container adb71dc4-c07d-42a9-8fed-264c241668ad
> I0317 09:56:15.905591 2785 docker.cpp:1359] Updated 'memory.limit_in_bytes' to 96MB at /sys/fs/cgroup/memory/docker/e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330 for container adb71dc4-c07d-42a9-8fed-264c241668ad
> I0317 09:56:16.073200 2787 slave.cpp:2458] Forwarding the update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000 to master@10.195.96.237:5050
> I0317 09:56:16.073386 2787 slave.cpp:2391] Sending acknowledgement for status update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000 to executor(1)@10.195.96.237:36021
> I0317 09:56:16.226022 2788 status_update_manager.cpp:389] Received status update acknowledgement (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:16.226093 2788 status_update_manager.hpp:346] Checkpointing ACK for status update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000
> {code}
> {code}
> $ docker ps --no-trunc
> CONTAINER ID                                                       IMAGE                   COMMAND                                             CREATED          STATUS          PORTS   NAMES
> e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330   ecpff/ecp_base:latest   "/bin/sh -c 'env && ls -Ral /mnt && sleep 3600s'"   11 minutes ago   Up 11 minutes           mesos-adb71dc4-c07d-42a9-8fed-264c241668ad
> {code}
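> The *mesos-* prefix in the NAMES column suggests the Docker container name is derived from the Mesos container ID. A minimal sketch of that assumed mapping (the prefix constant and helper are my guesses inferred from the output above, not the actual Mesos source):
> {code}
> // Hedged sketch: how the "mesos-" naming convention appears to work.
> #include <iostream>
> #include <string>
>
> // Assumed convention, inferred from the `docker ps` NAMES column above.
> const std::string DOCKER_NAME_PREFIX = "mesos-";
>
> // Maps a Mesos container ID to the Docker container NAME (not Docker's
> // own 64-character hex CONTAINER ID, which Docker assigns separately).
> std::string dockerName(const std::string& mesosContainerId) {
>   return DOCKER_NAME_PREFIX + mesosContainerId;
> }
>
> int main() {
>   // The ID Mesos checkpoints and later tries to recover...
>   const std::string mesosId = "adb71dc4-c07d-42a9-8fed-264c241668ad";
>   // ...maps to the NAMES column of `docker ps`:
>   std::cout << dockerName(mesosId) << "\n";  // mesos-adb71dc4-...
> }
> {code}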
> 2) I restart the mesos slave process:
> {code}
> $ sudo restart mesos-slave
> {code}
> {code}
> W0317 09:56:21.672792 2754 logging.cpp:81] RAW: Received signal SIGTERM from process 1 of user 0; exiting
> [DEBUG] /usr/sbin/mesos-slave --master=zk://10.195.96.237:2181/mesos --checkpoint --containerizers=docker,mesos --executor_registration_timeout=5mins --hostname=10.195.96.237 --ip=10.195.96.237 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]
> I0317 09:56:48.411228 27147 main.cpp:142] Build: 2015-01-09 02:25:56 by root
> I0317 09:56:48.411418 27147 main.cpp:144] Version: 0.21.1
> I0317 09:56:48.411427 27147 main.cpp:147] Git tag: 0.21.1
> I0317 09:56:48.411434 27147 main.cpp:151] Git SHA: 2ae1ba91e64f92ec71d327e10e6ba9e8ad5477e8
> I0317 09:56:48.521886 27147 containerizer.cpp:100] Using isolation: cgroups/cpu,cgroups/mem
> I0317 09:56:48.570104 27147 linux_launcher.cpp:94] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0317 09:56:48.577716 27147 main.cpp:165] Starting Mesos slave
> 2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
> 2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@716: Client environment:host.name=server-ubuntu-updatecluster
> 2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
> 2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@724: Client environment:os.arch=3.13.0-46-generic
> 2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@725: Client environment:os.version=#75-Ubuntu SMP Tue Feb 10 15:24:04 UTC 2015
> I0317 09:56:48.580657 27147 slave.cpp:169] Slave started on 1)@10.195.96.237:5051
> 2015-03-17 09:56:48,581:27147(0x7f45031d5700):ZOO_INFO@log_env@733: Client environment:user.name=(null)
> 2015-03-17 09:56:48,581:27147(0x7f45031d5700):ZOO_INFO@log_env@741: Client environment:user.home=/root
> 2015-03-17 09:56:48,581:27147(0x7f45031d5700):ZOO_INFO@log_env@753: Client environment:user.dir=/
> 2015-03-17 09:56:48,581:27147(0x7f45031d5700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=10.195.96.237:2181 sessionTimeout=10000 watcher=0x7f450939a6a0 sessionId=0 sessionPasswd=<null> context=0x7f44f0000b70 flags=0
> I0317 09:56:48.581478 27147 slave.cpp:289] Slave resources: ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):1979; disk(*):22974
> I0317 09:56:48.581533 27147 slave.cpp:318] Slave hostname: 10.195.96.237
> I0317 09:56:48.581548 27147 slave.cpp:319] Slave checkpoint: true
> I0317 09:56:48.583925 27175 state.cpp:33] Recovering state from '/tmp/mesos/meta'
> 2015-03-17 09:56:48,586:27147(0x7f4501196700):ZOO_INFO@check_events@1703: initiated connection to server [10.195.96.237:2181]
> 2015-03-17 09:56:48,589:27147(0x7f4501196700):ZOO_INFO@check_events@1750: session establishment complete on server [10.195.96.237:2181], sessionId=0x14c0e422f7f0009, negotiated timeout=10000
> I0317 09:56:48.589994 27181 group.cpp:313] Group process (group(1)@10.195.96.237:5051) connected to ZooKeeper
> I0317 09:56:48.590104 27181 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> I0317 09:56:48.590137 27181 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
> I0317 09:56:48.592377 27181 detector.cpp:138] Detected a new leader: (id='2')
> I0317 09:56:48.592711 27181 group.cpp:659] Trying to get '/mesos/info_0000000002' in ZooKeeper
> I0317 09:56:48.593904 27181 detector.cpp:433] A new leading master (UPID=master@10.195.96.237:5050) is detected
> {code}
> 3) The Mesos slave tries to recover the docker container *adb71dc4-c07d-42a9-8fed-264c241668ad*. Shouldn't it be container *e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330*?
> {code}
> I0317 09:56:48.592711 27181 group.cpp:659] Trying to get '/mesos/info_0000000002' in ZooKeeper
> I0317 09:56:48.593904 27181 detector.cpp:433] A new leading master (UPID=master@10.195.96.237:5050) is detected
> I0317 09:56:48.613559 27175 slave.cpp:3548] Recovering framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:48.613596 27175 slave.cpp:4116] Recovering executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:48.613919 27181 status_update_manager.cpp:197] Recovering status update manager
> I0317 09:56:48.613945 27181 status_update_manager.cpp:205] Recovering executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:48.614933 27181 docker.cpp:767] Recovering Docker containers
> I0317 09:56:48.614982 27181 docker.cpp:814] Recovering container 'adb71dc4-c07d-42a9-8fed-264c241668ad' for executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:48.616395 27174 containerizer.cpp:281] Recovering containerizer
> I0317 09:56:48.616457 27174 containerizer.cpp:323] Recovering container 'adb71dc4-c07d-42a9-8fed-264c241668ad' for executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 20150311-150951-3982541578-5050-50860-0000
> {code}
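> To make the two IDs concrete: a hedged sketch of how recovery could reconcile the checkpointed Mesos container ID with a running Docker container by NAME rather than by Docker's own ID (all types and function names here are hypothetical, not the actual Mesos recovery code):
> {code}
> // Hedged sketch of reconciling a checkpointed Mesos container ID with a
> // running Docker container by name, as the `docker ps` output in step 1
> // suggests. Hypothetical types, not the actual Mesos source.
> #include <iostream>
> #include <string>
> #include <vector>
>
> struct RunningDockerContainer {
>   std::string dockerId;  // 64-character hex ID assigned by Docker
>   std::string name;      // e.g. "mesos-<Mesos container ID>"
> };
>
> // Returns the Docker ID of the container named "mesos-" + mesosId, if any.
> std::string findByMesosId(const std::vector<RunningDockerContainer>& ps,
>                           const std::string& mesosId) {
>   const std::string wanted = "mesos-" + mesosId;
>   for (const RunningDockerContainer& c : ps) {
>     if (c.name == wanted) return c.dockerId;
>   }
>   return "";  // not found: likely launched by another containerizer
> }
>
> int main() {
>   const std::vector<RunningDockerContainer> ps = {
>       {"e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330",
>        "mesos-adb71dc4-c07d-42a9-8fed-264c241668ad"}};
>
>   // Prints the Docker-side ID e4b0de57... for the Mesos-side ID adb71dc4-...
>   std::cout << findByMesosId(ps, "adb71dc4-c07d-42a9-8fed-264c241668ad")
>             << "\n";
> }
> {code}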
> 4) It fails because it *does not find any cgroup for container adb71dc4-c07d-42a9-8fed-264c241668ad*.
> Again, is the "adb71dc4-c07d-42a9-8fed-264c241668ad" id something internal to Mesos?
> The docker container has an id of *e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330*, and for example _/sys/fs/cgroup/memory/docker/e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330_ and _/sys/fs/cgroup/freezer/docker/e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330_ are present on the file system.
> {code}
> I0317 09:56:48.616981 27174 linux_launcher.cpp:169] Couldn't find freezer cgroup for container adb71dc4-c07d-42a9-8fed-264c241668ad, assuming already destroyed
> W0317 09:56:48.634775 27175 cpushare.cpp:203] Couldn't find cgroup for container adb71dc4-c07d-42a9-8fed-264c241668ad
> E0317 09:56:48.636764 27180 containerizer.cpp:1140] Error in a resource limitation for container adb71dc4-c07d-42a9-8fed-264c241668ad: Unknown container
> I0317 09:56:48.636808 27180 containerizer.cpp:946] Destroying container 'adb71dc4-c07d-42a9-8fed-264c241668ad'
> E0317 09:56:48.640025 27175 containerizer.cpp:1140] Error in a resource limitation for container adb71dc4-c07d-42a9-8fed-264c241668ad: Unknown container
> I0317 09:56:48.823640 27177 slave.cpp:3407] Sending reconnect request to executor test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000 at executor(1)@10.195.96.237:36021
> I0317 09:56:48.826930 27178 slave.cpp:2039] Re-registering executor test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000
> W0317 09:56:49.829905 27181 containerizer.cpp:888] Skipping resource statistic for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
> W0317 09:56:49.830080 27181 containerizer.cpp:888] Skipping resource statistic for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
> I0317 09:56:50.824409 27178 slave.cpp:2164] Cleaning up un-reregistered executors
> I0317 09:56:50.824524 27178 slave.cpp:3466] Finished recovery
> {code}
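> A sketch of why the freezer lookup above fails, assuming the Mesos containerizer only checks its own freezer subtree (the /mesos/<id> layout below is an assumption) while the Docker daemon created the cgroup under /docker/<docker id>:
> {code}
> // Hedged sketch, not the actual linux_launcher code. Assumes the Mesos
> // containerizer looks for "<freezer>/mesos/<mesos container ID>" while a
> // Docker-launched container's cgroup lives at "<freezer>/docker/<docker ID>".
> #include <iostream>
> #include <string>
> #include <sys/stat.h>
>
> // True if the given cgroup directory exists on this host.
> bool cgroupExists(const std::string& path) {
>   struct stat st;
>   return ::stat(path.c_str(), &st) == 0 && S_ISDIR(st.st_mode);
> }
>
> int main() {
>   const std::string freezer = "/sys/fs/cgroup/freezer";
>   const std::string mesosId = "adb71dc4-c07d-42a9-8fed-264c241668ad";
>   const std::string dockerId =
>       "e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330";
>
>   std::cout << std::boolalpha;
>
>   // What the recovery code appears to look for (and does not find):
>   std::cout << cgroupExists(freezer + "/mesos/" + mesosId) << "\n";
>
>   // Where the container's cgroup actually is, per the findings above:
>   std::cout << cgroupExists(freezer + "/docker/" + dockerId) << "\n";
> }
> {code}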
> 5) The error message then loops in the slave log, spamming the log file:
> {code}
> W0317 09:56:50.830857 27175 containerizer.cpp:888] Skipping resource statistic for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
> W0317 09:56:50.830904 27175 containerizer.cpp:888] Skipping resource statistic for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
> W0317 09:56:51.832145 27177 containerizer.cpp:888] Skipping resource statistic for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)