[ https://issues.apache.org/jira/browse/MESOS-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370870#comment-14370870 ]
Timothy Chen commented on MESOS-2508:
-------------------------------------

Hi there, I think this is a known issue. Basically, every containerizer attempts to recover all checkpointed tasks, even the ones it did not launch itself, so in this case the Docker containerizer is trying to recover Mesos containerizer tasks.
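To illustrate the point, here is a minimal C++ sketch of that fan-out recovery pattern; all types and names below are made up for illustration and are not the actual Mesos source:
{code}
// Illustrative sketch only: hypothetical types, not the actual Mesos source.
#include <iostream>
#include <string>
#include <vector>

// A checkpointed container as read back from the slave's meta directory.
struct ContainerState {
  std::string containerId;
  std::string launchedBy;  // "docker" or "mesos"
};

struct Containerizer {
  std::string name;

  // Naive recovery: each containerizer is handed every checkpointed
  // container, including ones it never launched.
  void recover(const std::vector<ContainerState>& states) const {
    for (const ContainerState& state : states) {
      if (state.launchedBy != name) {
        // This is the source of the "Unknown container" noise: the
        // container's cgroups and metadata belong to the other containerizer.
        std::cerr << name << ": couldn't find cgroup for container "
                  << state.containerId << "\n";
      } else {
        std::cout << name << ": recovered container "
                  << state.containerId << "\n";
      }
    }
  }
};

int main() {
  const std::vector<ContainerState> checkpointed = {
      {"adb71dc4-c07d-42a9-8fed-264c241668ad", "docker"}};

  // With --containerizers=docker,mesos, BOTH containerizers are asked to
  // recover, so each one also sees the other's containers.
  for (const Containerizer& c :
       {Containerizer{"docker"}, Containerizer{"mesos"}}) {
    c.recover(checkpointed);
  }
}
{code}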
> Slave recovering a docker container results in "Unknown container" error
> -------------------------------------------------------------------------
>
>                 Key: MESOS-2508
>                 URL: https://issues.apache.org/jira/browse/MESOS-2508
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, docker, slave
>    Affects Versions: 0.21.1
>         Environment: Ubuntu 14.04.2 LTS
> Docker 1.5.0 (same error with 1.4.1)
> Mesos 0.21.1 installed from mesosphere ubuntu repo
> Marathon 0.8.0 installed from mesosphere ubuntu repo
>            Reporter: Geoffroy Jabouley
>
> I'm seeing error logs occurring during slave recovery of a Mesos task running inside a docker container.
> It does not impede the slave recovery process, as the mesos task is still active and running on the slave after the recovery.
> But something is not working properly when the slave recovers my docker container: the slave reports it as an "Unknown container".
> Cluster status:
> - 1 mesos-master, 1 mesos-slave, 1 marathon framework running on the host
> - checkpointing is activated on both slave and framework
> - the native docker containerizer is used
> - 1 mesos task, started using marathon, is running inside a docker container and is monitored by the mesos-slave
> Action:
> - restart the mesos-slave process (sudo restart mesos-slave)
> Expected:
> - docker container still running
> - mesos task still running
> - no error in the mesos slave log regarding the recovery process
> Seen:
> - docker container still running
> - mesos task still running
> - {color:red}Several *Unknown container* errors in the mesos slave log during the recovery process{color}
> -----------------------------------
> For what it's worth, here are my investigations:
> 1) The mesos task starts fine in the docker container *e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330*. The docker container name is set to *mesos-adb71dc4-c07d-42a9-8fed-264c241668ad*, by the Mesos docker containerizer _I guess_...
> {code}
> I0317 09:56:14.300439 2784 slave.cpp:1083] Got assigned task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:14.380702 2784 slave.cpp:1193] Launching task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:14.384466 2784 slave.cpp:3997] Launching executor test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000 in work directory '/tmp/mesos/slaves/20150312-145235-3982541578-5050-1421-S0/frameworks/20150311-150951-3982541578-5050-50860-0000/executors/test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799/runs/adb71dc4-c07d-42a9-8fed-264c241668ad'
> I0317 09:56:14.390207 2784 slave.cpp:1316] Queuing task 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' for executor test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework '20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:14.421787 2782 docker.cpp:927] Starting container 'adb71dc4-c07d-42a9-8fed-264c241668ad' for task 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' (and executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799') of framework '20150311-150951-3982541578-5050-50860-0000'
> I0317 09:56:15.784143 2781 docker.cpp:633] Checkpointing pid 27080 to '/tmp/mesos/meta/slaves/20150312-145235-3982541578-5050-1421-S0/frameworks/20150311-150951-3982541578-5050-50860-0000/executors/test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799/runs/adb71dc4-c07d-42a9-8fed-264c241668ad/pids/forked.pid'
> I0317 09:56:15.789443 2784 slave.cpp:2840] Monitoring executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework '20150311-150951-3982541578-5050-50860-0000' in container 'adb71dc4-c07d-42a9-8fed-264c241668ad'
> I0317 09:56:15.862642 2784 slave.cpp:1860] Got registration for executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 20150311-150951-3982541578-5050-50860-0000 from executor(1)@10.195.96.237:36021
> I0317 09:56:15.865319 2784 slave.cpp:1979] Flushing queued task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:15.885414 2787 slave.cpp:2215] Handling status update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000 from executor(1)@10.195.96.237:36021
> I0317 09:56:15.885902 2787 status_update_manager.cpp:317] Received status update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:15.886327 2787 status_update_manager.hpp:346] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:15.904325 2785 docker.cpp:1298] Updated 'cpu.shares' to 204 at /sys/fs/cgroup/cpu/docker/e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330 for container adb71dc4-c07d-42a9-8fed-264c241668ad
> I0317 09:56:15.904831 2785 docker.cpp:1333] Updated 'memory.soft_limit_in_bytes' to 96MB for container adb71dc4-c07d-42a9-8fed-264c241668ad
> I0317 09:56:15.905591 2785 docker.cpp:1359] Updated 'memory.limit_in_bytes' to 96MB at /sys/fs/cgroup/memory/docker/e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330 for container adb71dc4-c07d-42a9-8fed-264c241668ad
> I0317 09:56:16.073200 2787 slave.cpp:2458] Forwarding the update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000 to master@10.195.96.237:5050
> I0317 09:56:16.073386 2787 slave.cpp:2391] Sending acknowledgement for status update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000 to executor(1)@10.195.96.237:36021
> I0317 09:56:16.226022 2788 status_update_manager.cpp:389] Received status update acknowledgement (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:16.226093 2788 status_update_manager.hpp:346] Checkpointing ACK for status update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000
> {code}
> {code}
> $ docker ps --no-trunc
> CONTAINER ID                                                       IMAGE                   COMMAND                                             CREATED          STATUS          PORTS   NAMES
> e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330   ecpff/ecp_base:latest   "/bin/sh -c 'env && ls -Ral /mnt && sleep 3600s'"   11 minutes ago   Up 11 minutes           mesos-adb71dc4-c07d-42a9-8fed-264c241668ad
> {code}
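> The *mesos-* prefix in the NAMES column suggests the Docker container name is derived from the Mesos container ID. A minimal sketch of that assumed mapping (the prefix constant and helper are my guesses inferred from the output above, not the actual Mesos source):
> {code}
> // Hedged sketch: how the "mesos-" naming convention appears to work.
> #include <iostream>
> #include <string>
>
> // Assumed convention, inferred from the `docker ps` NAMES column above.
> const std::string DOCKER_NAME_PREFIX = "mesos-";
>
> // Maps a Mesos container ID to the Docker container NAME (not Docker's
> // own 64-character hex CONTAINER ID, which Docker assigns separately).
> std::string dockerName(const std::string& mesosContainerId) {
>   return DOCKER_NAME_PREFIX + mesosContainerId;
> }
>
> int main() {
>   // The ID Mesos checkpoints and later tries to recover...
>   const std::string mesosId = "adb71dc4-c07d-42a9-8fed-264c241668ad";
>   // ...maps to the NAMES column of `docker ps`:
>   std::cout << dockerName(mesosId) << "\n";  // mesos-adb71dc4-...
> }
> {code}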
> 2) I restart the mesos slave process:
> {code}
> $ sudo restart mesos-slave
> {code}
> {code}
> W0317 09:56:21.672792 2754 logging.cpp:81] RAW: Received signal SIGTERM from process 1 of user 0; exiting
> [DEBUG] /usr/sbin/mesos-slave --master=zk://10.195.96.237:2181/mesos --checkpoint --containerizers=docker,mesos --executor_registration_timeout=5mins --hostname=10.195.96.237 --ip=10.195.96.237 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]
> I0317 09:56:48.411228 27147 main.cpp:142] Build: 2015-01-09 02:25:56 by root
> I0317 09:56:48.411418 27147 main.cpp:144] Version: 0.21.1
> I0317 09:56:48.411427 27147 main.cpp:147] Git tag: 0.21.1
> I0317 09:56:48.411434 27147 main.cpp:151] Git SHA: 2ae1ba91e64f92ec71d327e10e6ba9e8ad5477e8
> I0317 09:56:48.521886 27147 containerizer.cpp:100] Using isolation: cgroups/cpu,cgroups/mem
> I0317 09:56:48.570104 27147 linux_launcher.cpp:94] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0317 09:56:48.577716 27147 main.cpp:165] Starting Mesos slave
> 2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
> 2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@716: Client environment:host.name=server-ubuntu-updatecluster
> 2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
> 2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@724: Client environment:os.arch=3.13.0-46-generic
> 2015-03-17 09:56:48,578:27147(0x7f45031d5700):ZOO_INFO@log_env@725: Client environment:os.version=#75-Ubuntu SMP Tue Feb 10 15:24:04 UTC 2015
> I0317 09:56:48.580657 27147 slave.cpp:169] Slave started on 1)@10.195.96.237:5051
> 2015-03-17 09:56:48,581:27147(0x7f45031d5700):ZOO_INFO@log_env@733: Client environment:user.name=(null)
> 2015-03-17 09:56:48,581:27147(0x7f45031d5700):ZOO_INFO@log_env@741: Client environment:user.home=/root
> 2015-03-17 09:56:48,581:27147(0x7f45031d5700):ZOO_INFO@log_env@753: Client environment:user.dir=/
> 2015-03-17 09:56:48,581:27147(0x7f45031d5700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=10.195.96.237:2181 sessionTimeout=10000 watcher=0x7f450939a6a0 sessionId=0 sessionPasswd=<null> context=0x7f44f0000b70 flags=0
> I0317 09:56:48.581478 27147 slave.cpp:289] Slave resources: ports(*):[31000-32000, 80-443]; cpus(*):2; mem(*):1979; disk(*):22974
> I0317 09:56:48.581533 27147 slave.cpp:318] Slave hostname: 10.195.96.237
> I0317 09:56:48.581548 27147 slave.cpp:319] Slave checkpoint: true
> I0317 09:56:48.583925 27175 state.cpp:33] Recovering state from '/tmp/mesos/meta'
> 2015-03-17 09:56:48,586:27147(0x7f4501196700):ZOO_INFO@check_events@1703: initiated connection to server [10.195.96.237:2181]
> 2015-03-17 09:56:48,589:27147(0x7f4501196700):ZOO_INFO@check_events@1750: session establishment complete on server [10.195.96.237:2181], sessionId=0x14c0e422f7f0009, negotiated timeout=10000
> I0317 09:56:48.589994 27181 group.cpp:313] Group process (group(1)@10.195.96.237:5051) connected to ZooKeeper
> I0317 09:56:48.590104 27181 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> I0317 09:56:48.590137 27181 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
> I0317 09:56:48.592377 27181 detector.cpp:138] Detected a new leader: (id='2')
> I0317 09:56:48.592711 27181 group.cpp:659] Trying to get '/mesos/info_0000000002' in ZooKeeper
> I0317 09:56:48.593904 27181 detector.cpp:433] A new leading master (UPID=master@10.195.96.237:5050) is detected
> {code}
> 3) The Mesos slave tries to recover the docker container *adb71dc4-c07d-42a9-8fed-264c241668ad*. Shouldn't it be container *e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330*?
> {code}
> I0317 09:56:48.592711 27181 group.cpp:659] Trying to get '/mesos/info_0000000002' in ZooKeeper
> I0317 09:56:48.593904 27181 detector.cpp:433] A new leading master (UPID=master@10.195.96.237:5050) is detected
> I0317 09:56:48.613559 27175 slave.cpp:3548] Recovering framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:48.613596 27175 slave.cpp:4116] Recovering executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:48.613919 27181 status_update_manager.cpp:197] Recovering status update manager
> I0317 09:56:48.613945 27181 status_update_manager.cpp:205] Recovering executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:48.614933 27181 docker.cpp:767] Recovering Docker containers
> I0317 09:56:48.614982 27181 docker.cpp:814] Recovering container 'adb71dc4-c07d-42a9-8fed-264c241668ad' for executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 20150311-150951-3982541578-5050-50860-0000
> I0317 09:56:48.616395 27174 containerizer.cpp:281] Recovering containerizer
> I0317 09:56:48.616457 27174 containerizer.cpp:323] Recovering container 'adb71dc4-c07d-42a9-8fed-264c241668ad' for executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 20150311-150951-3982541578-5050-50860-0000
> {code}
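> To make the two IDs concrete: a hedged sketch of how recovery could reconcile the checkpointed Mesos container ID with a running Docker container by NAME rather than by Docker's own ID (all types and function names here are hypothetical, not the actual Mesos recovery code):
> {code}
> // Hedged sketch of reconciling a checkpointed Mesos container ID with a
> // running Docker container by name, as the `docker ps` output in step 1
> // suggests. Hypothetical types, not the actual Mesos source.
> #include <iostream>
> #include <string>
> #include <vector>
>
> struct RunningDockerContainer {
>   std::string dockerId;  // 64-character hex ID assigned by Docker
>   std::string name;      // e.g. "mesos-<Mesos container ID>"
> };
>
> // Returns the Docker ID of the container named "mesos-" + mesosId, if any.
> std::string findByMesosId(const std::vector<RunningDockerContainer>& ps,
>                           const std::string& mesosId) {
>   const std::string wanted = "mesos-" + mesosId;
>   for (const RunningDockerContainer& c : ps) {
>     if (c.name == wanted) return c.dockerId;
>   }
>   return "";  // not found: likely launched by another containerizer
> }
>
> int main() {
>   const std::vector<RunningDockerContainer> ps = {
>       {"e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330",
>        "mesos-adb71dc4-c07d-42a9-8fed-264c241668ad"}};
>
>   // Prints the Docker-side ID e4b0de57... for the Mesos-side ID adb71dc4-...
>   std::cout << findByMesosId(ps, "adb71dc4-c07d-42a9-8fed-264c241668ad")
>             << "\n";
> }
> {code}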
> 4) It fails because it *does not find any cgroup for container adb71dc4-c07d-42a9-8fed-264c241668ad*.
> Again, is the "adb71dc4-c07d-42a9-8fed-264c241668ad" id something internal to Mesos?
> The docker container has an id of *e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330*, and for example _/sys/fs/cgroup/memory/docker/e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330_ and _/sys/fs/cgroup/freezer/docker/e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330_ are present on the file system.
> {code}
> I0317 09:56:48.616981 27174 linux_launcher.cpp:169] Couldn't find freezer cgroup for container adb71dc4-c07d-42a9-8fed-264c241668ad, assuming already destroyed
> W0317 09:56:48.634775 27175 cpushare.cpp:203] Couldn't find cgroup for container adb71dc4-c07d-42a9-8fed-264c241668ad
> E0317 09:56:48.636764 27180 containerizer.cpp:1140] Error in a resource limitation for container adb71dc4-c07d-42a9-8fed-264c241668ad: Unknown container
> I0317 09:56:48.636808 27180 containerizer.cpp:946] Destroying container 'adb71dc4-c07d-42a9-8fed-264c241668ad'
> E0317 09:56:48.640025 27175 containerizer.cpp:1140] Error in a resource limitation for container adb71dc4-c07d-42a9-8fed-264c241668ad: Unknown container
> I0317 09:56:48.823640 27177 slave.cpp:3407] Sending reconnect request to executor test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000 at executor(1)@10.195.96.237:36021
> I0317 09:56:48.826930 27178 slave.cpp:2039] Re-registering executor test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860-0000
> W0317 09:56:49.829905 27181 containerizer.cpp:888] Skipping resource statistic for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
> W0317 09:56:49.830080 27181 containerizer.cpp:888] Skipping resource statistic for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
> I0317 09:56:50.824409 27178 slave.cpp:2164] Cleaning up un-reregistered executors
> I0317 09:56:50.824524 27178 slave.cpp:3466] Finished recovery
> {code}
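> A sketch of why the freezer lookup above fails, assuming the Mesos containerizer only checks its own freezer subtree (the /mesos/<id> layout below is an assumption) while the Docker daemon created the cgroup under /docker/<docker id>:
> {code}
> // Hedged sketch, not the actual linux_launcher code. Assumes the Mesos
> // containerizer looks for "<freezer>/mesos/<mesos container ID>" while a
> // Docker-launched container's cgroup lives at "<freezer>/docker/<docker ID>".
> #include <iostream>
> #include <string>
> #include <sys/stat.h>
>
> // True if the given cgroup directory exists on this host.
> bool cgroupExists(const std::string& path) {
>   struct stat st;
>   return ::stat(path.c_str(), &st) == 0 && S_ISDIR(st.st_mode);
> }
>
> int main() {
>   const std::string freezer = "/sys/fs/cgroup/freezer";
>   const std::string mesosId = "adb71dc4-c07d-42a9-8fed-264c241668ad";
>   const std::string dockerId =
>       "e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330";
>
>   std::cout << std::boolalpha;
>
>   // What the recovery code appears to look for (and does not find):
>   std::cout << cgroupExists(freezer + "/mesos/" + mesosId) << "\n";
>
>   // Where the container's cgroup actually is, per the findings above:
>   std::cout << cgroupExists(freezer + "/docker/" + dockerId) << "\n";
> }
> {code}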
> 5) The error message then loops in the slave log, spamming the log file:
> {code}
> W0317 09:56:50.830857 27175 containerizer.cpp:888] Skipping resource statistic for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
> W0317 09:56:50.830904 27175 containerizer.cpp:888] Skipping resource statistic for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
> W0317 09:56:51.832145 27177 containerizer.cpp:888] Skipping resource statistic for container adb71dc4-c07d-42a9-8fed-264c241668ad because: Unknown container
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)