[ 
https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995653#comment-16995653
 ] 

Dalton Matos Coelho Barreto commented on MESOS-10066:
-----------------------------------------------------

Hello [~abudnik],

Any tips on this so I can debug further? Do you think we can ping anyone 
else to help me with this issue?

 

Thanks,

> mesos-docker-executor process dies when agent stops. Recovery fails when 
> agent returns
> --------------------------------------------------------------------------------------
>
>                 Key: MESOS-10066
>                 URL: https://issues.apache.org/jira/browse/MESOS-10066
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization, docker, executor
>    Affects Versions: 1.7.3
>            Reporter: Dalton Matos Coelho Barreto
>            Priority: Critical
>         Attachments: logs-after.txt, logs-before.txt
>
>
> Hello all,
> The documentation about Agent Recovery shows two conditions for the recovery 
> to be possible:
>  - The agent must have recovery enabled (default true?);
>  - The scheduler must register itself saying that it has checkpointing 
> enabled.
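> For reference, this is roughly how the agent-side flags relevant to recovery 
> would look; the values below are a minimal sketch based on my reading of the 
> docs, not the exact command line of my agent:
> {noformat}
>     # --recover=reconnect is the default: reconnect with still-running executors
>     # --recovery_timeout bounds how long recovery may take (default 15mins)
>     mesos-agent \
>       --work_dir=/var/lib/mesos/agent \
>       --recover=reconnect \
>       --recovery_timeout=15mins
> {noformat}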
> In my tests I'm using Marathon as the scheduler, and Mesos itself sees 
> Marathon as a checkpoint-enabled scheduler:
> {noformat}
>     $ curl -sL 10.234.172.27:5050/state | jq '.frameworks[] | {"name": .name, 
> "id": .id, "checkpoint": .checkpoint, "active": .active}'
>     {
>       "name": "asgard-chronos",
>       "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001",
>       "checkpoint": true,
>       "active": true
>     }
>     {
>       "name": "marathon",
>       "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000",
>       "checkpoint": true,
>       "active": true
>     }
> {noformat}
> Here is what I'm using:
>  # Mesos Master, 1.4.1
>  # Mesos Agent 1.7.3
>  # Using docker image {{mesos/mesos-centos:1.7.x}}
>  # Docker sock mounted from the host
>  # Docker binary also mounted from the host
>  # Marathon: 1.4.12
>  # Docker
> {noformat}
>     Client: Docker Engine - Community
>      Version:           19.03.5
>      API version:       1.39 (downgraded from 1.40)
>      Go version:        go1.12.12
>      Git commit:        633a0ea838
>      Built:             Wed Nov 13 07:22:05 2019
>      OS/Arch:           linux/amd64
>      Experimental:      false
>     
>     Server: Docker Engine - Community
>      Engine:
>       Version:          18.09.2
>       API version:      1.39 (minimum version 1.12)
>       Go version:       go1.10.6
>       Git commit:       6247962
>       Built:            Sun Feb 10 03:42:13 2019
>       OS/Arch:          linux/amd64
>       Experimental:     false
> {noformat}
> h2. The problem
> Here is the Marathon test app, a simple {{sleep 99d}} based on the {{debian}} 
> docker image.
> {noformat}
>     {
>       "id": "/sleep",
>       "cmd": "sleep 99d",
>       "cpus": 0.1,
>       "mem": 128,
>       "disk": 0,
>       "instances": 1,
>       "constraints": [],
>       "acceptedResourceRoles": [
>         "*"
>       ],
>       "container": {
>         "type": "DOCKER",
>         "volumes": [],
>         "docker": {
>           "image": "debian",
>           "network": "HOST",
>           "privileged": false,
>           "parameters": [],
>           "forcePullImage": true
>         }
>       },
>       "labels": {},
>       "portDefinitions": []
>     }
> {noformat}
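> For completeness, this is roughly how such an app definition gets submitted 
> to Marathon's REST API; the host and file name below are just placeholders:
> {noformat}
>     curl -X POST http://<marathon-host>:8080/v2/apps \
>       -H 'Content-Type: application/json' \
>       -d @sleep.json
> {noformat}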
> This task runs fine and gets scheduled on the right agent, which is running 
> Mesos Agent 1.7.3 (using the docker image {{mesos/mesos-centos:1.7.x}}).
> Here is a sample log:
> {noformat}
>     mesos-slave_1  | I1205 13:24:21.391464 19849 slave.cpp:2403] Authorizing 
> task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
>     mesos-slave_1  | I1205 13:24:21.392707 19849 slave.cpp:2846] Launching 
> task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
>     mesos-slave_1  | I1205 13:24:21.392895 19849 paths.cpp:748] Creating 
> sandbox 
> '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
>     mesos-slave_1  | I1205 13:24:21.394399 19849 paths.cpp:748] Creating 
> sandbox 
> '/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
>     mesos-slave_1  | I1205 13:24:21.394918 19849 slave.cpp:9068] Launching 
> executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 with resources 
> [{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}]
>  in work directory 
> '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
>     mesos-slave_1  | I1205 13:24:21.396499 19849 slave.cpp:3078] Queued task 
> 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for executor 
> 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
>     mesos-slave_1  | I1205 13:24:21.397038 19849 slave.cpp:3526] Launching 
> container 53ec0ef3-3290-476a-b2b6-385099e9b923 for executor 
> 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
>     mesos-slave_1  | I1205 13:24:21.398028 19846 docker.cpp:1177] Starting 
> container '53ec0ef3-3290-476a-b2b6-385099e9b923' for task 
> 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' (and executor 
> 'sleep.8c187c41-1762-11ea-a2e5-02429217540f') of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
>     mesos-slave_1  | W1205 13:24:22.576869 19846 slave.cpp:8496] Failed to 
> get resource statistics for executor 
> 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000: Failed to run '/usr/bin/docker -H 
> unix:///var/run/docker.sock inspect --type=container 
> mesos-53ec0ef3-3290-476a-b2b6-385099e9b923': exited with status 1; 
> stderr='Error: No such container: mesos-53ec0ef3-3290-476a-b2b6-385099e9b923'
>     mesos-slave_1  | I1205 13:24:24.094985 19853 docker.cpp:792] 
> Checkpointing pid 12435 to 
> '/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/pids/forked.pid'
>     mesos-slave_1  | I1205 13:24:24.343099 19848 slave.cpp:4839] Got 
> registration for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of 
> framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from 
> executor(1)@10.234.172.56:16653
>     mesos-slave_1  | I1205 13:24:24.345593 19848 docker.cpp:1685] Ignoring 
> updating container 53ec0ef3-3290-476a-b2b6-385099e9b923 because resources 
> passed to update are identical to existing resources
>     mesos-slave_1  | I1205 13:24:24.345945 19848 slave.cpp:3296] Sending 
> queued task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' to executor 
> 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 at executor(1)@10.234.172.56:16653
>     mesos-slave_1  | I1205 13:24:24.362699 19853 slave.cpp:5310] Handling 
> status update TASK_STARTING (Status UUID: 
> 8b06fe8c-d709-453b-817d-2948e50782c9) for task 
> sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from executor(1)@10.234.172.56:16653
>     mesos-slave_1  | I1205 13:24:24.363222 19853 
> task_status_update_manager.cpp:328] Received task status update TASK_STARTING 
> (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task 
> sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
>     mesos-slave_1  | I1205 13:24:24.364115 19853 
> task_status_update_manager.cpp:842] Checkpointing UPDATE for task status 
> update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for 
> task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
>     mesos-slave_1  | I1205 13:24:24.364993 19847 slave.cpp:5815] Forwarding 
> the update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) 
> for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to master@10.234.172.34:5050
>     mesos-slave_1  | I1205 13:24:24.365594 19847 slave.cpp:5726] Sending 
> acknowledgement for status update TASK_STARTING (Status UUID: 
> 8b06fe8c-d709-453b-817d-2948e50782c9) for task 
> sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to executor(1)@10.234.172.56:16653
>     mesos-slave_1  | I1205 13:24:24.401759 19846 
> task_status_update_manager.cpp:401] Received task status update 
> acknowledgement (UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task 
> sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
>     mesos-slave_1  | I1205 13:24:24.401926 19846 
> task_status_update_manager.cpp:842] Checkpointing ACK for task status update 
> TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task 
> sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
>     mesos-slave_1  | I1205 13:24:30.481829 19850 slave.cpp:5310] Handling 
> status update TASK_RUNNING (Status UUID: 
> 480abd48-6a85-4703-a2f2-0455f5c4c053) for task 
> sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from executor(1)@10.234.172.56:16653
>     mesos-slave_1  | I1205 13:24:30.482340 19848 
> task_status_update_manager.cpp:328] Received task status update TASK_RUNNING 
> (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task 
> sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
>     mesos-slave_1  | I1205 13:24:30.482550 19848 
> task_status_update_manager.cpp:842] Checkpointing UPDATE for task status 
> update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for 
> task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
>     mesos-slave_1  | I1205 13:24:30.483163 19848 slave.cpp:5815] Forwarding 
> the update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) 
> for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to master@10.234.172.34:5050
>     mesos-slave_1  | I1205 13:24:30.483664 19848 slave.cpp:5726] Sending 
> acknowledgement for status update TASK_RUNNING (Status UUID: 
> 480abd48-6a85-4703-a2f2-0455f5c4c053) for task 
> sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to executor(1)@10.234.172.56:16653
>     mesos-slave_1  | I1205 13:24:30.557307 19852 
> task_status_update_manager.cpp:401] Received task status update 
> acknowledgement (UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task 
> sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
>     mesos-slave_1  | I1205 13:24:30.557467 19852 
> task_status_update_manager.cpp:842] Checkpointing ACK for task status update 
> TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task 
> sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
> {noformat}
> An important part of this log:
> {noformat}
> Sending acknowledgement for status update TASK_RUNNING for task 
> sleep.8c187c41-1762-11ea-a2e5-02429217540f to executor(1)@10.234.172.56:16653
> {noformat}
> Here we have the address of the executor responsible for our newly created 
> task: {{executor(1)@10.234.172.56:16653}}.
> Looking at the OS process list, we see one instance of 
> {{mesos-docker-executor}}:
> {noformat}
>     ps aux | grep executor
>     root     12435  0.5  1.3 851796 54100 ?        Ssl  10:24   0:03 
> mesos-docker-executor --cgroups_enable_cfs=true 
> --container=mesos-53ec0ef3-3290-476a-b2b6-385099e9b923 
> --docker=/usr/bin/docker --docker_socket=/var/run/docker.sock --help=false 
> --initialize_driver_logging=true --launcher_dir=/usr/libexec/mesos 
> --logbufsecs=0 --logging_level=INFO --mapped_directory=/mnt/mesos/sandbox 
> --quiet=false 
> --sandbox_directory=/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923
>  --stop_timeout=10secs
>     root     12456  0.0  0.8 152636 34548 ?        Sl   10:24   0:00 
> /usr/bin/docker -H unix:///var/run/docker.sock run --cpu-shares 102 
> --cpu-quota 10000 --memory 134217728 -e HOST=10.234.172.56 -e 
> MARATHON_APP_DOCKER_IMAGE=debian -e MARATHON_APP_ID=/sleep -e 
> MARATHON_APP_LABELS=HOLLOWMAN_DEFAULT_SCALE -e 
> MARATHON_APP_LABEL_HOLLOWMAN_DEFAULT_SCALE=1 -e 
> MARATHON_APP_RESOURCE_CPUS=0.1 -e MARATHON_APP_RESOURCE_DISK=0.0 -e 
> MARATHON_APP_RESOURCE_GPUS=0 -e MARATHON_APP_RESOURCE_MEM=128.0 -e 
> MARATHON_APP_VERSION=2019-12-05T13:24:18.206Z -e 
> MESOS_CONTAINER_NAME=mesos-53ec0ef3-3290-476a-b2b6-385099e9b923 -e 
> MESOS_SANDBOX=/mnt/mesos/sandbox -e 
> MESOS_TASK_ID=sleep.8c187c41-1762-11ea-a2e5-02429217540f -v 
> /var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923:/mnt/mesos/sandbox
>  --net host --entrypoint /bin/sh --name 
> mesos-53ec0ef3-3290-476a-b2b6-385099e9b923 
> --label=hollowman.appname=/asgard/sleep 
> --label=MESOS_TASK_ID=sleep.8c187c41-1762-11ea-a2e5-02429217540f debian -c 
> sleep 99d
>     root     16776  0.0  0.0   7960   812 pts/3    S+   10:34   0:00 tail -f 
> /var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/stderr
> {noformat}
> Here we have 3 processes:
>  - Mesos executor;
>  - The container that this executor started, running {{sleep 99d}};
>  - {{tail -f}} looking into {{stderr}} of this executor.
> The {{stderr}} content is as follows:
> {noformat}
>     $ tail -f 
> /var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/stderr
>     I1205 13:24:24.337672 12435 exec.cpp:162] Version: 1.7.3
>     I1205 13:24:24.347107 12449 exec.cpp:236] Executor registered on agent 
> 79ad3a13-b567-4273-ac8c-30378d35a439-S60499
>     I1205 13:24:24.349907 12451 executor.cpp:130] Registered docker executor 
> on 10.234.172.56
>     I1205 13:24:24.350291 12449 executor.cpp:186] Starting task 
> sleep.8c187c41-1762-11ea-a2e5-02429217540f
>     WARNING: Your kernel does not support swap limit capabilities or the 
> cgroup is not mounted. Memory limited without swap.
>     W1205 13:24:29.363168 12455 executor.cpp:253] Docker inspect timed out 
> after 5secs for container 'mesos-53ec0ef3-3290-476a-b2b6-385099e9b923'
> {noformat}
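> The inspect that the executor runs can be reproduced manually from the agent 
> host; this simply mirrors the command already shown in the agent log above:
> {noformat}
>     /usr/bin/docker -H unix:///var/run/docker.sock inspect --type=container \
>         mesos-53ec0ef3-3290-476a-b2b6-385099e9b923
> {noformat}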
> While the Mesos Agent is still running, it is possible to connect to this 
> executor's port:
> {noformat}
>     telnet 10.234.172.56 16653
>     Trying 10.234.172.56...
>     Connected to 10.234.172.56.
>     Escape character is '^]'.
>     ^]
>     telnet> Connection closed.
> {noformat}
> As soon as the Agent shuts down (a simple {{docker stop}}), this line appears 
> in the {{stderr}} log of the executor:
> {noformat}
>     I1205 13:40:31.160290 12452 exec.cpp:518] Agent exited, but framework has 
> checkpointing enabled. Waiting 15mins to reconnect with agent 
> 79ad3a13-b567-4273-ac8c-30378d35a439-S60499
> {noformat}
> And now, looking at the OS process list, the executor process is gone:
> {noformat}
>     $ ps aux | grep executor
>     root     16776  0.0  0.0   7960   812 pts/3    S+   10:34   0:00 tail -f 
> /var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/stderr
> {noformat}
> and we can't telnet to that port anymore:
> {noformat}
>     telnet 10.234.172.56 16653
>     Trying 10.234.172.56...
>     telnet: Unable to connect to remote host: Connection refused
> {noformat}
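> The same can be seen with a plain listener check on the agent host, assuming 
> {{ss}} is available there:
> {noformat}
>     $ ss -ltn | grep 16653
>     # no output here means nothing is listening on the executor port anymore
> {noformat}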
> When we run the Mesos Agent again, we see this:
> {noformat}
>     mesos-slave_1  | I1205 13:42:39.925992 18421 docker.cpp:899] Recovering 
> Docker containers
>     mesos-slave_1  | I1205 13:42:44.006150 18421 docker.cpp:912] Got the list 
> of Docker containers
>     mesos-slave_1  | I1205 13:42:44.010700 18421 docker.cpp:1009] Recovering 
> container '53ec0ef3-3290-476a-b2b6-385099e9b923' for executor 
> 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
>     mesos-slave_1  | W1205 13:42:44.011874 18421 docker.cpp:1053] Failed to 
> connect to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000: Failed to connect to 
> 10.234.172.56:16653: Connection refused
>     mesos-slave_1  | I1205 13:42:44.016306 18421 docker.cpp:2564] Executor 
> for container 53ec0ef3-3290-476a-b2b6-385099e9b923 has exited
>     mesos-slave_1  | I1205 13:42:44.016412 18421 docker.cpp:2335] Destroying 
> container 53ec0ef3-3290-476a-b2b6-385099e9b923 in RUNNING state
>     mesos-slave_1  | I1205 13:42:44.016705 18421 docker.cpp:1133] Finished 
> processing orphaned Docker containers
>     mesos-slave_1  | I1205 13:42:44.017102 18421 docker.cpp:2385] Running 
> docker stop on container 53ec0ef3-3290-476a-b2b6-385099e9b923
>     mesos-slave_1  | I1205 13:42:44.017251 18426 slave.cpp:7205] Recovering 
> executors
>     mesos-slave_1  | I1205 13:42:44.017493 18426 slave.cpp:7229] Sending 
> reconnect request to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of 
> framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 at 
> executor(1)@10.234.172.56:16653
>     mesos-slave_1  | W1205 13:42:44.018522 18426 process.cpp:1890] Failed to 
> send 'mesos.internal.ReconnectExecutorMessage' to '10.234.172.56:16653', 
> connect: Failed to connect to 10.234.172.56:16653: Connection refused
>     mesos-slave_1  | I1205 13:42:44.019569 18422 slave.cpp:6336] Executor 
> 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 has terminated with unknown status
>     mesos-slave_1  | I1205 13:42:44.022147 18422 slave.cpp:5310] Handling 
> status update TASK_FAILED (Status UUID: c946cd2b-1dec-4fc3-9ea6-501b40dcf4a7) 
> for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from @0.0.0.0:0
>     mesos-slave_1  | W1205 13:42:44.023823 18423 docker.cpp:1672] Ignoring 
> updating unknown container 53ec0ef3-3290-476a-b2b6-385099e9b923
>     mesos-slave_1  | I1205 13:42:44.024247 18425 
> task_status_update_manager.cpp:328] Received task status update TASK_FAILED 
> (Status UUID: c946cd2b-1dec-4fc3-9ea6-501b40dcf4a7) for task 
> sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
>     mesos-slave_1  | I1205 13:42:44.024363 18425 
> task_status_update_manager.cpp:842] Checkpointing UPDATE for task status 
> update TASK_FAILED (Status UUID: c946cd2b-1dec-4fc3-9ea6-501b40dcf4a7) for 
> task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
>     mesos-slave_1  | I1205 13:42:46.020193 18423 slave.cpp:5238] Cleaning up 
> un-reregistered executors
>     mesos-slave_1  | I1205 13:42:46.020882 18423 slave.cpp:7358] Finished 
> recovery
> {noformat}
> And then a new container is scheduled to run. Marathon shows a status of 
> {{TASK_FAILED}} with message {{Container terminated}} for the previous task.
> Pay special attention to the part where the Agent tries to connect to the 
> executor port:
> {noformat}
>     mesos-slave_1  | I1205 13:42:44.017493 18426 slave.cpp:7229] Sending 
> reconnect request to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of 
> framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 at 
> executor(1)@10.234.172.56:16653
>     mesos-slave_1  | W1205 13:42:44.018522 18426 process.cpp:1890] Failed to 
> send 'mesos.internal.ReconnectExecutorMessage' to '10.234.172.56:16653', 
> connect: Failed to connect to 10.234.172.56:16653: Connection refused
> {noformat}
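> In case it helps, the address the agent tries to reconnect to comes from the 
> executor checkpoints; assuming the default checkpoint layout under the meta 
> directory, they can be inspected like this (paths abbreviated, and the output 
> shown is what I would expect based on the logs above):
> {noformat}
>     $ cat .../meta/slaves/<agent-id>/.../runs/53ec0ef3-3290-476a-b2b6-385099e9b923/pids/forked.pid
>     12435
>     $ cat .../meta/slaves/<agent-id>/.../runs/53ec0ef3-3290-476a-b2b6-385099e9b923/pids/libprocess.pid
>     executor(1)@10.234.172.56:16653
> {noformat}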
> Looking at the Mesos Agent recovery documentation, the setup seems very 
> straightforward. I'm using the default {{mesos-docker-executor}} to run all 
> docker tasks. Is there any detail that I'm not seeing, or is something 
> misconfigured in my setup?
> Can anyone confirm that the {{mesos-docker-executor}} is capable of doing 
> task recovery?
> Thank you very much and if you need any additional information, let me know.
> Thanks,



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
