[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989728#comment-16989728 ]
Andrei Budnik commented on MESOS-10066:
---------------------------------------

Could you please attach full agent logs?

mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
---------------------------------------------------------------------------------------

Key: MESOS-10066
URL: https://issues.apache.org/jira/browse/MESOS-10066
Project: Mesos
Issue Type: Bug
Components: agent, containerization, docker, executor
Affects Versions: 1.7.3
Reporter: Dalton Matos Coelho Barreto
Priority: Critical

Hello all,

The documentation about Agent Recovery shows two conditions for recovery to be possible:
- The agent must have recovery enabled (default true?);
- The scheduler must register itself declaring that it has checkpointing enabled.

In my tests I'm using Marathon as the scheduler, and Mesos itself sees Marathon as a checkpoint-enabled scheduler:

{noformat}
$ curl -sL 10.234.172.27:5050/state | jq '.frameworks[] | {"name": .name, "id": .id, "checkpoint": .checkpoint, "active": .active}'
{
  "name": "asgard-chronos",
  "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001",
  "checkpoint": true,
  "active": true
}
{
  "name": "marathon",
  "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000",
  "checkpoint": true,
  "active": true
}
{noformat}

Here is what I'm using:
# Mesos Master: 1.4.1
# Mesos Agent: 1.7.3
# Using docker image {{mesos/mesos-centos:1.7.x}}
# Docker sock mounted from the host
# Docker binary also mounted from the host
# Marathon: 1.4.12
# Docker:

{noformat}
Client: Docker Engine - Community
 Version:           19.03.5
 API version:       1.39 (downgraded from 1.40)
 Go version:        go1.12.12
 Git commit:        633a0ea838
 Built:             Wed Nov 13 07:22:05 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.2
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.6
  Git commit:       6247962
  Built:            Sun Feb 10 03:42:13 2019
  OS/Arch:          linux/amd64
  Experimental:     false
{noformat}

h2. The problem

Here is the Marathon test app, a simple {{sleep 99d}} based on the {{debian}} docker image:

{noformat}
{
  "id": "/sleep",
  "cmd": "sleep 99d",
  "cpus": 0.1,
  "mem": 128,
  "disk": 0,
  "instances": 1,
  "constraints": [],
  "acceptedResourceRoles": [
    "*"
  ],
  "container": {
    "type": "DOCKER",
    "volumes": [],
    "docker": {
      "image": "debian",
      "network": "HOST",
      "privileged": false,
      "parameters": [],
      "forcePullImage": true
    }
  },
  "labels": {},
  "portDefinitions": []
}
{noformat}

This task runs fine and gets scheduled on the right agent, which is running Mesos Agent 1.7.3 (using the docker image {{mesos/mesos-centos:1.7.x}}).
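For reference, whether the recovery-related settings actually took effect can be double-checked at runtime through the agent's {{/flags}} endpoint. This is only a verification sketch; it assumes the agent listens on the default port 5051 on the host from this report, and the values shown are the Mesos defaults:

{noformat}
# Inspect the agent's recovery-related flags (assumed: default agent port 5051).
$ curl -s 10.234.172.56:5051/flags | jq '.flags | {recover, strict, recovery_timeout}'
{
  "recover": "reconnect",
  "strict": "true",
  "recovery_timeout": "15mins"
}
{noformat}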
Here is a sample log:

{noformat}
mesos-slave_1 | I1205 13:24:21.391464 19849 slave.cpp:2403] Authorizing task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:21.392707 19849 slave.cpp:2846] Launching task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:21.392895 19849 paths.cpp:748] Creating sandbox '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
mesos-slave_1 | I1205 13:24:21.394399 19849 paths.cpp:748] Creating sandbox '/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
mesos-slave_1 | I1205 13:24:21.394918 19849 slave.cpp:9068] Launching executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 with resources [{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}] in work directory '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
mesos-slave_1 | I1205 13:24:21.396499 19849 slave.cpp:3078] Queued task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:21.397038 19849 slave.cpp:3526] Launching container 53ec0ef3-3290-476a-b2b6-385099e9b923 for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:21.398028 19846 docker.cpp:1177] Starting container '53ec0ef3-3290-476a-b2b6-385099e9b923' for task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' (and executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f') of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | W1205 13:24:22.576869 19846 slave.cpp:8496] Failed to get resource statistics for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000: Failed to run '/usr/bin/docker -H unix:///var/run/docker.sock inspect --type=container mesos-53ec0ef3-3290-476a-b2b6-385099e9b923': exited with status 1; stderr='Error: No such container: mesos-53ec0ef3-3290-476a-b2b6-385099e9b923'
mesos-slave_1 | I1205 13:24:24.094985 19853 docker.cpp:792] Checkpointing pid 12435 to '/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/pids/forked.pid'
mesos-slave_1 | I1205 13:24:24.343099 19848 slave.cpp:4839] Got registration for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from executor(1)@10.234.172.56:16653
mesos-slave_1 | I1205 13:24:24.345593 19848 docker.cpp:1685] Ignoring updating container 53ec0ef3-3290-476a-b2b6-385099e9b923 because resources passed to update are identical to existing resources
mesos-slave_1 | I1205 13:24:24.345945 19848 slave.cpp:3296] Sending queued task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 at executor(1)@10.234.172.56:16653
mesos-slave_1 | I1205 13:24:24.362699 19853 slave.cpp:5310] Handling status update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from executor(1)@10.234.172.56:16653
mesos-slave_1 | I1205 13:24:24.363222 19853 task_status_update_manager.cpp:328] Received task status update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:24.364115 19853 task_status_update_manager.cpp:842] Checkpointing UPDATE for task status update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:24.364993 19847 slave.cpp:5815] Forwarding the update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to master@10.234.172.34:5050
mesos-slave_1 | I1205 13:24:24.365594 19847 slave.cpp:5726] Sending acknowledgement for status update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to executor(1)@10.234.172.56:16653
mesos-slave_1 | I1205 13:24:24.401759 19846 task_status_update_manager.cpp:401] Received task status update acknowledgement (UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:24.401926 19846 task_status_update_manager.cpp:842] Checkpointing ACK for task status update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:30.481829 19850 slave.cpp:5310] Handling status update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from executor(1)@10.234.172.56:16653
mesos-slave_1 | I1205 13:24:30.482340 19848 task_status_update_manager.cpp:328] Received task status update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:30.482550 19848 task_status_update_manager.cpp:842] Checkpointing UPDATE for task status update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:30.483163 19848 slave.cpp:5815] Forwarding the update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to master@10.234.172.34:5050
mesos-slave_1 | I1205 13:24:30.483664 19848 slave.cpp:5726] Sending acknowledgement for status update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to executor(1)@10.234.172.56:16653
mesos-slave_1 | I1205 13:24:30.557307 19852 task_status_update_manager.cpp:401] Received task status update acknowledgement (UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:30.557467 19852 task_status_update_manager.cpp:842] Checkpointing ACK for task status update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
{noformat}

An important part of this log:

{noformat}
Sending acknowledgement for status update TASK_RUNNING for task sleep.8c187c41-1762-11ea-a2e5-02429217540f to executor(1)@10.234.172.56:16653
{noformat}

Here we have the address of the executor responsible for our newly created task: {{executor(1)@10.234.172.56:16653}}.
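As a cross-check, the agent checkpointed the executor's pid (12435) in the {{docker.cpp:792}} line above, so the checkpointed pid can be compared against the live process list. A small sketch, using only values taken from this run:

{noformat}
# Read the pid the agent checkpointed for this executor run...
$ cat /var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/pids/forked.pid
12435
# ...and confirm that a process with that pid is still alive.
$ kill -0 12435 && echo "executor process is alive"
executor process is alive
{noformat}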
Looking at the OS process list, we see one instance of {{mesos-docker-executor}}:

{noformat}
$ ps aux | grep executor
root     12435  0.5  1.3 851796 54100 ?   Ssl  10:24  0:03 mesos-docker-executor --cgroups_enable_cfs=true --container=mesos-53ec0ef3-3290-476a-b2b6-385099e9b923 --docker=/usr/bin/docker --docker_socket=/var/run/docker.sock --help=false --initialize_driver_logging=true --launcher_dir=/usr/libexec/mesos --logbufsecs=0 --logging_level=INFO --mapped_directory=/mnt/mesos/sandbox --quiet=false --sandbox_directory=/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923 --stop_timeout=10secs
root     12456  0.0  0.8 152636 34548 ?   Sl   10:24  0:00 /usr/bin/docker -H unix:///var/run/docker.sock run --cpu-shares 102 --cpu-quota 10000 --memory 134217728 -e HOST=10.234.172.56 -e MARATHON_APP_DOCKER_IMAGE=debian -e MARATHON_APP_ID=/sleep -e MARATHON_APP_LABELS=HOLLOWMAN_DEFAULT_SCALE -e MARATHON_APP_LABEL_HOLLOWMAN_DEFAULT_SCALE=1 -e MARATHON_APP_RESOURCE_CPUS=0.1 -e MARATHON_APP_RESOURCE_DISK=0.0 -e MARATHON_APP_RESOURCE_GPUS=0 -e MARATHON_APP_RESOURCE_MEM=128.0 -e MARATHON_APP_VERSION=2019-12-05T13:24:18.206Z -e MESOS_CONTAINER_NAME=mesos-53ec0ef3-3290-476a-b2b6-385099e9b923 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_TASK_ID=sleep.8c187c41-1762-11ea-a2e5-02429217540f -v /var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923:/mnt/mesos/sandbox --net host --entrypoint /bin/sh --name mesos-53ec0ef3-3290-476a-b2b6-385099e9b923 --label=hollowman.appname=/asgard/sleep --label=MESOS_TASK_ID=sleep.8c187c41-1762-11ea-a2e5-02429217540f debian -c sleep 99d
root     16776  0.0  0.0   7960   812 pts/3   S+   10:34  0:00 tail -f /var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/stderr
{noformat}

Here we have 3 processes:
- the Mesos executor;
- the container that this executor started, running {{sleep 99d}};
- a {{tail -f}} looking at the {{stderr}} of this executor.

The {{stderr}} content is as follows:

{noformat}
$ tail -f /var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/stderr
I1205 13:24:24.337672 12435 exec.cpp:162] Version: 1.7.3
I1205 13:24:24.347107 12449 exec.cpp:236] Executor registered on agent 79ad3a13-b567-4273-ac8c-30378d35a439-S60499
I1205 13:24:24.349907 12451 executor.cpp:130] Registered docker executor on 10.234.172.56
I1205 13:24:24.350291 12449 executor.cpp:186] Starting task sleep.8c187c41-1762-11ea-a2e5-02429217540f
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
W1205 13:24:29.363168 12455 executor.cpp:253] Docker inspect timed out after 5secs for container 'mesos-53ec0ef3-3290-476a-b2b6-385099e9b923'
{noformat}

While the Mesos Agent is still running, it is possible to connect to this executor's port:

{noformat}
telnet 10.234.172.56 16653
Trying 10.234.172.56...
Connected to 10.234.172.56.
Escape character is '^]'.
^]
telnet> Connection closed.
{noformat}

As soon as the Agent shuts down (a simple {{docker stop}}), this line appears in the {{stderr}} log of the executor:

{noformat}
I1205 13:40:31.160290 12452 exec.cpp:518] Agent exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with agent 79ad3a13-b567-4273-ac8c-30378d35a439-S60499
{noformat}
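The 15-minute window in that message corresponds to the agent's {{--recovery_timeout}} flag (default {{15mins}}): executors of checkpointing frameworks wait that long for the agent to come back before self-terminating. A minimal sketch of where these knobs live, assuming an agent launched with default recovery settings:

{noformat}
# Recovery-related agent flags; the values shown are the Mesos defaults.
mesos-agent --work_dir=/var/lib/mesos/agent \
            --recover=reconnect \
            --recovery_timeout=15mins
{noformat}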
And now, looking at the OS process list, the executor process is gone:

{noformat}
$ ps aux | grep executor
root     16776  0.0  0.0   7960   812 pts/3   S+   10:34  0:00 tail -f /var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/stderr
{noformat}

...and we can't telnet to that port anymore:

{noformat}
telnet 10.234.172.56 16653
Trying 10.234.172.56...
telnet: Unable to connect to remote host: Connection refused
{noformat}

When we run the Mesos Agent again, we see this:

{noformat}
mesos-slave_1 | I1205 13:42:39.925992 18421 docker.cpp:899] Recovering Docker containers
mesos-slave_1 | I1205 13:42:44.006150 18421 docker.cpp:912] Got the list of Docker containers
mesos-slave_1 | I1205 13:42:44.010700 18421 docker.cpp:1009] Recovering container '53ec0ef3-3290-476a-b2b6-385099e9b923' for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | W1205 13:42:44.011874 18421 docker.cpp:1053] Failed to connect to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000: Failed to connect to 10.234.172.56:16653: Connection refused
mesos-slave_1 | I1205 13:42:44.016306 18421 docker.cpp:2564] Executor for container 53ec0ef3-3290-476a-b2b6-385099e9b923 has exited
mesos-slave_1 | I1205 13:42:44.016412 18421 docker.cpp:2335] Destroying container 53ec0ef3-3290-476a-b2b6-385099e9b923 in RUNNING state
mesos-slave_1 | I1205 13:42:44.016705 18421 docker.cpp:1133] Finished processing orphaned Docker containers
mesos-slave_1 | I1205 13:42:44.017102 18421 docker.cpp:2385] Running docker stop on container 53ec0ef3-3290-476a-b2b6-385099e9b923
mesos-slave_1 | I1205 13:42:44.017251 18426 slave.cpp:7205] Recovering executors
mesos-slave_1 | I1205 13:42:44.017493 18426 slave.cpp:7229] Sending reconnect request to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 at executor(1)@10.234.172.56:16653
mesos-slave_1 | W1205 13:42:44.018522 18426 process.cpp:1890] Failed to send 'mesos.internal.ReconnectExecutorMessage' to '10.234.172.56:16653', connect: Failed to connect to 10.234.172.56:16653: Connection refused
mesos-slave_1 | I1205 13:42:44.019569 18422 slave.cpp:6336] Executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 has terminated with unknown status
mesos-slave_1 | I1205 13:42:44.022147 18422 slave.cpp:5310] Handling status update TASK_FAILED (Status UUID: c946cd2b-1dec-4fc3-9ea6-501b40dcf4a7) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from @0.0.0.0:0
mesos-slave_1 | W1205 13:42:44.023823 18423 docker.cpp:1672] Ignoring updating unknown container 53ec0ef3-3290-476a-b2b6-385099e9b923
mesos-slave_1 | I1205 13:42:44.024247 18425 task_status_update_manager.cpp:328] Received task status update TASK_FAILED (Status UUID: c946cd2b-1dec-4fc3-9ea6-501b40dcf4a7) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:42:44.024363 18425 task_status_update_manager.cpp:842] Checkpointing UPDATE for task status update TASK_FAILED (Status UUID: c946cd2b-1dec-4fc3-9ea6-501b40dcf4a7) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:42:46.020193 18423 slave.cpp:5238] Cleaning up un-reregistered executors
mesos-slave_1 | I1205 13:42:46.020882 18423 slave.cpp:7358] Finished recovery
{noformat}
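A side note on the timing above: the two seconds between "Recovering executors" (13:42:44) and "Cleaning up un-reregistered executors" (13:42:46) match the agent's {{--executor_reregistration_timeout}} flag, whose default is {{2secs}}. This is only an inference from the timestamps, and raising the value would not help here, since the window only matters while the executor process is still alive to re-register:

{noformat}
# Agent flag governing how long recovered executors get to re-register
# (default shown; only relevant if the executor process survived).
mesos-agent --executor_reregistration_timeout=2secs ...
{noformat}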
And then a new container is scheduled to run. Marathon shows a status of {{TASK_FAILED}} with the message {{Container terminated}} for the previous task.

Special attention goes to the part where the Agent tries to connect to the executor's port:

{noformat}
mesos-slave_1 | I1205 13:42:44.017493 18426 slave.cpp:7229] Sending reconnect request to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 at executor(1)@10.234.172.56:16653
mesos-slave_1 | W1205 13:42:44.018522 18426 process.cpp:1890] Failed to send 'mesos.internal.ReconnectExecutorMessage' to '10.234.172.56:16653', connect: Failed to connect to 10.234.172.56:16653: Connection refused
{noformat}

Looking at the Mesos Agent recovery documentation, the setup seems very straightforward. I'm using the default {{mesos-docker-executor}} to run all docker tasks. Is there any detail that I'm not seeing, or is something misconfigured in my setup?

Can anyone confirm that the {{mesos-docker-executor}} is capable of surviving an agent restart so that task recovery can succeed?

Thank you very much, and if you need any additional information, let me know.

Thanks,

--
This message was sent by Atlassian Jira
(v8.3.4#803005)