[ https://issues.apache.org/jira/browse/MESOS-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elizabeth Lingg resolved MESOS-2605.
------------------------------------
    Resolution: Fixed

Tested this with 0.22.1 RC6, and it seems to be fixed.

> The slave sometimes does not send active executors during reregistration
> ------------------------------------------------------------------------
>
>                 Key: MESOS-2605
>                 URL: https://issues.apache.org/jira/browse/MESOS-2605
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.22.0
>            Reporter: Elizabeth Lingg
>              Labels: mesosphere
>
> The slave sometimes does not send active executors during reregistration.
> Framework checkpointing is enabled, and the executor successfully
> reregisters. However, the tasks in that executor are LOST (by abnormal
> executor termination) because the executor is removed by the mesos master as
> unknown. See the example below,
> task.journalnode.journalnode.NodeExecutor.1428609184051.
> See the Slave Logs here for the Task:
> {code}
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409
> 19:53:06.778790 25126 status_update_manager.cpp:317] Received status update
> TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework
> 20150408-002100-4261056010-5050-1047-0008
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409
> 19:53:06.779013 25126 status_update_manager.hpp:346] Checkpointing UPDATE for
> status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for
> task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework
> 20150408-002100-4261056010-5050-1047-0008
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409
> 19:53:06.781788 25123 slave.cpp:2753] Forwarding the update TASK_RUNNING
> (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework
> 20150408-002100-4261056010-5050-1047-0008 to master@10.142.250.253:5050
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409
> 19:53:06.781889 25123 slave.cpp:2686] Sending acknowledgement for status
> update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework
> 20150408-002100-4261056010-5050-1047-0008 to executor(1)@10.168.119.78:47638
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409
> 19:53:06.784503 25124 status_update_manager.cpp:389] Received status update
> acknowledgement (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework
> 20150408-002100-4261056010-5050-1047-0008
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409
> 19:53:06.784567 25124 status_update_manager.hpp:346] Checkpointing ACK for
> status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for
> task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework
> 20150408-002100-4261056010-5050-1047-0008
> {code}
> Master Logs:
> {code}
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: W0409
> 20:19:43.008666 1067 master.cpp:4015] Executor
> executor.journalnode.NodeExecutor.1428609184051 of framework
> 20150408-002100-4261056010-5050-1047-0008 possibly unknown to the slave
> 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051
> (ec2-54-237-57-237.compute-1.amazonaws.com)
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409
> 20:19:43.008652 1074 hierarchical.hpp:648] Recovered cpus(*):0.1;
> mem(*):1536 (total allocatable: cpus(*):3.5; mem(*):21113; disk(*):142210;
> ports(*):[3889-5044, 5046-5049, 2182-2958, 2960-3887, 1025-2180, 8082-9041,
> 9043-9159, 9161-9999, 5052-6999, 7002-7198, 7200-8079, 10001-65535]) on slave
> 20150407-233647-2059219722-5050-1659-S5 from framework
> 20150408-002100-4261056010-5050-1047-0008
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409
> 20:19:43.008712 1067 master.cpp:4714] Removing executor
> 'executor.journalnode.NodeExecutor.1428609184051' with resources cpus(*):0.1;
> mem(*):1536 of framework 20150408-002100-4261056010-5050-1047-0008 on slave
> 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051
> (ec2-54-237-57-237.compute-1.amazonaws.com)
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409
> 20:19:43.010372 1067 master.cpp:3295] Status update TASK_LOST (UUID:
> e5532567-e5b2-4fca-87aa-f3f98e371640) for task
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework
> 20150408-002100-4261056010-5050-1047-0008 from slave
> 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051
> (ec2-54-237-57-237.compute-1.amazonaws.com)
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409
> 20:19:43.013746 1067 master.cpp:3295] Status update TASK_LOST (UUID:
> e5532567-e5b2-4fca-87aa-f3f98e371640) for task
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework
> 20150408-002100-4261056010-5050-1047-0008 from slave
> 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051
> (ec2-54-237-57-237.compute-1.amazonaws.com)
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409
> 20:19:43.013767 1067 master.cpp:3336] Forwarding status update TASK_LOST
> (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640) for task
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework
> 20150408-002100-4261056010-5050-1047-0008
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)