[ https://issues.apache.org/jira/browse/MESOS-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elizabeth Lingg resolved MESOS-2605.
------------------------------------
    Resolution: Fixed

Tested this with 0.22.1 RC6, and it seems to be fixed.

> The slave sometimes does not send active executors during reregistration
> ------------------------------------------------------------------------
>
>                 Key: MESOS-2605
>                 URL: https://issues.apache.org/jira/browse/MESOS-2605
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.22.0
>            Reporter: Elizabeth Lingg
>              Labels: mesosphere
>
> The slave sometimes does not send its active executors during
> reregistration. Framework checkpointing is enabled, and the executor
> reregisters successfully. However, the tasks in that executor are marked
> TASK_LOST (abnormal executor termination) because the Mesos master removes
> the executor as unknown. An example is the task
> task.journalnode.journalnode.NodeExecutor.1428609184051.
> Slave logs for the task:
> {code}
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 
> 19:53:06.778790 25126 status_update_manager.cpp:317] Received status update 
> TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task 
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 
> 19:53:06.779013 25126 status_update_manager.hpp:346] Checkpointing UPDATE for 
> status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for 
> task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 
> 19:53:06.781788 25123 slave.cpp:2753] Forwarding the update TASK_RUNNING 
> (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task 
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008 to master@10.142.250.253:5050
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 
> 19:53:06.781889 25123 slave.cpp:2686] Sending acknowledgement for status 
> update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task 
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008 to executor(1)@10.168.119.78:47638
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 
> 19:53:06.784503 25124 status_update_manager.cpp:389] Received status update 
> acknowledgement (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for task 
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008
> Apr 09 19:53:06 ip-10-168-119-78.ec2.internal mesos-slave[25116]: I0409 
> 19:53:06.784567 25124 status_update_manager.hpp:346] Checkpointing ACK for 
> status update TASK_RUNNING (UUID: 4eb22075-c319-463d-8f70-94db9caa69c6) for 
> task task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008
> {code}
> Master Logs:
> {code}
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: W0409 
> 20:19:43.008666  1067 master.cpp:4015] Executor 
> executor.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008 possibly unknown to the slave 
> 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 
> (ec2-54-237-57-237.compute-1.amazonaws.com)
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 
> 20:19:43.008652  1074 hierarchical.hpp:648] Recovered cpus(*):0.1; 
> mem(*):1536 (total allocatable: cpus(*):3.5; mem(*):21113; disk(*):142210; 
> ports(*):[3889-5044, 5046-5049, 2182-2958, 2960-3887, 1025-2180, 8082-9041, 
> 9043-9159, 9161-9999, 5052-6999, 7002-7198, 7200-8079, 10001-65535]) on slave 
> 20150407-233647-2059219722-5050-1659-S5 from framework 
> 20150408-002100-4261056010-5050-1047-0008
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 
> 20:19:43.008712  1067 master.cpp:4714] Removing executor 
> 'executor.journalnode.NodeExecutor.1428609184051' with resources cpus(*):0.1; 
> mem(*):1536 of framework 20150408-002100-4261056010-5050-1047-0008 on slave 
> 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 
> (ec2-54-237-57-237.compute-1.amazonaws.com)
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 
> 20:19:43.010372  1067 master.cpp:3295] Status update TASK_LOST (UUID: 
> e5532567-e5b2-4fca-87aa-f3f98e371640) for task 
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008 from slave 
> 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 
> (ec2-54-237-57-237.compute-1.amazonaws.com)
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 
> 20:19:43.013746  1067 master.cpp:3295] Status update TASK_LOST (UUID: 
> e5532567-e5b2-4fca-87aa-f3f98e371640) for task 
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008 from slave 
> 20150407-233647-2059219722-5050-1659-S5 at slave(1)@10.168.119.78:5051 
> (ec2-54-237-57-237.compute-1.amazonaws.com)
> Apr 09 20:19:43 ip-10-142-250-253.ec2.internal mesos-master[1047]: I0409 
> 20:19:43.013767  1067 master.cpp:3336] Forwarding status update TASK_LOST 
> (UUID: e5532567-e5b2-4fca-87aa-f3f98e371640) for task 
> task.journalnode.journalnode.NodeExecutor.1428609184051 of framework 
> 20150408-002100-4261056010-5050-1047-0008
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
