Hey folks,

Occassionally, a task runs normally and finishes, but the slave ignores the
registration timeout because it thinks the framework has exited.  In my
example below, the task is "echo hi" in chronos.  I've also observed this
with other frameworks, such as Hadoop.  This is a problem for chronos tasks
in particular, because after this occurs I'm unable to retrieve the task
logs via the Mesos web UI.  These task logs are very helpful for debugging
chronos jobs.

Here's the log from the master followed by the slave log:

I0805 19:36:06.977514 16379 master.hpp:311] Adding task
> ct:1375731364150:0:level2 with resources cpus(*):1; mem(*):1024;
> disk(*):1024 on slave 201308022130-200446986-5050-11821-0
> I0805 19:36:06.981220 16379 master.cpp:1718] Launching task
> ct:1375731364150:0:level2 of framework chronos with resources cpus(*):1;
> mem(*):1024; disk(*):1024 on slave 201308022130-200446986-5050-11821-0
> (i-c9b911b0)
> I0805 19:36:07.108693 16373 master.cpp:1130] Status update from slave(1)@
> 10.148.197.225:5051: task ct:1375731364150:0:level2 of framework chronos
> is now in state TASK_RUNNING
> I0805 19:36:07.146131 16369 master.cpp:1130] Status update from slave(1)@
> 10.148.164.186:5051: task ct:1375731364150:0:level3 of framework chronos
> is now in state TASK_FINISHED
> I0805 19:36:07.146595 16369 master.hpp:323] Removing task
> ct:1375731364150:0:level3 with resources cpus(*):1; mem(*):1024;
> disk(*):1024 on slave 201308022130-200446986-5050-11821-1
> I0805 19:36:07.147125 16369 hierarchical_allocator_process.hpp:595]
> Recovered cpus(*):1; mem(*):1024; disk(*):1024 (total allocatable:
> cpus(*):1.4; mem(*):1275; disk(*):610649; ports(*):[31002-32000]) on slave
> 201308022130-200446986-5050-11821-1 from framework chronos
> I0805 19:36:07.971992 16363 master.cpp:1366] Sending 1 offers to framework
> chronos
> I0805 19:36:07.973424 16374 master.cpp:1599] Processing reply for offer
> 201308051833-200446986-5050-16348-5068 on slave
> 201308022130-200446986-5050-11821-1 (i-d7b911ae) for framework chronos
> I0805 19:36:07.973973 16367 hierarchical_allocator_process.hpp:548]
> Framework chronos filtered slave 201308022130-200446986-5050-11821-1 for
> 5secs
> I0805 19:36:08.102439 16364 master.cpp:1130] Status update from slave(1)@
> 10.148.197.225:5051: task ct:1375731364150:0:level2 of framework chronos
> is now in state TASK_FINISHED
> I0805 19:36:08.102907 16364 master.hpp:323] Removing task
> ct:1375731364150:0:level2 with resources cpus(*):1; mem(*):1024;
> disk(*):1024 on slave 201308022130-200446986-5050-11821-0



I0805 19:36:06.982079 14728 slave.cpp:747] Got assigned task
> ct:1375731364150:0:level2 for framework chronos
> I0805 19:36:06.983682 14728 gc.cpp:84] Unscheduling
> '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos'
> for removal
> I0805 19:36:06.984241 14746 slave.cpp:845] Launching task
> ct:1375731364150:0:level2 for framework chronos
> I0805 19:36:06.986026 14746 paths.hpp:303] Created executor directory
> '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos/executors/ct:137573136
> 4150:0:level2/runs/7869f30a-7700-4cd7-8bb4-4a8ed618b5ce'
> I0805 19:36:06.986639 14746 slave.cpp:956] Queuing task
> 'ct:1375731364150:0:level2' for executor ct:1375731364150:0:level2 of
> framework 'chronos
> I0805 19:36:06.987416 14749 cgroups_isolator.cpp:513] Launching
> ct:1375731364150:0:level2 (/usr/local/libexec/mesos/mesos-executor) in
> /tmp/mesos/slaves/201308022130-
> 200446986-5050-11821-0/frameworks/chronos/executors/ct:1375731364150:0:level2/runs/7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> with resources  for framework chronos in cgrou
> p
> mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> I0805 19:36:06.988970 14746 slave.cpp:519] Successfully attached file
> '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos/executors/ct:137573136
> 4150:0:level2/runs/7869f30a-7700-4cd7-8bb4-4a8ed618b5ce'
> I0805 19:36:06.991384 14749 cgroups_isolator.cpp:656] Changing cgroup
> controls for executor ct:1375731364150:0:level2 of framework chronos with
> resources
> I0805 19:36:06.993352 14749 cgroups_isolator.cpp:1053] Started listening
> for OOM events for executor ct:1375731364150:0:level2 of framework chronos
> I0805 19:36:06.995985 14749 cgroups_isolator.cpp:543] Forked executor at =
> 25602
> I0805 19:36:07.099602 14739 slave.cpp:1423] Got registration for executor
> 'ct:1375731364150:0:level2' of framework chronos
> I0805 19:36:07.100589 14739 slave.cpp:1538] Flushing queued task
> ct:1375731364150:0:level2 for executor 'ct:1375731364150:0:level2' of
> framework chronos
> I0805 19:36:07.100688 14738 cgroups_isolator.cpp:656] Changing cgroup
> controls for executor ct:1375731364150:0:level2 of framework chronos with
> resources cpus(*):1; m
> em(*):1024; disk(*):1024
> I0805 19:36:07.102645 14738 cgroups_isolator.cpp:964] Updated
> 'cpu.cfs_period_us' to 100000 and 'cpu.cfs_quota_us' to 100000 for executor
> ct:1375731364150:0:level2 of
>  framework chronos
> I0805 19:36:07.103688 14738 cgroups_isolator.cpp:889] Updated 'cpu.shares'
> to 1024 for executor ct:1375731364150:0:level2 of framework chronos
> I0805 19:36:07.104630 14738 cgroups_isolator.cpp:1027] Updated
> 'memory.soft_limit_in_bytes' to 1073741824 for executor
> ct:1375731364150:0:level2 of framework chronos
> I0805 19:36:07.105954 14728 slave.cpp:1720] Handling status update
> TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for task
> ct:1375731364150:0:level2 of framework chronos from executor(1)@
> 10.148.197.225:45297
> I0805 19:36:07.106722 14741 status_update_manager.cpp:290] Received status
> update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for task
> ct:1375731364150:0:level2 of framework chronos with checkpoint=false
> I0805 19:36:07.107151 14741 status_update_manager.cpp:450] Creating
> StatusUpdate stream for task ct:1375731364150:0:level2 of framework chronos
> I0805 19:36:07.107681 14741 status_update_manager.cpp:336] Forwarding
> status update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for
> task ct:1375731364150:0:level2 of framework chronos to
> [email protected]:5050
> I0805 19:36:07.108211 14728 slave.cpp:1837] Sending acknowledgement for
> status update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for
> task ct:1375731364150:0:level2 of framework chronos to executor(1)@
> 10.148.197.225:45297
> I0805 19:36:07.110834 14735 status_update_manager.cpp:360] Received status
> update acknowledgement 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5 for task
> ct:1375731364150:0:level2 of framework chronos
> I0805 19:36:07.105954 14728 slave.cpp:1720] Handling status update
> TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for task
> ct:1375731364150:0:level2 of framework chronos from executor(1)@
> 10.148.197.225:45297
> I0805 19:36:07.106722 14741 status_update_manager.cpp:290] Received status
> update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for task
> ct:1375731364150:0:level2 of framework chronos with checkpoint=false
> I0805 19:36:07.107151 14741 status_update_manager.cpp:450] Creating
> StatusUpdate stream for task ct:1375731364150:0:level2 of framework chronos
> I0805 19:36:07.107681 14741 status_update_manager.cpp:336] Forwarding
> status update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for
> task ct:1375731364150:0:level2 of framework chronos to
> [email protected]:5050
> I0805 19:36:07.108211 14728 slave.cpp:1837] Sending acknowledgement for
> status update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for
> task ct:1375731364150:0:level2 of framework chronos to executor(1)@
> 10.148.197.225:45297
> I0805 19:36:07.110834 14735 status_update_manager.cpp:360] Received status
> update acknowledgement 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5 for task
> ct:1375731364150:0:level2 of framework chronos
> I0805 19:36:08.100355 14741 slave.cpp:1720] Handling status update
> TASK_FINISHED (UUID: 284d4591-06b4-410c-b237-5d740a78ad88) for task
> ct:1375731364150:0:level2 of framework chronos from executor(1)@
> 10.148.197.225:45297
> I0805 19:36:08.100966 14739 status_update_manager.cpp:290] Received status
> update TASK_FINISHED (UUID: 284d4591-06b4-410c-b237-5d740a78ad88) for task
> ct:1375731364150:0:level2 of framework chronos with checkpoint=false
> I0805 19:36:08.101403 14739 status_update_manager.cpp:336] Forwarding
> status update TASK_FINISHED (UUID: 284d4591-06b4-410c-b237-5d740a78ad88)
> for task ct:1375731364150:0:level2 of framework chronos to
> [email protected]:5050
> I0805 19:36:08.101419 14740 cgroups_isolator.cpp:656] Changing cgroup
> controls for executor ct:1375731364150:0:level2 of framework chronos with
> resources
> I0805 19:36:08.102355 14750 slave.cpp:1837] Sending acknowledgement for
> status update TASK_FINISHED (UUID: 284d4591-06b4-410c-b237-5d740a78ad88)
> for task ct:1375731364150:0:level2 of framework chronos to executor(1)@
> 10.148.197.225:45297
> I0805 19:36:08.114512 14727 status_update_manager.cpp:360] Received status
> update acknowledgement 284d4591-06b4-410c-b237-5d740a78ad88 for task
> ct:1375731364150:0:level2 of framework chronos
> I0805 19:36:08.115033 14727 status_update_manager.cpp:481] Cleaning up
> status update stream for task ct:1375731364150:0:level2 of framework chronos
> I0805 19:36:09.707617 14741 cgroups_isolator.cpp:851] Executor
> ct:1375731364150:0:level2 of framework chronos terminated with status 0
> I0805 19:36:09.708048 14741 cgroups_isolator.cpp:619] Killing executor
> ct:1375731364150:0:level2 of framework chronos
> I0805 19:36:09.709691 14741 cgroups_isolator.cpp:1073] OOM notifier is
> triggered for executor ct:1375731364150:0:level2 of framework chronos with
> uuid 7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> I0805 19:36:09.710186 14741 cgroups_isolator.cpp:1078] Discarded OOM
> notifier for executor ct:1375731364150:0:level2 of framework chronos with
> uuid 7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> I0805 19:36:09.715747 14741 cgroups.cpp:1193] Trying to freeze cgroup
> /cgroup/mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> I0805 19:36:09.718188 14741 cgroups.cpp:1232] Successfully froze cgroup
> /cgroup/mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> after 1 attempts
> I0805 19:36:09.724428 14731 cgroups.cpp:1208] Trying to thaw cgroup
> /cgroup/mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> I0805 19:36:09.724879 14731 cgroups.cpp:1318] Successfully thawed
> /cgroup/mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> I0805 19:36:09.882351 14740 cgroups_isolator.cpp:1192] Successfully
> destroyed cgroup
> mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> I0805 19:36:09.884524 14722 slave.cpp:2088] Executor
> 'ct:1375731364150:0:level2' of framework chronos has exited with status 0
> I0805 19:36:09.885027 14722 slave.cpp:2223] Cleaning up executor
> 'ct:1375731364150:0:level2' of framework chronos
> I0805 19:36:09.889178 14740 gc.cpp:56] Scheduling
> '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos/executors/ct:1375731364150:0:level2/runs/7869f30a-7700-4cd7-8bb4-4a8ed618b5ce'
> for removal
> I0805 19:36:09.889292 14722 slave.cpp:2287] Cleaning up framework chronos
> I0805 19:36:09.889802 14740 gc.cpp:56] Scheduling
> '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos/executors/ct:1375731364150:0:level2'
> for removal
> I0805 19:36:09.890043 14728 status_update_manager.cpp:252] Closing status
> update streams for framework chronos
> I0805 19:36:09.890599 14740 gc.cpp:56] Scheduling
> '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos'
> for removal
> I0805 19:36:38.089031 14736 slave.cpp:2539] Current usage 0.47%. Max
> allowed age: 1.790501830619676days
> I0805 19:37:01.998739 14748 slave.cpp:2446] Framework chronos seems to
> have exited. Ignoring registration timeout for executor
> 'ct:1375731357036:0:level3'
> I0805 19:37:04.006137 14735 slave.cpp:2446] Framework chronos seems to
> have exited. Ignoring registration timeout for executor
> 'ct:1375731357119:0:echo hi'
> I0805 19:37:06.987412 14724 slave.cpp:2446] Framework chronos seems to
> have exited. Ignoring registration timeout for executor
> 'ct:1375731364150:0:level2'

Reply via email to