Created https://issues.apache.org/jira/browse/MESOS-616 to directly address this.
On Mon, Aug 5, 2013 at 3:40 PM, Vinod Kone <[email protected]> wrote: > Hey Brenden, > > The "ignoring registration timeout" log message is innocuous (I'll probably > change its wording one of these days). I think there is an outstanding bug > where a framework moved to "completedFrameworks" is not moved back to > "frameworks" when the framework with same id is seen by mesos. > > > On Mon, Aug 5, 2013 at 1:08 PM, Brenden Matthews < > [email protected]> wrote: > > > Hey folks, > > > > Occassionally, a task runs normally and finishes, but the slave ignores > the > > registration timeout because it thinks the framework has exited. In my > > example below, the task is "echo hi" in chronos. I've also observed this > > with other frameworks, such as Hadoop. This is a problem for chronos > tasks > > in particular, because after this occurs I'm unable to retrieve the task > > logs via the Mesos web UI. These task logs are very helpful for > debugging > > chronos jobs. > > > > Here's the log from the master followed by the slave log: > > > > I0805 19:36:06.977514 16379 master.hpp:311] Adding task > > > ct:1375731364150:0:level2 with resources cpus(*):1; mem(*):1024; > > > disk(*):1024 on slave 201308022130-200446986-5050-11821-0 > > > I0805 19:36:06.981220 16379 master.cpp:1718] Launching task > > > ct:1375731364150:0:level2 of framework chronos with resources > cpus(*):1; > > > mem(*):1024; disk(*):1024 on slave 201308022130-200446986-5050-11821-0 > > > (i-c9b911b0) > > > I0805 19:36:07.108693 16373 master.cpp:1130] Status update from > slave(1)@ > > > 10.148.197.225:5051: task ct:1375731364150:0:level2 of framework > chronos > > > is now in state TASK_RUNNING > > > I0805 19:36:07.146131 16369 master.cpp:1130] Status update from > slave(1)@ > > > 10.148.164.186:5051: task ct:1375731364150:0:level3 of framework > chronos > > > is now in state TASK_FINISHED > > > I0805 19:36:07.146595 16369 master.hpp:323] Removing task > > > ct:1375731364150:0:level3 with resources cpus(*):1; mem(*):1024; > > > disk(*):1024 on slave 201308022130-200446986-5050-11821-1 > > > I0805 19:36:07.147125 16369 hierarchical_allocator_process.hpp:595] > > > Recovered cpus(*):1; mem(*):1024; disk(*):1024 (total allocatable: > > > cpus(*):1.4; mem(*):1275; disk(*):610649; ports(*):[31002-32000]) on > > slave > > > 201308022130-200446986-5050-11821-1 from framework chronos > > > I0805 19:36:07.971992 16363 master.cpp:1366] Sending 1 offers to > > framework > > > chronos > > > I0805 19:36:07.973424 16374 master.cpp:1599] Processing reply for offer > > > 201308051833-200446986-5050-16348-5068 on slave > > > 201308022130-200446986-5050-11821-1 (i-d7b911ae) for framework chronos > > > I0805 19:36:07.973973 16367 hierarchical_allocator_process.hpp:548] > > > Framework chronos filtered slave 201308022130-200446986-5050-11821-1 > for > > > 5secs > > > I0805 19:36:08.102439 16364 master.cpp:1130] Status update from > slave(1)@ > > > 10.148.197.225:5051: task ct:1375731364150:0:level2 of framework > chronos > > > is now in state TASK_FINISHED > > > I0805 19:36:08.102907 16364 master.hpp:323] Removing task > > > ct:1375731364150:0:level2 with resources cpus(*):1; mem(*):1024; > > > disk(*):1024 on slave 201308022130-200446986-5050-11821-0 > > > > > > > > I0805 19:36:06.982079 14728 slave.cpp:747] Got assigned task > > > ct:1375731364150:0:level2 for framework chronos > > > I0805 19:36:06.983682 14728 gc.cpp:84] Unscheduling > > > > > > '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos' > > > for removal > > > I0805 19:36:06.984241 14746 slave.cpp:845] Launching task > > > ct:1375731364150:0:level2 for framework chronos > > > I0805 19:36:06.986026 14746 paths.hpp:303] Created executor directory > > > > > > '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos/executors/ct:137573136 > > > 4150:0:level2/runs/7869f30a-7700-4cd7-8bb4-4a8ed618b5ce' > > > I0805 19:36:06.986639 14746 slave.cpp:956] Queuing task > > > 'ct:1375731364150:0:level2' for executor ct:1375731364150:0:level2 of > > > framework 'chronos > > > I0805 19:36:06.987416 14749 cgroups_isolator.cpp:513] Launching > > > ct:1375731364150:0:level2 (/usr/local/libexec/mesos/mesos-executor) in > > > /tmp/mesos/slaves/201308022130- > > > > > > 200446986-5050-11821-0/frameworks/chronos/executors/ct:1375731364150:0:level2/runs/7869f30a-7700-4cd7-8bb4-4a8ed618b5ce > > > with resources for framework chronos in cgrou > > > p > > > > > > mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce > > > I0805 19:36:06.988970 14746 slave.cpp:519] Successfully attached file > > > > > > '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos/executors/ct:137573136 > > > 4150:0:level2/runs/7869f30a-7700-4cd7-8bb4-4a8ed618b5ce' > > > I0805 19:36:06.991384 14749 cgroups_isolator.cpp:656] Changing cgroup > > > controls for executor ct:1375731364150:0:level2 of framework chronos > with > > > resources > > > I0805 19:36:06.993352 14749 cgroups_isolator.cpp:1053] Started > listening > > > for OOM events for executor ct:1375731364150:0:level2 of framework > > chronos > > > I0805 19:36:06.995985 14749 cgroups_isolator.cpp:543] Forked executor > at > > = > > > 25602 > > > I0805 19:36:07.099602 14739 slave.cpp:1423] Got registration for > executor > > > 'ct:1375731364150:0:level2' of framework chronos > > > I0805 19:36:07.100589 14739 slave.cpp:1538] Flushing queued task > > > ct:1375731364150:0:level2 for executor 'ct:1375731364150:0:level2' of > > > framework chronos > > > I0805 19:36:07.100688 14738 cgroups_isolator.cpp:656] Changing cgroup > > > controls for executor ct:1375731364150:0:level2 of framework chronos > with > > > resources cpus(*):1; m > > > em(*):1024; disk(*):1024 > > > I0805 19:36:07.102645 14738 cgroups_isolator.cpp:964] Updated > > > 'cpu.cfs_period_us' to 100000 and 'cpu.cfs_quota_us' to 100000 for > > executor > > > ct:1375731364150:0:level2 of > > > framework chronos > > > I0805 19:36:07.103688 14738 cgroups_isolator.cpp:889] Updated > > 'cpu.shares' > > > to 1024 for executor ct:1375731364150:0:level2 of framework chronos > > > I0805 19:36:07.104630 14738 cgroups_isolator.cpp:1027] Updated > > > 'memory.soft_limit_in_bytes' to 1073741824 for executor > > > ct:1375731364150:0:level2 of framework chronos > > > I0805 19:36:07.105954 14728 slave.cpp:1720] Handling status update > > > TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for task > > > ct:1375731364150:0:level2 of framework chronos from executor(1)@ > > > 10.148.197.225:45297 > > > I0805 19:36:07.106722 14741 status_update_manager.cpp:290] Received > > status > > > update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for > task > > > ct:1375731364150:0:level2 of framework chronos with checkpoint=false > > > I0805 19:36:07.107151 14741 status_update_manager.cpp:450] Creating > > > StatusUpdate stream for task ct:1375731364150:0:level2 of framework > > chronos > > > I0805 19:36:07.107681 14741 status_update_manager.cpp:336] Forwarding > > > status update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) > > for > > > task ct:1375731364150:0:level2 of framework chronos to > > > [email protected]:5050 > > > I0805 19:36:07.108211 14728 slave.cpp:1837] Sending acknowledgement for > > > status update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) > > for > > > task ct:1375731364150:0:level2 of framework chronos to executor(1)@ > > > 10.148.197.225:45297 > > > I0805 19:36:07.110834 14735 status_update_manager.cpp:360] Received > > status > > > update acknowledgement 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5 for task > > > ct:1375731364150:0:level2 of framework chronos > > > I0805 19:36:07.105954 14728 slave.cpp:1720] Handling status update > > > TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for task > > > ct:1375731364150:0:level2 of framework chronos from executor(1)@ > > > 10.148.197.225:45297 > > > I0805 19:36:07.106722 14741 status_update_manager.cpp:290] Received > > status > > > update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for > task > > > ct:1375731364150:0:level2 of framework chronos with checkpoint=false > > > I0805 19:36:07.107151 14741 status_update_manager.cpp:450] Creating > > > StatusUpdate stream for task ct:1375731364150:0:level2 of framework > > chronos > > > I0805 19:36:07.107681 14741 status_update_manager.cpp:336] Forwarding > > > status update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) > > for > > > task ct:1375731364150:0:level2 of framework chronos to > > > [email protected]:5050 > > > I0805 19:36:07.108211 14728 slave.cpp:1837] Sending acknowledgement for > > > status update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) > > for > > > task ct:1375731364150:0:level2 of framework chronos to executor(1)@ > > > 10.148.197.225:45297 > > > I0805 19:36:07.110834 14735 status_update_manager.cpp:360] Received > > status > > > update acknowledgement 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5 for task > > > ct:1375731364150:0:level2 of framework chronos > > > I0805 19:36:08.100355 14741 slave.cpp:1720] Handling status update > > > TASK_FINISHED (UUID: 284d4591-06b4-410c-b237-5d740a78ad88) for task > > > ct:1375731364150:0:level2 of framework chronos from executor(1)@ > > > 10.148.197.225:45297 > > > I0805 19:36:08.100966 14739 status_update_manager.cpp:290] Received > > status > > > update TASK_FINISHED (UUID: 284d4591-06b4-410c-b237-5d740a78ad88) for > > task > > > ct:1375731364150:0:level2 of framework chronos with checkpoint=false > > > I0805 19:36:08.101403 14739 status_update_manager.cpp:336] Forwarding > > > status update TASK_FINISHED (UUID: > 284d4591-06b4-410c-b237-5d740a78ad88) > > > for task ct:1375731364150:0:level2 of framework chronos to > > > [email protected]:5050 > > > I0805 19:36:08.101419 14740 cgroups_isolator.cpp:656] Changing cgroup > > > controls for executor ct:1375731364150:0:level2 of framework chronos > with > > > resources > > > I0805 19:36:08.102355 14750 slave.cpp:1837] Sending acknowledgement for > > > status update TASK_FINISHED (UUID: > 284d4591-06b4-410c-b237-5d740a78ad88) > > > for task ct:1375731364150:0:level2 of framework chronos to executor(1)@ > > > 10.148.197.225:45297 > > > I0805 19:36:08.114512 14727 status_update_manager.cpp:360] Received > > status > > > update acknowledgement 284d4591-06b4-410c-b237-5d740a78ad88 for task > > > ct:1375731364150:0:level2 of framework chronos > > > I0805 19:36:08.115033 14727 status_update_manager.cpp:481] Cleaning up > > > status update stream for task ct:1375731364150:0:level2 of framework > > chronos > > > I0805 19:36:09.707617 14741 cgroups_isolator.cpp:851] Executor > > > ct:1375731364150:0:level2 of framework chronos terminated with status 0 > > > I0805 19:36:09.708048 14741 cgroups_isolator.cpp:619] Killing executor > > > ct:1375731364150:0:level2 of framework chronos > > > I0805 19:36:09.709691 14741 cgroups_isolator.cpp:1073] OOM notifier is > > > triggered for executor ct:1375731364150:0:level2 of framework chronos > > with > > > uuid 7869f30a-7700-4cd7-8bb4-4a8ed618b5ce > > > I0805 19:36:09.710186 14741 cgroups_isolator.cpp:1078] Discarded OOM > > > notifier for executor ct:1375731364150:0:level2 of framework chronos > with > > > uuid 7869f30a-7700-4cd7-8bb4-4a8ed618b5ce > > > I0805 19:36:09.715747 14741 cgroups.cpp:1193] Trying to freeze cgroup > > > > > > /cgroup/mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce > > > I0805 19:36:09.718188 14741 cgroups.cpp:1232] Successfully froze cgroup > > > > > > /cgroup/mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce > > > after 1 attempts > > > I0805 19:36:09.724428 14731 cgroups.cpp:1208] Trying to thaw cgroup > > > > > > /cgroup/mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce > > > I0805 19:36:09.724879 14731 cgroups.cpp:1318] Successfully thawed > > > > > > /cgroup/mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce > > > I0805 19:36:09.882351 14740 cgroups_isolator.cpp:1192] Successfully > > > destroyed cgroup > > > > > > mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce > > > I0805 19:36:09.884524 14722 slave.cpp:2088] Executor > > > 'ct:1375731364150:0:level2' of framework chronos has exited with > status 0 > > > I0805 19:36:09.885027 14722 slave.cpp:2223] Cleaning up executor > > > 'ct:1375731364150:0:level2' of framework chronos > > > I0805 19:36:09.889178 14740 gc.cpp:56] Scheduling > > > > > > '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos/executors/ct:1375731364150:0:level2/runs/7869f30a-7700-4cd7-8bb4-4a8ed618b5ce' > > > for removal > > > I0805 19:36:09.889292 14722 slave.cpp:2287] Cleaning up framework > chronos > > > I0805 19:36:09.889802 14740 gc.cpp:56] Scheduling > > > > > > '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos/executors/ct:1375731364150:0:level2' > > > for removal > > > I0805 19:36:09.890043 14728 status_update_manager.cpp:252] Closing > status > > > update streams for framework chronos > > > I0805 19:36:09.890599 14740 gc.cpp:56] Scheduling > > > > > > '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos' > > > for removal > > > I0805 19:36:38.089031 14736 slave.cpp:2539] Current usage 0.47%. Max > > > allowed age: 1.790501830619676days > > > I0805 19:37:01.998739 14748 slave.cpp:2446] Framework chronos seems to > > > have exited. Ignoring registration timeout for executor > > > 'ct:1375731357036:0:level3' > > > I0805 19:37:04.006137 14735 slave.cpp:2446] Framework chronos seems to > > > have exited. Ignoring registration timeout for executor > > > 'ct:1375731357119:0:echo hi' > > > I0805 19:37:06.987412 14724 slave.cpp:2446] Framework chronos seems to > > > have exited. Ignoring registration timeout for executor > > > 'ct:1375731364150:0:level2' > > >
