----- Original Message -----
> From: "Kazunori INOUE" <inouek...@intellilink.co.jp>
> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
> Sent: Thursday, August 2, 2012 4:37:45 AM
> Subject: Re: [Pacemaker] Timeout value of STONITH resource is too large
>
> Hi David,
>
> The resource (a process forked from stonithd) stays around until the
> sleep completes; stonithd does not call st_child_term().
>
> I set a timeout of 61 seconds on the start operation, so the value of
> track->timeout in run_stonith_agent() is 61000 (ms). However,
> st_child_term() is not called until '1000*track->timeout' milliseconds
> have elapsed; that is, approximately _17 hours_ later.
Whoa, that isn't good at all! Please create a bug report for this.
Include your config.

-- Vossel

> - lib/fencing/st_client.c
> 476 run_stonith_agent(const char *agent, const char *action, const char *victim,
> 477                   GHashTable * device_args, GHashTable * port_map, int *agent_result, char **output,
> 478                   async_command_t * track)
>
> 541     if (track->timeout) {
> 542         track->pid = pid;
> 543         track->timer_sigterm = g_timeout_add(1000*track->timeout, st_child_term, track);
> 544         track->timer_sigkill = g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
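Looking at the lines you quote, track->timeout already holds
milliseconds (your gdb session below prints 61000 for a 61s start
timeout), and g_timeout_add() takes its interval in milliseconds to
begin with, so the extra 1000x scaling looks like the culprit. A
minimal, untested sketch of the arming I would expect, assuming the
field really is in ms everywhere (this is an illustration, not the
actual patch):

    if (track->timeout) {
        track->pid = pid;
        /* track->timeout is already in milliseconds, so pass it to
         * g_timeout_add() unscaled; escalate from SIGTERM to SIGKILL
         * five seconds later. */
        track->timer_sigterm = g_timeout_add(track->timeout, st_child_term, track);
        track->timer_sigkill = g_timeout_add(track->timeout + 5000, st_child_kill, track);
    }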
> Best Regards,
> Kazunori INOUE
>
> (12.08.01 00:01), David Vossel wrote:
> > ----- Original Message -----
> >> From: "Kazunori INOUE" <inouek...@intellilink.co.jp>
> >> To: "pacemaker@oss" <pacemaker@oss.clusterlabs.org>
> >> Cc: shimaza...@intellilink.co.jp
> >> Sent: Monday, July 30, 2012 5:13:40 AM
> >> Subject: [Pacemaker] Timeout value of STONITH resource is too large
> >>
> >> Hi,
> >>
> >> I am using Pacemaker-1.1.
> >> - glue (2012 Jul 16)      2719:18489f275f75
> >> - libqb (2012 Jul 19)     11b20e19beff7f1b6003be0b4c73da8ecf936442
> >> - corosync (2012 Jul 12)  908ed7dcb390c0eade3dddb0cdfe181eb26b2ce2
> >> - pacemaker (2012 Jul 29) 33119da31c235710195c783e5c9a32c6e95b3efc
> >>
> >> The timeout value of the _start_ operation of a STONITH resource is
> >> too large. Therefore, even after the start operation has timed out,
> >> the plugin process remains.
> >
> > How long after the timeout occurs does the process stay around?
> > Does it terminate a few seconds after the timeout, or does the
> > resource wait until the entire duration of the sleep 3600 finishes?
> >
> > -- Vossel
> >
> >> The following is gdb output from when the STONITH resource starts.
> >> ----
> >> [root@dev1 ~]# gdb /usr/libexec/pacemaker/stonithd `pgrep stonithd`
> >> (gdb) b run_stonith_agent
> >> Breakpoint 1 at 0x7f03f1e00d69: file st_client.c, line 479.
> >> (gdb) c
> >> Continuing.
> >>
> >> Breakpoint 1, run_stonith_agent (agent=0xe0f820 "fence_legacy",
> >>     action=0xe11fb0 "monitor",
> >> <snip>
> >> 479     {
> >> (gdb) bt
> >> #0  run_stonith_agent (agent=0xe0f820 "fence_legacy", action=0xe11fb0 "monitor",
> >>     victim=0x0, device_args=0xcffe30, port_map=0xcffe80,
> >>     agent_result=0x7fff70214ef4, output=0x0, track=0xe11d20) at st_client.c:479
> >> #1  0x0000000000406230 in stonith_device_execute (device=0xe10ff0) at commands.c:140
> >> #2  0x0000000000406404 in stonith_device_dispatch (user_data=0xe10ff0) at commands.c:160
> >> #3  0x00007f03f224ad00 in crm_trigger_dispatch (source=0xe11160,
> >>     callback=0x4063dd <stonith_device_dispatch>, userdata=0xe11160) at mainloop.c:105
> >> #4  0x0000003642638f0e in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> >> #5  0x000000364263c938 in ?? () from /lib64/libglib-2.0.so.0
> >> #6  0x000000364263cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
> >> #7  0x00000000004056dc in main (argc=1, argv=0x7fff70215278) at main.c:853
> >> (gdb) n 15
> >> Detaching after fork from child process 28915.
> >> 510         if (pid) {
> >> (gdb) n 15
> >> 542             track->pid = pid;
> >> (gdb) list
> >> 537             track->stdout = p_read_fd;
> >> 538             g_child_watch_add(pid, track->done, track);
> >> 539             crm_trace("Op: %s on %s, pid: %d, timeout: %ds", action, agent, pid, track->timeout);
> >> 540
> >> 541             if (track->timeout) {
> >> 542                 track->pid = pid;
> >> 543                 track->timer_sigterm = g_timeout_add(1000*track->timeout, st_child_term, track);
> >> 544                 track->timer_sigkill = g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
> >> 545
> >> 546             } else {
> >> (gdb) n
> >> 543             track->timer_sigterm = g_timeout_add(1000*track->timeout, st_child_term, track);
> >> (gdb) n
> >> 544             track->timer_sigkill = g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
> >> (gdb) p agent
> >> $1 = 0xe0f820 "fence_legacy"
> >> (gdb) p action
> >> $2 = 0xe11fb0 "monitor"
> >> (gdb) p args
> >> $3 = 0xe11500 "plugin=external/libvirt\nhostlist=dev2\nhypervisor_uri=qemu+ssh://n8/system\noption=monitor\n"
> >> * (gdb) p track->timeout
> >> $4 = 61000
> >> * (gdb) p 1000*track->timeout
> >> $5 = 61000000
> >> ----
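That also squares with the arithmetic: g_timeout_add() intervals are in
milliseconds, so 1000 * 61000 = 61,000,000 ms = 61,000 s, or just under
17 hours before the SIGTERM timer would ever fire.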
> >> 1. I added "sleep 3600" to status() of
> >>    /usr/lib64/stonith/plugins/external/libvirt.
> >>
> >>    [root@dev1 external]# diff -u libvirt.ORG libvirt
> >>    --- libvirt.ORG 2012-07-17 13:10:01.000000000 +0900
> >>    +++ libvirt     2012-07-30 13:36:19.661431208 +0900
> >>    @@ -221,6 +221,7 @@
> >>             ;;
> >>
> >>         status)
> >>    +        sleep 3600
> >>             libvirt_check_config
> >>             libvirt_status
> >>             exit $?
> >>
> >> 2. service corosync start ; service pacemaker start
> >> 3. cibadmin -U -x test.xml
> >> 4. When I wait for 61 seconds (the timeout value of start),
> >>
> >>    [root@dev1 ~]# crm_mon -rf1
> >>    ============
> >>    Last updated: Mon Jul 30 13:18:48 2012
> >>    Last change: Mon Jul 30 13:15:08 2012 via cibadmin on dev1
> >>    Stack: corosync
> >>    Current DC: dev1 (-1788499776) - partition with quorum
> >>    Version: 1.1.7-33119da
> >>    2 Nodes configured, unknown expected votes
> >>    1 Resources configured.
> >>    ============
> >>
> >>    Online: [ dev1 dev2 ]
> >>
> >>    Full list of resources:
> >>
> >>    f-2 (stonith:external/libvirt): Started dev1 FAILED
> >>
> >>    Migration summary:
> >>    * Node dev2:
> >>    * Node dev1:
> >>      f-2: migration-threshold=1 fail-count=1000000
> >>
> >>    Failed actions:
> >>    * f-2_start_0 (node=dev1, call=-1, rc=1, status=Timed Out): unknown error
> >>
> >> [root@dev1 ~]# ps -ef | egrep "UID|corosync|pacemaker|stonith|fence|sleep"
> >>   UID    PID  PPID  C STIME TTY     TIME CMD
> >>   root 28840     1  0 13:13 ?   00:00:01 corosync
> >>   root 28858     1  0 13:13 ?   00:00:00 pacemakerd
> >>   496  28860 28858  0 13:13 ?   00:00:00 /usr/libexec/pacemaker/cib
> >>   root 28861 28858  0 13:13 ?   00:00:00 /usr/libexec/pacemaker/stonithd
> >>   root 28862 28858 73 13:13 ?   00:04:16 /usr/libexec/pacemaker/lrmd
> >>   496  28863 28858  0 13:13 ?   00:00:00 /usr/libexec/pacemaker/attrd
> >>   496  28864 28858  0 13:13 ?   00:00:00 /usr/libexec/pacemaker/pengine
> >>   496  28865 28858 51 13:13 ?   00:02:58 /usr/libexec/pacemaker/crmd
> >> * root 28915 28861  0 13:15 ?   00:00:00 /usr/bin/perl /usr/sbin/fence_legacy
> >> * root 28916 28915  0 13:15 ?   00:00:00 stonith -t external/libvirt -E -S
> >> * root 28921 28916  0 13:15 ?   00:00:00 /bin/sh /usr/lib64/stonith/plugins/external/libvirt status
> >>   root 28925 28921  0 13:15 ?   00:00:00 sleep 3600
> >>
> >> [root@dev1 ~]# top -bn1
> >> top - 13:21:26 up 5 days, 3:23, 5 users, load average: 1.99, 1.42, 0.72
> >> Tasks: 198 total, 3 running, 195 sleeping, 0 stopped, 0 zombie
> >> Cpu(s): 0.7%us, 0.7%sy, 0.0%ni, 98.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> >> Mem:  5089052k total, 2423104k used, 2665948k free,  265756k buffers
> >> Swap: 1048568k total,       0k used, 1048568k free, 1717712k cached
> >>
> >>     PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
> >> * 28862 root     20  0 83816 3412 2572 R 98.2  0.1  6:17.18 lrmd
> >> * 28865 hacluste 20  0  166m 6380 3428 R 98.2  0.1  4:59.84 crmd
> >>   28860 hacluste 20  0 93888 7192 4472 S  2.0  0.1  0:00.23 cib
> >>   29052 root     20  0 15024 1136  792 R  2.0  0.0  0:00.01 top
> >>       1 root     20  0 19348 1520 1212 S  0.0  0.0  0:00.77 init
> >>       2 root     20  0     0    0    0 S  0.0  0.0  0:00.00 kthreadd
> >>       3 root     RT  0     0    0    0 S  0.0  0.0  0:06.85 migration/0
> >>       4 root     20  0     0    0    0 S  0.0  0.0 14:25.15 ksoftirqd/0
> >>       5 root     RT  0     0    0    0 S  0.0  0.0  0:00.10 migration/0
> >>
> >> Best Regards,
> >> Kazunori INOUE

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org