Hi,

I am using Pacemaker-1.1.
- glue      (2012 Jul 16) 2719:18489f275f75
- libqb     (2012 Jul 19) 11b20e19beff7f1b6003be0b4c73da8ecf936442
- corosync  (2012 Jul 12) 908ed7dcb390c0eade3dddb0cdfe181eb26b2ce2
- pacemaker (2012 Jul 29) 33119da31c235710195c783e5c9a32c6e95b3efc

The timeout value used for the _start_ operation of a STONITH resource is too large.
Therefore, even after the start operation has timed out, the plugin process remains.

The following is a gdb session taken while the STONITH resource was starting.

----
[root@dev1 ~]# gdb /usr/libexec/pacemaker/stonithd `pgrep stonithd`
(gdb) b run_stonith_agent
Breakpoint 1 at 0x7f03f1e00d69: file st_client.c, line 479.
(gdb) c
Continuing.

Breakpoint 1, run_stonith_agent (agent=0xe0f820 "fence_legacy", action=0xe11fb0 "monitor", <snip>
479     {
(gdb) bt
#0  run_stonith_agent (agent=0xe0f820 "fence_legacy", action=0xe11fb0 "monitor", victim=0x0,
    device_args=Traceback (most recent call last):0xcffe30, port_map=
    Traceback (most recent call last):0xcffe80, agent_result=0x7fff70214ef4, output=0x0,
    track=0xe11d20) at st_client.c:479
#1  0x0000000000406230 in stonith_device_execute (device=0xe10ff0) at commands.c:140
#2  0x0000000000406404 in stonith_device_dispatch (user_data=0xe10ff0) at commands.c:160
#3  0x00007f03f224ad00 in crm_trigger_dispatch (source=0xe11160, callback=
    0x4063dd <stonith_device_dispatch>, userdata=0xe11160) at mainloop.c:105
#4  0x0000003642638f0e in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#5  0x000000364263c938 in ?? () from /lib64/libglib-2.0.so.0
#6  0x000000364263cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
#7  0x00000000004056dc in main (argc=1, argv=0x7fff70215278) at main.c:853
(gdb) n 15
Detaching after fork from child process 28915.
510         if (pid) {
(gdb) n 15
542             track->pid = pid;
(gdb) list
537             track->stdout = p_read_fd;
538             g_child_watch_add(pid, track->done, track);
539             crm_trace("Op: %s on %s, pid: %d, timeout: %ds", action, agent, pid, track->timeout);
540
541             if (track->timeout) {
542                 track->pid = pid;
543                 track->timer_sigterm = g_timeout_add(1000*track->timeout, st_child_term, track);
544                 track->timer_sigkill = g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
545
546             } else {
(gdb) n
543                 track->timer_sigterm = g_timeout_add(1000*track->timeout, st_child_term, track);
(gdb) n
544                 track->timer_sigkill = g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
(gdb) p agent
$1 = 0xe0f820 "fence_legacy"
(gdb) p action
$2 = 0xe11fb0 "monitor"
(gdb) p args
$3 = 0xe11500 "plugin=external/libvirt\nhostlist=dev2\nhypervisor_uri=qemu+ssh://n8/system\noption=monitor\n"
* (gdb) p track->timeout
$4 = 61000
* (gdb) p 1000*track->timeout
$5 = 61000000
----

1. I added "sleep 3600" to status() of /usr/lib64/stonith/plugins/external/libvirt.

[root@dev1 external]# diff -u libvirt.ORG libvirt
--- libvirt.ORG 2012-07-17 13:10:01.000000000 +0900
+++ libvirt     2012-07-30 13:36:19.661431208 +0900
@@ -221,6 +221,7 @@
     ;;

 status)
+    sleep 3600
     libvirt_check_config
     libvirt_status
     exit $?

2. service corosync start ; service pacemaker start

3. cibadmin -U -x test.xml

4. After waiting 61 seconds (the timeout value of start), the start operation
   has timed out, but the plugin processes remain:

[root@dev1 ~]# crm_mon -rf1
============
Last updated: Mon Jul 30 13:18:48 2012
Last change: Mon Jul 30 13:15:08 2012 via cibadmin on dev1
Stack: corosync
Current DC: dev1 (-1788499776) - partition with quorum
Version: 1.1.7-33119da
2 Nodes configured, unknown expected votes
1 Resources configured.
============

Online: [ dev1 dev2 ]

Full list of resources:

f-2 (stonith:external/libvirt): Started dev1 FAILED

Migration summary:
* Node dev2:
* Node dev1:
   f-2: migration-threshold=1 fail-count=1000000

Failed actions:
* f-2_start_0 (node=dev1, call=-1, rc=1, status=Timed Out): unknown error

[root@dev1 ~]# ps -ef|egrep "UID|corosync|pacemaker|stonith|fence|sleep"
  UID        PID  PPID  C STIME TTY          TIME CMD
  root     28840     1  0 13:13 ?        00:00:01 corosync
  root     28858     1  0 13:13 ?        00:00:00 pacemakerd
  496      28860 28858  0 13:13 ?        00:00:00 /usr/libexec/pacemaker/cib
  root     28861 28858  0 13:13 ?        00:00:00 /usr/libexec/pacemaker/stonithd
  root     28862 28858 73 13:13 ?        00:04:16 /usr/libexec/pacemaker/lrmd
  496      28863 28858  0 13:13 ?        00:00:00 /usr/libexec/pacemaker/attrd
  496      28864 28858  0 13:13 ?        00:00:00 /usr/libexec/pacemaker/pengine
  496      28865 28858 51 13:13 ?        00:02:58 /usr/libexec/pacemaker/crmd
* root     28915 28861  0 13:15 ?        00:00:00 /usr/bin/perl /usr/sbin/fence_legacy
* root     28916 28915  0 13:15 ?        00:00:00 stonith -t external/libvirt -E -S
* root     28921 28916  0 13:15 ?        00:00:00 /bin/sh /usr/lib64/stonith/plugins/external/libvirt status
  root     28925 28921  0 13:15 ?        00:00:00 sleep 3600

[root@dev1 ~]# top -bn1
top - 13:21:26 up 5 days,  3:23,  5 users,  load average: 1.99, 1.42, 0.72
Tasks: 198 total,   3 running, 195 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.7%us,  0.7%sy,  0.0%ni, 98.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   5089052k total,  2423104k used,  2665948k free,   265756k buffers
Swap:  1048568k total,        0k used,  1048568k free,  1717712k cached

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
* 28862 root      20   0 83816 3412 2572 R 98.2  0.1   6:17.18 lrmd
* 28865 hacluste  20   0  166m 6380 3428 R 98.2  0.1   4:59.84 crmd
  28860 hacluste  20   0 93888 7192 4472 S  2.0  0.1   0:00.23 cib
  29052 root      20   0 15024 1136  792 R  2.0  0.0   0:00.01 top
      1 root      20   0 19348 1520 1212 S  0.0  0.0   0:00.77 init
      2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
      3 root      RT   0     0    0    0 S  0.0  0.0   0:06.85 migration/0
      4 root      20   0     0    0    0 S  0.0  0.0  14:25.15 ksoftirqd/0
      5 root      RT   0     0    0    0 S  0.0  0.0   0:00.10 migration/0

Best Regards,
Kazunori INOUE

<cib admin_epoch="0" epoch="100" num_updates="0">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
        <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="true"/>
        <nvpair id="cib-bootstrap-options-startup-fencing" name="startup-fencing" value="false"/>
        <nvpair id="cib-bootstrap-options-stonith-timeout" name="stonith-timeout" value="60s"/>
        <nvpair id="cib-bootstrap-options-cluster-delay" name="cluster-delay" value="10s"/>
      </cluster_property_set>
    </crm_config>
    <resources>
      <primitive class="stonith" id="f-2" type="external/libvirt">
        <instance_attributes id="f-2-instance_attributes">
          <nvpair id="f-2-instance_attributes-hostlist" name="hostlist" value="dev2"/>
          <nvpair id="f-2-instance_attributes-hypervisor_uri" name="hypervisor_uri" value="qemu+ssh://n8/system"/>
        </instance_attributes>
        <operations>
          <op id="f-2-start-0" interval="0s" name="start" timeout="61s"/>
          <op id="f-2-monitor-3600" interval="3600s" name="monitor" timeout="60s"/>
          <op id="f-2-stop-0" interval="0s" name="stop" timeout="60s"/>
        </operations>
      </primitive>
    </resources>
    <constraints>
      <rsc_location id="rl-f-2" rsc="f-2">
        <rule id="rl-f-2-rule" score="100">
          <expression attribute="#uname" id="rl-f-2-exp" operation="eq" value="dev1"/>
        </rule>
        <rule id="rl-f-2-rule-0" score="-INFINITY">
          <expression attribute="#uname" id="rl-f-2-exp-0" operation="eq" value="dev2"/>
        </rule>
      </rsc_location>
    </constraints>
    <rsc_defaults>
      <meta_attributes id="rsc-options">
        <nvpair id="rsc-options-resource-stickiness" name="resource-stickiness" value="INFINITY"/>
        <nvpair id="rsc-options-migration-threshold" name="migration-threshold" value="1"/>
      </meta_attributes>
    </rsc_defaults>
  </configuration>
</cib>