Hi,

On Tue, Jun 17, 2008 at 11:16:00AM +0900, HIDEO YAMAUCHI wrote:
> Hi,
>
> I confirmed behavior of the time-out of the run time of STONITH.(Heartbeat
> 2.1.3 and ibmrsa-telnet)
>
> I confirmed it by the next sequence.
>
> 1)Start Heartbeat in two nodes.
> 2)Hung up in one node.
> 3)Time-out in STONITH.(Put a sleep code or drop all power supplies of the
> node.)
>
> But, unlike normal RA, plural RA of STONITH are started.
>
> I think that RA of STONITH should be started again after I was
> murdered properly like normal RA.

And that is what happens. After one stonith reset operation
fails, this time due to the timeout, another one is scheduled,
i.e. another stonith resource instance is started.

> //-------The state of the ps command
> Last login: Tue Jun 17 10:00:54 2008 from 172.30.96.92
> [EMAIL PROTECTED] ~]# ps -ef |grep ibm
> root 4562 1 0 Jun12 ? 00:00:00 /sbin/ibmasm
> root 4823 4562 0 Jun12 ? 00:00:00 /sbin/ibmasm
> root 11913 11912 0 10:23 ? 00:00:00 /usr/bin/python
> /usr/lib64/stonith/plugins/external/ibmrsa-telnet reset x3650a
> root 11947 11917 0 10:23 pts/1 00:00:00 grep ibm
> [EMAIL PROTECTED] ~]# ps -ef |grep ibm
> root 4562 1 0 Jun12 ? 00:00:00 /sbin/ibmasm
> root 4823 4562 0 Jun12 ? 00:00:00 /sbin/ibmasm
> root 11913 1 0 10:23 ? 00:00:00 /usr/bin/python
> /usr/lib64/stonith/plugins/external/ibmrsa-telnet reset x3650a
> root 11962 1 0 10:26 ? 00:00:00 /usr/bin/python
> /usr/lib64/stonith/plugins/external/ibmrsa-telnet reset x3650a
> root 11977 1 0 10:29 ? 00:00:00 /usr/bin/python
> /usr/lib64/stonith/plugins/external/ibmrsa-telnet reset x3650a
> root 11994 11993 0 10:32 ? 00:00:00 /usr/bin/python
> /usr/lib64/stonith/plugins/external/ibmrsa-telnet reset x3650a

As you can see, the instances are started three minutes apart
from each other. Though I wonder why the earlier ones remain. The
only possible explanation is that the stonith resource forks a
new process, though I don't know ibmrsa-telnet well enough to
confirm that. From the logs:

stonithd[11751]: 2008/06/17_10:41:33 WARN: Managed external_r_stonith-node01_1 process 12047 killed by signal 9 [SIGKILL - Kill, unblockable].
stonithd[11751]: 2008/06/17_10:41:33 WARN: child exits, but not tracked.

and from the process list:

root 12048 0.0 0.0 108736 4228 ? S 10:38 0:00 /usr/bin/python /usr/lib64/stonith/plugins/external/ibmrsa-telnet reset x3650a

The process 12047, which was probably the parent of PID 12048,
had been killed due to the timeout. Since it was brutally removed
by signal 9, it had no chance to clean up, and its child remained.

This should probably be changed, i.e. the process should first
be sent a TERM signal so that it has a chance to notify its
children and otherwise do a proper cleanup. I opened a bugzilla
for this issue:

http://developerbugs.linux-foundation.org/show_bug.cgi?id=1922

Thanks,

Dejan
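
For illustration only, here is a minimal Python sketch of the TERM-before-KILL idea described above. It is not the stonithd code and not a proposed patch. The function name stop_process_group and the grace_seconds parameter are made up for this example, and it assumes the managed process was started in its own session, so that one signal can reach the process and its children together.

```python
import os
import signal
import subprocess


def stop_process_group(proc, grace_seconds=5.0):
    """Stop proc and its children: SIGTERM first, SIGKILL only if needed.

    Assumption: proc was started with start_new_session=True, so it
    leads its own process group and proc.pid is also the group id.
    Returns the name of the last signal that had to be sent.
    Limitation: only the group leader is waited for; a child that
    ignores SIGTERM can still survive if the leader exits in time.
    """
    pgid = proc.pid
    try:
        # Polite request: handlers in the leader and its children can
        # run and clean up.
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        # The whole group is already gone.
        proc.wait()
        return "none"
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        pass
    try:
        # Grace period expired: kill the whole group, not just the
        # leader, so that no child is left behind.
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    proc.wait()
    return "SIGKILL"
```

A caller would start the plugin with something like subprocess.Popen(cmd, start_new_session=True) and call stop_process_group(proc) when the operation times out. Signalling the group rather than the single PID is what keeps a forked child such as PID 12048 from being orphaned.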
