Re: [Linux-HA] About time-out of STONITH.

Dejan Muhamedagic Tue, 17 Jun 2008 06:35:02 -0700

Hi,

On Tue, Jun 17, 2008 at 11:16:00AM +0900, HIDEO YAMAUCHI wrote:
> Hi,
> 
> I checked how a STONITH operation behaves when it times out (Heartbeat
> 2.1.3 and ibmrsa-telnet).
> 
> I tested it with the following sequence.
> 
> 1) Start Heartbeat on two nodes.
> 2) Hang one node.
> 3) Make STONITH time out (add a sleep to the plugin or cut all power to
> the node).
> 
> But, unlike a normal RA, multiple STONITH RA processes end up running.
> 
> I think the STONITH RA should be started again only after the previous
> instance has been killed properly, as with a normal RA.


And that is what happens. After one stonith reset operation
fails, this time due to the timeout, another one is scheduled,
i.e. another stonith resource instance is started.

> //-------The state of the ps command
> Last login: Tue Jun 17 10:00:54 2008 from 172.30.96.92
> [EMAIL PROTECTED] ~]# ps -ef |grep ibm
> root      4562     1  0 Jun12 ?        00:00:00 /sbin/ibmasm
> root      4823  4562  0 Jun12 ?        00:00:00 /sbin/ibmasm
> root     11913 11912  0 10:23 ?        00:00:00 /usr/bin/python /usr/lib64/stonith/plugins/external/ibmrsa-telnet reset x3650a
> root     11947 11917  0 10:23 pts/1    00:00:00 grep ibm
> [EMAIL PROTECTED] ~]# ps -ef |grep ibm
> root      4562     1  0 Jun12 ?        00:00:00 /sbin/ibmasm
> root      4823  4562  0 Jun12 ?        00:00:00 /sbin/ibmasm
> root     11913     1  0 10:23 ?        00:00:00 /usr/bin/python /usr/lib64/stonith/plugins/external/ibmrsa-telnet reset x3650a
> root     11962     1  0 10:26 ?        00:00:00 /usr/bin/python /usr/lib64/stonith/plugins/external/ibmrsa-telnet reset x3650a
> root     11977     1  0 10:29 ?        00:00:00 /usr/bin/python /usr/lib64/stonith/plugins/external/ibmrsa-telnet reset x3650a
> root     11994 11993  0 10:32 ?        00:00:00 /usr/bin/python /usr/lib64/stonith/plugins/external/ibmrsa-telnet reset x3650a

As you can see, the instances are started three minutes apart.
What I wonder is why the earlier ones remain. The only explanation
I can think of is that the stonith resource forks a new process,
though I don't know ibmrsa-telnet well enough to confirm that.
From the logs:

stonithd[11751]: 2008/06/17_10:41:33 WARN: Managed external_r_stonith-node01_1 process 12047 killed by signal 9 [SIGKILL - Kill, unblockable].
stonithd[11751]: 2008/06/17_10:41:33 WARN: child exits, but not tracked.

and from the process list:

root     12048  0.0  0.0 108736  4228 ?        S    10:38   0:00 /usr/bin/python /usr/lib64/stonith/plugins/external/ibmrsa-telnet reset x3650a

Process 12047, which was probably the parent of PID 12048, was
killed because of the timeout. Since it was removed abruptly
with signal 9 (SIGKILL), it had no chance to clean up, and its
child remained. This should probably be changed: the process
should first be sent a TERM signal so that it has a chance to
notify its children and otherwise do a proper cleanup.
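
To illustrate the TERM-before-KILL idea, here is a minimal sketch in
Python. It is not stonithd's code (stonithd is written in C), and the
names run_with_timeout and stop_process_group are made up for this
example. It assumes the managed process is started in its own session,
so that a signal sent to its process group also reaches any children it
forked:

```python
import os
import signal
import subprocess


def stop_process_group(proc, grace=5.0):
    """Stop proc and its process group: SIGTERM first, SIGKILL if needed.

    Assumes proc was started with start_new_session=True, so proc.pid is
    also the ID of a process group that contains its children.
    Returns "TERM" if the process exited within the grace period after
    SIGTERM, or "KILL" if SIGKILL had to be sent.
    """
    pgid = proc.pid
    try:
        os.killpg(pgid, signal.SIGTERM)   # polite request, can be handled
    except ProcessLookupError:
        proc.wait()                       # group already gone, just reap
        return "TERM"
    try:
        proc.wait(timeout=grace)          # give it time to clean up
        return "TERM"
    except subprocess.TimeoutExpired:
        pass
    try:
        os.killpg(pgid, signal.SIGKILL)   # last resort, hits children too
    except ProcessLookupError:
        pass
    proc.wait()
    return "KILL"


def run_with_timeout(argv, timeout, grace=5.0):
    """Run argv; on timeout, stop it and everything in its process group.

    Returns (exit_status, None) on normal completion, or
    (None, "TERM" or "KILL") if the timeout expired.
    """
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=timeout), None
    except subprocess.TimeoutExpired:
        return None, stop_process_group(proc, grace)
```

Signalling the whole group is one way to avoid the leftover child seen
above; the other is for the plugin itself to catch TERM and stop its
children before exiting. Note that this sketch does not look for
children that outlive the parent after a successful TERM.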

I opened a bugzilla for this issue:
http://developerbugs.linux-foundation.org/show_bug.cgi?id=1922

Thanks,

Dejan
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
