Hi All, We're running a two-node heartbeat-2.1.3-3.el4.centos cluster. One of our resources looks like this:
crm_resource -x -r R_blackboard_init
R_blackboard_init (lsb:blackboard): Started julia.uvt.nl
raw xml:
<primitive class="lsb" type="blackboard" provider="heartbeat"
id="R_blackboard_init">
<instance_attributes id="R_blackboard_init_instance_attrs">
<attributes/>
</instance_attributes>
<operations>
<op name="start" start_delay="0" disabled="false" role="Started"
id="39658f92-1bce-4429-9df0-06ca53a31477" timeout="180"/>
<op name="monitor" role="Started" start_delay="20s" timeout="90s"
interval="120s" disabled="false"
id="fbf56ef2-04ab-4e9f-9626-0a6d660dc0f0" on_fail="restart"/>
</operations>
<instance_attributes id="R_blackboard_init">
<attributes>
<nvpair name="is_managed" id="R_blackboard_init-is_managed"
value="true"/>
<nvpair name="target_role" id="R_blackboard_init-target_role"
value="started"/>
</attributes>
</instance_attributes>
<meta_attributes id="R_blackboard_init_meta_attrs">
<attributes/>
</meta_attributes>
</primitive>
The status part of the attached LSB blackboard init script uses wget to
check that the Blackboard application is still up:
echo "foo `wget -T $TIMEOUT -q $URL -O - `"| grep -q "Blackboard Learning System"
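As a minimal sketch of that check in isolation: the helper names page_has_marker and check_body below are hypothetical and not part of the attached script. The page body is passed in as an argument so the logic can be tried without a running web server. Note that `$?` always reflects the most recent command, so grep's status has to be saved before any logging echo runs.

```shell
#!/bin/sh
# Sketch only: page_has_marker and check_body are hypothetical helpers,
# not functions from the attached init script.

# Reads a page body on stdin; succeeds when the marker string is present.
page_has_marker() {
    grep -q "Blackboard Learning System"
}

# Takes the page body as $1 and returns grep's exit status.
check_body() {
    printf '%s\n' "$1" | page_has_marker
    rc=$?    # save the pipeline status before any other command runs
    echo "check finished with status $rc" >&2
    return $rc
}
```

In the real monitor the body would come from something like `wget -T $TIMEOUT -t 1 -q $URL -O -`. The `-t 1` (a single attempt instead of wget's default of 20 tries) is an assumption about what is wanted here, not something the current script does.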
Here's my problem:
Every so often (the last occurrence was 60 hours after restarting the
monitor operation), the monitor operation times out for no apparent
reason and the application is restarted.
Apr 27 00:18:45 julia lrmd: [2362]: WARN: R_blackboard_init:monitor process (PID 19713) timed out (try 1). Killing with signal SIGTERM (15).
Apr 27 00:18:45 julia lrmd: [2362]: WARN: Managed R_blackboard_init:monitor process 19713 killed by signal 15 [SIGTERM - Termination (ANSI)].
Apr 27 00:18:45 julia lrmd: [2362]: WARN: operation monitor[84] on lsb::blackboard::R_blackboard_init for client 2365, its parameters: is_managed=[true] target_role=[started] CRM_meta_interval=[120000] CRM_meta_role=[Started] CRM_meta_start_delay=[20000] CRM_meta_id=[fbf56ef2-04ab-4e9f-9626-0a6d660dc0f0] CRM_meta_on_fail=[restart] CRM_meta_timeout=[110000] crm_feature_set=[2.0] CRM_meta_disabled=[false] CRM_meta_name=[monitor] : pid [19713] timed out
Apr 27 00:18:45 julia crmd: [2365]: ERROR: process_lrm_event: LRM operation R_blackboard_init_monitor_120000 (84) Timed Out (timeout=110000ms)
To debug this I added some extra logging to the status function. This is
the log of a successful monitor operation:
Mon Apr 28 08:01:37 CEST 2008 start monitor op
Mon Apr 28 08:01:37 CEST 2008 start wget op
Mon Apr 28 08:01:37 CEST 2008 end wget op
Mon Apr 28 08:01:37 CEST 2008 blackboard is running
Mon Apr 28 08:01:37 CEST 2008 end monitor op
(end)
This is the log of the monitor operation that timed out:
Sun Apr 27 00:16:55 CEST 2008 start monitor op
Sun Apr 27 00:16:55 CEST 2008 start wget op
(end)
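It looks to me like the monitor operation just dies silently (or hangs)
after starting the wget. The timeout then occurs at 00:18:45, 110 s after
the monitor started at 00:16:55, which matches CRM_meta_timeout=[110000]
in the log, and the restart is triggered.
Does anybody have an idea why this occurs?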
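For reference, a rough time budget for the status function, using the values from the script and the log. This is only a sketch and rests on the optimistic assumption that a single wget call never takes longer than TIMEOUT seconds, which wget's -T option does not guarantee, since it bounds individual network waits rather than the whole run.

```shell
#!/bin/sh
# Values copied from the status() function and the lrmd log.
TIMEOUT=18          # wget -T value and the sleep between tries, in seconds
RETRIES=3
LRM_TIMEOUT=110     # CRM_meta_timeout=[110000], in seconds

# Assumption: each wget call takes at most TIMEOUT seconds (optimistic).
per_try=$((TIMEOUT + TIMEOUT))       # one wget plus one sleep
worst_case=$((RETRIES * per_try))    # every try fails and sleeps

echo "worst case: ${worst_case}s, lrmd limit: ${LRM_TIMEOUT}s"
echo "headroom: $((LRM_TIMEOUT - worst_case))s"
```

Even under that assumption the three retries leave only about two seconds of headroom before lrmd kills the monitor.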
thanks for your time,
regards,
Johan
#! /bin/sh
# $Id: blackboard 25936 2008-04-22 08:11:09Z jhoeke $
# $URL: https://its-unix-vc.uvt.nl/its-unix/group/bb7appprod/etc/init.d/blackboard $

# Init script for Blackboard. Calls official BB code to start, stop, or
# restart the complete system. Uses hack to find out whether the official
# stuff produced an error somewhere. In case of error, mails root the
# complete log file.
#
# JH 2003-01-28 Added proper subsys code to help stopping by init.
# JH 2003-01-20 Initial version.
#
# chkconfig: 35 95 05
# description: Starts/stops/restarts the complete Blackboard server system, \
#              using the official Blackboard start/stop commands internally.

##### CONFIG PART #########################################################

# Some local constants. Log file places may be changed; log files are
# not retained in between invocations of this script anyway (they are
# always cleaned out when the script finishes).
prog="blackboard"
subsys="/var/lock/subsys/$prog"
bbctl="/usr/local/blackboard/tools/admin/ServiceController.sh"
bblog="/tmp/bbctl.log"
errlog="/tmp/bbctl-error.log"

##### CODE BEGINS HERE ####################################################

# Source standard function library.
. /etc/rc.d/init.d/functions

start() {
    if [ ! -f $subsys ]; then
        # Start the BB system.
        echo -n $"Starting $prog: "
        echo "`date` start blackboard $HOSTNAME. if this is news to you, please investigate " | mailx -s "`date` start blackboard $HOSTNAME" $MAILTO
        $bbctl services.start > $bblog 2> $errlog
        if [ -s $errlog ]; then
            failure $"$base startup"
        else
            success $"$base startup"
            touch $subsys
        fi
        echo
    else
        echo $"cannot start $prog, $prog is already running"
    fi
    return
}

stop() {
    # Stop the BB system.
    echo -n $"Stopping $prog: "
    $bbctl services.stop > $bblog 2> $errlog
    su bbuser -c "killall -KILL java"
    if [ -s $errlog ] ; then
        failure $"$base shutdown"
    else
        success $"$base shutdown"
        rm -f $subsys
    fi
    echo
    return
}

restart() {
    # Restart the BB system.
    echo -n $"Restarting $prog: "
    # $bbctl services.restart > $bblog 2> $errlog
    $bbctl services.stop > $bblog 2> $errlog
    su bbuser -c "killall -KILL java"
    $bbctl services.start > $bblog 2> $errlog
    if [ -s $errlog ] ; then
        failure $"$base restart"
    else
        success $"$base restart"
        touch $subsys
    fi
    echo
    return
}

status() {
    # 2008-02-21 jhoeke
    # ugly hack to get 3 retries before blackboard is declared dead
    # this is used in the heartbeat monitor to restart heartbeat after 3 tries
    URL=http://localhost/webapps/login/
    TIMEOUT=18
    RETRIES=3
    TRY=1
    MONLOG=/var/tmp/R_bb_monitor.log
    if [ -f $MONLOG ] && [ `wc -l $MONLOG | awk '{print $1}'` -ne 5 ]; then
        # something other than normal in the log
        cat $MONLOG | mailx -s "`date` monlog wierdness" [EMAIL PROTECTED]
        mv $MONLOG $MONLOG.`date +%Y%m%d%H%M`
    fi
    echo `date` start monitor op > $MONLOG
    if [ -f $subsys ]; then
        while [ $TRY -le $RETRIES ]; do
            echo `date` start wget op >> $MONLOG
            echo "foo `wget -T $TIMEOUT -q $URL -O - `"| grep -q "Blackboard Learning System"
            echo `date` end wget op >> $MONLOG
            EXIT=$?
            if [ $EXIT -eq 0 ]; then
                echo `date` $prog is running
                echo `date` $prog is running >> $MONLOG
                echo `date` end monitor op>> $MONLOG
                return 0
            else
                # RETRY
                echo `date` retrying after $TIMEOUT seconds
                echo `date` wget or grep ERROR monitoring $URL >> $MONLOG
                echo `date` retrying after $TIMEOUT seconds >> $MONLOG
                sleep $TIMEOUT
                let TRY=TRY+1
            fi
        done
    fi
    echo `date` $prog is stopped
    echo `date` $prog is stopped >> $MONLOG
    echo `date` end monitor op >> $MONLOG
    return 3
}

# Remove possible stale log files.
rm -f $bblog
rm -f $errlog

MAILTO="[EMAIL PROTECTED]"

case "$1" in
    start)
        start
        ;;
    stop)
        echo "`date` $1 blackboard $HOSTNAME. if this is news to you, please investigate " | mailx -s "`date` $1 blackboard $HOSTNAME" $MAILTO
        stop
        ;;
    restart)
        echo "`date` $1 blackboard $HOSTNAME. if this is news to you, please investigate " | mailx -s "`date` $1 blackboard $HOSTNAME" $MAILTO
        restart
        ;;
    status)
        status
        exit $?
        ;;
    *)
        echo $"Usage: $0 {start|stop|restart|status}"
        exit 1
esac

# Remove the log files after they have been used. If you prefer, you may
# implement some form of rotation here, but then you should not put the
# files in /tmp/.
# Remove disables for testing
#rm -f $bblog
#rm -f $errlog

# Nobody seems to be looking at this exit code anyway.
exit 0
