Hi,

On Mon, Apr 28, 2008 at 08:18:32AM +0200, Johan Hoeke wrote:
> Hi All,
>
> We're running a 2 node heartbeat-2.1.3-3.el4.centos cluster. One of our
> resources look like this:
>
> crm_resource -x -r R_blackboard_init
> R_blackboard_init (lsb:blackboard): Started julia.uvt.nl
> raw xml:
> <primitive class="lsb" type="blackboard" provider="heartbeat" id="R_blackboard_init">
>   <instance_attributes id="R_blackboard_init_instance_attrs">
>     <attributes/>
>   </instance_attributes>
>   <operations>
>     <op name="start" start_delay="0" disabled="false" role="Started" id="39658f92-1bce-4429-9df0-06ca53a31477" timeout="180"/>
>     <op name="monitor" role="Started" start_delay="20s" timeout="90s" interval="120s" disabled="false" id="fbf56ef2-04ab-4e9f-9626-0a6d660dc0f0" on_fail="restart"/>
>   </operations>
>   <instance_attributes id="R_blackboard_init">
>     <attributes>
>       <nvpair name="is_managed" id="R_blackboard_init-is_managed" value="true"/>
>       <nvpair name="target_role" id="R_blackboard_init-target_role" value="started"/>
>     </attributes>
>   </instance_attributes>
>   <meta_attributes id="R_blackboard_init_meta_attrs">
>     <attributes/>
>   </meta_attributes>
> </primitive>
>
> The status part of the attached lsb blackboard init script actually does
> something w/ wget:
>
> echo "foo `wget -T $TIMEOUT -q $URL -O - `"| grep -q "Blackboard Learning System"
>
> in order to check that the blackboard app is still up.
>
> Here's my problem:
>
> Every so often, last occurrence 60 hours after restarting the monitor
> operation, the monitor operation will timeout for no apparent reason and
> restart the application.
>
> Apr 27 00:18:45 julia lrmd: [2362]: WARN: R_blackboard_init:monitor process (PID 19713) timed out (try 1). Killing with signal SIGTERM (15).
> Apr 27 00:18:45 julia lrmd: [2362]: WARN: Managed R_blackboard_init:monitor process 19713 killed by signal 15 [SIGTERM - Termination (ANSI)].
> Apr 27 00:18:45 julia lrmd: [2362]: WARN: operation monitor[84] on lsb::blackboard::R_blackboard_init for client 2365, its parameters: is_managed=[true] target_role=[started] CRM_meta_interval=[120000] CRM_meta_role=[Started] CRM_meta_start_delay=[20000] CRM_meta_id=[fbf56ef2-04ab-4e9f-9626-0a6d660dc0f0] CRM_meta_on_fail=[restart] CRM_meta_timeout=[110000] crm_feature_set=[2.0] CRM_meta_disabled=[false] CRM_meta_name=[monitor] : pid [19713] timed out
> Apr 27 00:18:45 julia crmd: [2365]: ERROR: process_lrm_event: LRM operation R_blackboard_init_monitor_120000 (84) Timed Out (timeout=110000ms)
>
> In order to debug this I have added some extra logs. This is log of a
> successful monitor operation:
>
> Mon Apr 28 08:01:37 CEST 2008 start monitor op
> Mon Apr 28 08:01:37 CEST 2008 start wget op
> Mon Apr 28 08:01:37 CEST 2008 end wget op
> Mon Apr 28 08:01:37 CEST 2008 blackboard is running
> Mon Apr 28 08:01:37 CEST 2008 end monitor op
> (end)
>
> This is the log of the monitor operation that timed out:
>
> Sun Apr 27 00:16:55 CEST 2008 start monitor op
> Sun Apr 27 00:16:55 CEST 2008 start wget op
> (end)
>
> It looks to me like the monitor operation just dies silently after
> starting the wget. Then after 120s the timeout occurs and the restart is
> triggered.
> Anybody have an idea why this occurs?

The operation definitely times out, probably because wget blocks
(what's your $TIMEOUT set to?). The monitor operation didn't
disappear and wouldn't without lrmd noticing. All processes are
managed. If you want to debug the script, use set -x to see what's
going on.

Thanks,

Dejan

> thanks for your time,
>
> regards,
>
> Johan

> #! /bin/sh
>
> # $Id: blackboard 25936 2008-04-22 08:11:09Z jhoeke $
> # $URL: https://its-unix-vc.uvt.nl/its-unix/group/bb7appprod/etc/init.d/blackboard $
>
> # Init script for Blackboard. Calls official BB code to start, stop, or
> # restart the complete system. Uses hack to find out whether the official
> # stuff produced an error somewhere. In case of error, mails root the
> # complete log file.
> #
> # JH 2003-01-28 Added proper subsys code to help stopping by init.
> # JH 2003-01-20 Initial version.
> #
> # chkconfig: 35 95 05
> # description: Starts/stops/restarts the complete Blackboard server system, \
> # using the official Blackboard start/stop commands internally.
>
> ##### CONFIG PART #########################################################
>
> # Some local constants. Log file places may be changed; log files are
> # not retained in between invocations of this script anyway (they are
> # always cleaned out when the script finishes).
> prog="blackboard"
> subsys="/var/lock/subsys/$prog"
> bbctl="/usr/local/blackboard/tools/admin/ServiceController.sh"
> bblog="/tmp/bbctl.log"
> errlog="/tmp/bbctl-error.log"
>
> ##### CODE BEGINS HERE ####################################################
>
> # Source standard function library.
> . /etc/rc.d/init.d/functions
>
> start() {
>     if [ ! -f $subsys ]; then
>         # Start the BB system.
>         echo -n $"Starting $prog: "
>         echo "`date` start blackboard $HOSTNAME. if this is news to you, please investigate " | mailx -s "`date` start blackboard $HOSTNAME" $MAILTO
>         $bbctl services.start > $bblog 2> $errlog
>         if [ -s $errlog ]; then
>             failure $"$base startup"
>         else
>             success $"$base startup"
>             touch $subsys
>         fi
>         echo
>     else
>         echo $"cannot start $prog, $prog is already running"
>     fi
>     return
> }
>
> stop() {
>     # Stop the BB system.
>     echo -n $"Stopping $prog: "
>     $bbctl services.stop > $bblog 2> $errlog
>     su bbuser -c "killall -KILL java"
>     if [ -s $errlog ] ; then
>         failure $"$base shutdown"
>     else
>         success $"$base shutdown"
>         rm -f $subsys
>     fi
>     echo
>     return
> }
>
> restart() {
>     # Restart the BB system.
>     echo -n $"Restarting $prog: "
>     # $bbctl services.restart > $bblog 2> $errlog
>     $bbctl services.stop > $bblog 2> $errlog
>     su bbuser -c "killall -KILL java"
>     $bbctl services.start > $bblog 2> $errlog
>     if [ -s $errlog ] ; then
>         failure $"$base restart"
>     else
>         success $"$base restart"
>         touch $subsys
>     fi
>     echo
>     return
> }
>
> status() {
>     # 2008-02-21 jhoeke
>     # ugly hack to get 3 retries before blackboard is declared dead
>     # this is used in the heartbeat monitor to restart heartbeat after 3 tries
>     URL=http://localhost/webapps/login/
>     TIMEOUT=18
>     RETRIES=3
>     TRY=1
>     MONLOG=/var/tmp/R_bb_monitor.log
>     if [ -f $MONLOG ] && [ `wc -l $MONLOG | awk '{print $1}'` -ne 5 ]; then
>         # something other than normal in the log
>         cat $MONLOG | mailx -s "`date` monlog wierdness" [EMAIL PROTECTED]
>         mv $MONLOG $MONLOG.`date +%Y%m%d%H%M`
>     fi
>     echo `date` start monitor op > $MONLOG
>     if [ -f $subsys ]; then
>         while [ $TRY -le $RETRIES ]; do
>             echo `date` start wget op >> $MONLOG
>             echo "foo `wget -T $TIMEOUT -q $URL -O - `"| grep -q "Blackboard Learning System"
>             echo `date` end wget op >> $MONLOG
>             EXIT=$?
>             if [ $EXIT -eq 0 ]; then
>                 echo `date` $prog is running
>                 echo `date` $prog is running >> $MONLOG
>                 echo `date` end monitor op>> $MONLOG
>                 return 0
>             else
>                 # RETRY
>                 echo `date` retrying after $TIMEOUT seconds
>                 echo `date` wget or grep ERROR monitoring $URL >> $MONLOG
>                 echo `date` retrying after $TIMEOUT seconds >> $MONLOG
>                 sleep $TIMEOUT
>                 let TRY=TRY+1
>             fi
>         done
>     fi
>     echo `date` $prog is stopped
>     echo `date` $prog is stopped >> $MONLOG
>     echo `date` end monitor op >> $MONLOG
>     return 3
>
> }
>
> # Remove possible stale log files.
> rm -f $bblog
> rm -f $errlog
>
> MAILTO="[EMAIL PROTECTED]"
>
> case "$1" in
>     start)
>         start
>         ;;
>     stop)
>         echo "`date` $1 blackboard $HOSTNAME. if this is news to you, please investigate " | mailx -s "`date` $1 blackboard $HOSTNAME" $MAILTO
>         stop
>         ;;
>     restart)
>         echo "`date` $1 blackboard $HOSTNAME. if this is news to you, please investigate " | mailx -s "`date` $1 blackboard $HOSTNAME" $MAILTO
>         restart
>         ;;
>     status)
>         status
>         exit $?
>         ;;
>     *)
>         echo $"Usage: $0 {start|stop|restart|status}"
>         exit 1
> esac
>
> # Remove the log files after they have been used. If you prefer, you may
> # implement some form of rotation here, but then you should not put the
> # files in /tmp/.
> # Remove disables for testing
> #rm -f $bblog
> #rm -f $errlog
>
> # Nobody seems to be looking at this exit code anyway.
> exit 0
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
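
Two things in the quoted status() look worth checking while debugging. wget retries up to 20 times by default, so -T $TIMEOUT on its own does not bound how long a single check can run. And EXIT=$? is read after an echo, so it holds the status of that echo rather than of grep. Below is a minimal sketch of one check, assuming the same URL and TIMEOUT values as the script; the helper names fetch_page and check_once and the PATTERN variable are hypothetical and not part of the original.

```sh
#!/bin/sh
# Sketch only: values copied from the quoted status(); fetch_page,
# check_once and PATTERN are hypothetical names used for illustration.
URL=http://localhost/webapps/login/
TIMEOUT=18
PATTERN="Blackboard Learning System"
MONLOG=${MONLOG:-/var/tmp/R_bb_monitor.log}

fetch_page() {
    # -t 1 limits wget to a single attempt; -T is the network timeout.
    wget -t 1 -T "$TIMEOUT" -q -O - "$URL"
}

check_once() {
    echo "`date` start wget op" >> "$MONLOG"
    fetch_page 2>/dev/null | grep -q "$PATTERN"
    rc=$?   # status of grep, saved before any other command runs
    echo "`date` end wget op (rc=$rc)" >> "$MONLOG"
    return $rc
}
```

Calling check_once from the existing retry loop and testing its return value directly would leave the rest of status() unchanged. Running the init script by hand with sh -x should then show whether wget is the step that hangs.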
