Hi All,

We're running a two-node heartbeat-2.1.3-3.el4.centos cluster. One of our
resources looks like this:

crm_resource -x -r R_blackboard_init
R_blackboard_init       (lsb:blackboard):       Started julia.uvt.nl
raw xml:
 <primitive class="lsb" type="blackboard" provider="heartbeat"
id="R_blackboard_init">
   <instance_attributes id="R_blackboard_init_instance_attrs">
     <attributes/>
   </instance_attributes>
   <operations>
     <op name="start" start_delay="0" disabled="false" role="Started"
id="39658f92-1bce-4429-9df0-06ca53a31477" timeout="180"/>
     <op name="monitor" role="Started" start_delay="20s" timeout="90s"
interval="120s" disabled="false"
id="fbf56ef2-04ab-4e9f-9626-0a6d660dc0f0" on_fail="restart"/>
   </operations>
   <instance_attributes id="R_blackboard_init">
     <attributes>
       <nvpair name="is_managed" id="R_blackboard_init-is_managed"
value="true"/>
       <nvpair name="target_role" id="R_blackboard_init-target_role"
value="started"/>
     </attributes>
   </instance_attributes>
   <meta_attributes id="R_blackboard_init_meta_attrs">
     <attributes/>
   </meta_attributes>
 </primitive>

The status part of the attached LSB blackboard init script fetches the
login page with wget and greps it for the product name:

echo "foo `wget -T $TIMEOUT -q $URL -O - `"| grep -q "Blackboard Learning System"

in order to check that the blackboard app is still up.
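
A minimal sketch of that check follows. So that it runs without a
Blackboard server, a canned string stands in for the wget output; the
function name `page_is_up` and the sample pages are illustrative
assumptions, not part of the attached script. It shows that `$?` has to
be saved immediately after the pipeline, since any later command (a
logging `echo`, for instance) replaces it.

```shell
#!/bin/sh
# Sketch only: page_is_up and the sample pages are hypothetical stand-ins.

# Succeeds when the page text on stdin contains the Blackboard banner.
page_is_up() {
    grep -q "Blackboard Learning System"
}

good_page='<title>Blackboard Learning System</title>'
bad_page='<title>503 Service Unavailable</title>'

# In the real check the text would come from: wget -T $TIMEOUT -q $URL -O -
echo "foo $good_page" | page_is_up
up=$?                      # saved before anything else can overwrite $?
echo "logged after the check"

echo "foo $bad_page" | page_is_up
down=$?

echo "good page -> $up, bad page -> $down"
```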

Here's my problem:

Every so often, most recently 60 hours after the monitor operation was
restarted, the monitor operation times out for no apparent reason and
the application is restarted.

Apr 27 00:18:45 julia lrmd: [2362]: WARN: R_blackboard_init:monitor process (PID 19713) timed out (try 1).  Killing with signal SIGTERM (15).
Apr 27 00:18:45 julia lrmd: [2362]: WARN: Managed R_blackboard_init:monitor process 19713 killed by signal 15 [SIGTERM - Termination (ANSI)].
Apr 27 00:18:45 julia lrmd: [2362]: WARN: operation monitor[84] on lsb::blackboard::R_blackboard_init for client 2365, its parameters: is_managed=[true] target_role=[started] CRM_meta_interval=[120000] CRM_meta_role=[Started] CRM_meta_start_delay=[20000] CRM_meta_id=[fbf56ef2-04ab-4e9f-9626-0a6d660dc0f0] CRM_meta_on_fail=[restart] CRM_meta_timeout=[110000] crm_feature_set=[2.0] CRM_meta_disabled=[false] CRM_meta_name=[monitor] : pid [19713] timed out
Apr 27 00:18:45 julia crmd: [2365]: ERROR: process_lrm_event: LRM operation R_blackboard_init_monitor_120000 (84) Timed Out (timeout=110000ms)

In order to debug this I have added some extra logging to the script.
This is the log of a successful monitor operation:

Mon Apr 28 08:01:37 CEST 2008 start monitor op
Mon Apr 28 08:01:37 CEST 2008 start wget op
Mon Apr 28 08:01:37 CEST 2008 end wget op
Mon Apr 28 08:01:37 CEST 2008 blackboard is running
Mon Apr 28 08:01:37 CEST 2008 end monitor op
(end)

This is the log of the monitor operation that timed out:

Sun Apr 27 00:16:55 CEST 2008 start monitor op
Sun Apr 27 00:16:55 CEST 2008 start wget op
(end)

It looks to me like the monitor operation just dies silently after
starting the wget. Then, 110s later (the 90s timeout plus the 20s
start_delay), the timeout occurs and the restart is triggered.
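
The timing can be cross-checked from the two logs. The sketch below
assumes that lrmd counts the `start_delay` (20s) together with the
operation `timeout` (90s), and compares that sum with the gap between
the last line of the monitor log (00:16:55) and the lrmd kill message
(00:18:45). The helper `to_seconds` is illustrative.

```shell
#!/bin/sh
# Sketch only: to_seconds is a hypothetical helper; values are copied from the logs.

# Convert HH:MM:SS to seconds since midnight.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

start_delay=20      # from the monitor op definition
op_timeout=90       # from the monitor op definition
effective=$((start_delay + op_timeout))

started=$(to_seconds 00:16:55)   # "start wget op" in the monitor log
killed=$(to_seconds 00:18:45)    # lrmd "timed out" message
elapsed=$((killed - started))

echo "effective timeout: ${effective}s, elapsed before kill: ${elapsed}s"
```

Both values come to 110, which matches the `CRM_meta_timeout=[110000]`
reported by lrmd.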

Anybody have an idea why this occurs?

thanks for your time,

regards,

Johan
#! /bin/sh

# $Id: blackboard 25936 2008-04-22 08:11:09Z jhoeke $
# $URL: https://its-unix-vc.uvt.nl/its-unix/group/bb7appprod/etc/init.d/blackboard $

# Init script for Blackboard. Calls official BB code to start, stop, or
# restart the complete system. Uses hack to find out whether the official
# stuff produced an error somewhere. In case of error, mails root the
# complete log file.
#
# JH 2003-01-28 Added proper subsys code to help stopping by init.
# JH 2003-01-20 Initial version.
#
# chkconfig: 35 95 05
# description: Starts/stops/restarts the complete Blackboard server system, \
#             using the official Blackboard start/stop commands internally.

##### CONFIG PART #########################################################

# Some local constants. Log file places may be changed; log files are
# not retained in between invocations of this script anyway (they are
# always cleaned out when the script finishes).
prog="blackboard"
subsys="/var/lock/subsys/$prog"
bbctl="/usr/local/blackboard/tools/admin/ServiceController.sh"
bblog="/tmp/bbctl.log"
errlog="/tmp/bbctl-error.log"

##### CODE BEGINS HERE ####################################################

# Source standard function library.
. /etc/rc.d/init.d/functions

start() {
    if [ ! -f $subsys ]; then
    # Start the BB system.
        echo -n $"Starting $prog: "
        echo "`date` start blackboard $HOSTNAME. if this is news to you, please investigate " | mailx -s "`date` start blackboard $HOSTNAME" $MAILTO
        $bbctl services.start > $bblog 2> $errlog
        if [ -s $errlog ]; then
            failure $"$base startup"
        else
            success $"$base startup"
            touch $subsys
        fi
        echo
    else
       echo $"cannot start $prog, $prog is already running" 
    fi
    return
}

stop() {
    # Stop the BB system.
    echo -n $"Stopping $prog: "
    $bbctl services.stop > $bblog 2> $errlog
    su bbuser -c "killall -KILL java"
    if [ -s $errlog ] ; then
      failure $"$base shutdown"
    else
      success $"$base shutdown"
      rm -f $subsys
    fi
    echo
    return
}

restart() {
    # Restart the BB system.
    echo -n $"Restarting $prog: "
#    $bbctl services.restart > $bblog 2> $errlog
    $bbctl services.stop > $bblog 2> $errlog
    su bbuser -c "killall -KILL java"
    $bbctl services.start > $bblog 2> $errlog
    if [ -s $errlog ] ; then
      failure $"$base restart"
    else
      success $"$base restart"
      touch $subsys
    fi
    echo
    return
}

status() {
# 2008-02-21 jhoeke
# ugly hack to get 3 retries before blackboard is declared dead
# this is used in the heartbeat monitor to restart heartbeat after 3 tries
     URL=http://localhost/webapps/login/
     TIMEOUT=18
     RETRIES=3
     TRY=1
     MONLOG=/var/tmp/R_bb_monitor.log
     if [ -f $MONLOG ] && [ `wc -l $MONLOG | awk '{print $1}'` -ne 5 ]; then
        # something other than normal in the log
        cat $MONLOG | mailx -s "`date` monlog weirdness" [EMAIL PROTECTED]
        mv $MONLOG $MONLOG.`date +%Y%m%d%H%M`
     fi
     echo `date` start monitor op > $MONLOG
     if [ -f $subsys ]; then
         while [ $TRY -le $RETRIES ]; do
             echo `date` start wget op >> $MONLOG
             echo "foo `wget -T $TIMEOUT -q $URL -O - `"| grep -q "Blackboard Learning System"
             # Save grep's exit status before the next echo overwrites $?.
             EXIT=$?
             echo `date` end wget op >> $MONLOG
             if [ $EXIT -eq 0 ]; then
                 echo `date` $prog is running
                 echo `date` $prog is running >> $MONLOG
                 echo `date` end monitor op>> $MONLOG
                  return 0
              else
            # RETRY
                  echo `date` retrying after $TIMEOUT seconds
                  echo `date` wget or grep ERROR monitoring $URL >> $MONLOG
                  echo `date` retrying after $TIMEOUT seconds >> $MONLOG
                  sleep $TIMEOUT
                  let TRY=TRY+1
             fi
         done
     fi
     echo `date` $prog is stopped
     echo `date` $prog is stopped >> $MONLOG
     echo `date` end monitor op >> $MONLOG
     return 3

}

# Remove possible stale log files.
rm -f $bblog
rm -f $errlog

MAILTO="[EMAIL PROTECTED]"

case "$1" in
        start)
                start
        ;;      
        stop)
                echo "`date` $1 blackboard $HOSTNAME. if this is news to you, please investigate " | mailx -s "`date` $1 blackboard $HOSTNAME" $MAILTO
                stop
        ;;
        restart)
                echo "`date` $1 blackboard $HOSTNAME. if this is news to you, please investigate " | mailx -s "`date` $1 blackboard $HOSTNAME" $MAILTO
                restart
        ;;
        status)
                status
                exit $?
        ;;
        *)
                echo $"Usage: $0 {start|stop|restart|status}"
                exit 1
esac

# Remove the log files after they have been used. If you prefer, you may
# implement some form of rotation here, but then you should not put the
# files in /tmp/.
# Removal disabled for testing.
#rm -f $bblog
#rm -f $errlog

# Nobody seems to be looking at this exit code anyway.
exit 0
