Hi,

On Mon, Apr 28, 2008 at 08:18:32AM +0200, Johan Hoeke wrote:
> Hi All,
> 
> We're running a 2 node heartbeat-2.1.3-3.el4.centos cluster. One of our
> resources looks like this:
> 
> crm_resource -x -r R_blackboard_init
> R_blackboard_init     (lsb:blackboard):       Started julia.uvt.nl
> raw xml:
>  <primitive class="lsb" type="blackboard" provider="heartbeat" id="R_blackboard_init">
>    <instance_attributes id="R_blackboard_init_instance_attrs">
>      <attributes/>
>    </instance_attributes>
>    <operations>
>      <op name="start" start_delay="0" disabled="false" role="Started" id="39658f92-1bce-4429-9df0-06ca53a31477" timeout="180"/>
>      <op name="monitor" role="Started" start_delay="20s" timeout="90s" interval="120s" disabled="false" id="fbf56ef2-04ab-4e9f-9626-0a6d660dc0f0" on_fail="restart"/>
>    </operations>
>    <instance_attributes id="R_blackboard_init">
>      <attributes>
>        <nvpair name="is_managed" id="R_blackboard_init-is_managed" value="true"/>
>        <nvpair name="target_role" id="R_blackboard_init-target_role" value="started"/>
>      </attributes>
>    </instance_attributes>
>    <meta_attributes id="R_blackboard_init_meta_attrs">
>      <attributes/>
>    </meta_attributes>
>  </primitive>
> 
> The status part of the attached lsb blackboard init script actually does
> something w/ wget:
> 
> echo "foo `wget -T $TIMEOUT -q $URL -O - `"| grep -q "Blackboard Learning System"
> 
> in order to check that the blackboard app is still up.
> 
> Here's my problem:
> 
> Every so often, last occurrence 60 hours after restarting the monitor
> operation, the monitor operation will timeout for no apparent reason and
> restart the application.
> 
> Apr 27 00:18:45 julia lrmd: [2362]: WARN: R_blackboard_init:monitor process (PID 19713) timed out (try 1).  Killing with signal SIGTERM (15).
> Apr 27 00:18:45 julia lrmd: [2362]: WARN: Managed R_blackboard_init:monitor process 19713 killed by signal 15 [SIGTERM - Termination (ANSI)].
> Apr 27 00:18:45 julia lrmd: [2362]: WARN: operation monitor[84] on lsb::blackboard::R_blackboard_init for client 2365, its parameters: is_managed=[true] target_role=[started] CRM_meta_interval=[120000] CRM_meta_role=[Started] CRM_meta_start_delay=[20000] CRM_meta_id=[fbf56ef2-04ab-4e9f-9626-0a6d660dc0f0] CRM_meta_on_fail=[restart] CRM_meta_timeout=[110000] crm_feature_set=[2.0] CRM_meta_disabled=[false] CRM_meta_name=[monitor] : pid [19713] timed out
> Apr 27 00:18:45 julia crmd: [2365]: ERROR: process_lrm_event: LRM operation R_blackboard_init_monitor_120000 (84) Timed Out (timeout=110000ms)
> 
> In order to debug this I have added some extra logs. This is log of a
> successful monitor operation:
> 
> Mon Apr 28 08:01:37 CEST 2008 start monitor op
> Mon Apr 28 08:01:37 CEST 2008 start wget op
> Mon Apr 28 08:01:37 CEST 2008 end wget op
> Mon Apr 28 08:01:37 CEST 2008 blackboard is running
> Mon Apr 28 08:01:37 CEST 2008 end monitor op
> (end)
> 
> This is the log of the monitor operation that timed out:
> 
> Sun Apr 27 00:16:55 CEST 2008 start monitor op
> Sun Apr 27 00:16:55 CEST 2008 start wget op
> (end)
> 
> It looks to me like the monitor operation just dies silently after
> starting the wget. Then, after 110s (timeout=110000ms in the log above),
> the timeout occurs and the restart is triggered.

> Anybody have an idea why this occurs?

The operation definitely times out, probably because wget blocks
(what's your $TIMEOUT set to?). The monitor operation didn't
disappear and wouldn't without lrmd noticing. All processes are
managed.
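
As a rough illustration of how the wget call could block longer than
$TIMEOUT: -T only limits each individual network wait, and wget retries a
timed-out request (20 tries by default). A minimal sketch, assuming the
TIMEOUT, RETRIES and URL values from the attached script; the function
names are made up for this example and are not part of the original:

```shell
#!/bin/sh
# Values copied from status() in the attached init script.
TIMEOUT=18
RETRIES=3
URL=http://localhost/webapps/login/

# Fetch the login page with a single attempt. Without --tries=1, wget may
# retry a timed-out request, waiting up to $TIMEOUT seconds on each try.
fetch_page() {
    wget -T "$TIMEOUT" --tries=1 -q -O - "$URL"
}

# Succeeds if the text on stdin contains the login page marker.
page_is_up() {
    grep -q "Blackboard Learning System"
}

# Rough upper bound for one status run if all tries fail: each try waits
# for wget and then sleeps $TIMEOUT seconds. This assumes every wget call
# really returns within $TIMEOUT, which -T alone does not guarantee.
worst_case_seconds() {
    echo $(( RETRIES * (TIMEOUT + TIMEOUT) ))
}
```

With these values worst_case_seconds prints 108, which is already close
to the 110000 ms in the log. In the monitor, "fetch_page | page_is_up"
followed immediately by EXIT=$? would capture grep's result.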

If you want to debug the script, use set -x to see what's going
on.
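
One possible way to do that without flooding the lrmd logs is to send the
trace to a file. This is only a sketch: run_traced and the trace file name
are hypothetical and do not appear in the attached script.

```shell
#!/bin/sh
# Run a command (or shell function) with tracing enabled, appending the
# trace and any stderr output to the given file.
# Usage: run_traced TRACE_FILE COMMAND [ARGS...]
run_traced() {
    trace_file=$1
    shift
    (
        exec 2>>"$trace_file"   # stderr, and therefore the trace, go to the file
        set -x                  # print each command before it is executed
        "$@"
    )
    # The subshell's exit status is the exit status of the traced command.
}
```

In the init script this could be called as
"run_traced /var/tmp/R_bb_monitor.trace status" in the status) branch,
so the last traced line shows where the monitor was when it was killed.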

Thanks,

Dejan

> thanks for your time,
> 
> regards,
> 
> Johan

> #! /bin/sh
> 
> # $Id: blackboard 25936 2008-04-22 08:11:09Z jhoeke $
> # $URL: https://its-unix-vc.uvt.nl/its-unix/group/bb7appprod/etc/init.d/blackboard $
> 
> # Init script for Blackboard. Calls official BB code to start, stop, or
> # restart the complete system. Uses hack to find out whether the official
> # stuff produced an error somewhere. In case of error, mails root the
> # complete log file.
> #
> # JH 2003-01-28 Added proper subsys code to help stopping by init.
> # JH 2003-01-20 Initial version.
> #
> # chkconfig: 35 95 05
> # description: Starts/stops/restarts the complete Blackboard server system, \
> #             using the official Blackboard start/stop commands internally.
> 
> ##### CONFIG PART #########################################################
> 
> # Some local constants. Log file places may be changed; log files are
> # not retained in between invocations of this script anyway (they are
> # always cleaned out when the script finishes).
> prog="blackboard"
> subsys="/var/lock/subsys/$prog"
> bbctl="/usr/local/blackboard/tools/admin/ServiceController.sh"
> bblog="/tmp/bbctl.log"
> errlog="/tmp/bbctl-error.log"
> 
> ##### CODE BEGINS HERE ####################################################
> 
> # Source standard function library.
> . /etc/rc.d/init.d/functions
> 
> start() {
>     if [ ! -f $subsys ]; then
>     # Start the BB system.
>       echo -n $"Starting $prog: "
>       echo "`date` start blackboard $HOSTNAME. if this is news to you, please investigate " | mailx -s "`date` start blackboard $HOSTNAME" $MAILTO
>       $bbctl services.start > $bblog 2> $errlog
>       if [ -s $errlog ]; then
>             failure $"$base startup"
>         else
>             success $"$base startup"
>             touch $subsys
>         fi
>         echo
>     else
>        echo $"cannot start $prog, $prog is already running" 
>     fi
>     return
> }
> 
> stop() {
>     # Stop the BB system.
>     echo -n $"Stopping $prog: "
>     $bbctl services.stop > $bblog 2> $errlog
>     su bbuser -c "killall -KILL java"
>     if [ -s $errlog ] ; then
>       failure $"$base shutdown"
>     else
>       success $"$base shutdown"
>       rm -f $subsys
>     fi
>     echo
>     return
> }
> 
> restart() {
>     # Restart the BB system.
>     echo -n $"Restarting $prog: "
> #    $bbctl services.restart > $bblog 2> $errlog
>     $bbctl services.stop > $bblog 2> $errlog
>     su bbuser -c "killall -KILL java"
>     $bbctl services.start > $bblog 2> $errlog
>     if [ -s $errlog ] ; then
>       failure $"$base restart"
>     else
>       success $"$base restart"
>       touch $subsys
>     fi
>     echo
>     return
> }
> 
> status() {
> # 2008-02-21 jhoeke
> # ugly hack to get 3 retries before blackboard is declared dead
> # this is used in the heartbeat monitor to restart heartbeat after 3 tries
>      URL=http://localhost/webapps/login/
>      TIMEOUT=18
>      RETRIES=3
>      TRY=1
>      MONLOG=/var/tmp/R_bb_monitor.log
>      if [ -f $MONLOG ] && [ `wc -l $MONLOG | awk '{print $1}'` -ne 5 ]; then
>         # something other than normal in the log
>       cat $MONLOG | mailx -s "`date` monlog weirdness" [EMAIL PROTECTED]
>         mv $MONLOG $MONLOG.`date +%Y%m%d%H%M`
>      fi
>      echo `date` start monitor op > $MONLOG
>      if [ -f $subsys ]; then
>          while [ $TRY -le $RETRIES ]; do
>              echo `date` start wget op >> $MONLOG
>              echo "foo `wget -T $TIMEOUT -q $URL -O - `"| grep -q "Blackboard Learning System"
>              # Capture grep's status right away; a later echo would overwrite $?.
>              EXIT=$?
>              echo `date` end wget op >> $MONLOG
>              if [ $EXIT -eq 0 ]; then
>                  echo `date` $prog is running
>                  echo `date` $prog is running >> $MONLOG
>                  echo `date` end monitor op>> $MONLOG
>                   return 0
>               else
>             # RETRY
>                   echo `date` retrying after $TIMEOUT seconds
>                   echo `date` wget or grep ERROR monitoring $URL >> $MONLOG
>                   echo `date` retrying after $TIMEOUT seconds >> $MONLOG
>                   sleep $TIMEOUT
>                   let TRY=TRY+1
>              fi
>          done
>      fi
>      echo `date` $prog is stopped
>      echo `date` $prog is stopped >> $MONLOG
>      echo `date` end monitor op >> $MONLOG
>      return 3
> 
> }
> 
> # Remove possible stale log files.
> rm -f $bblog
> rm -f $errlog
> 
> MAILTO="[EMAIL PROTECTED]"
> 
> case "$1" in
>       start)
>               start
>       ;;      
>       stop)
>               echo "`date` $1 blackboard $HOSTNAME. if this is news to you, please investigate " | mailx -s "`date` $1 blackboard $HOSTNAME" $MAILTO
>               stop
>       ;;
>       restart)
>               echo "`date` $1 blackboard $HOSTNAME. if this is news to you, please investigate " | mailx -s "`date` $1 blackboard $HOSTNAME" $MAILTO
>               restart
>       ;;
>       status)
>               status
>               exit $?
>       ;;
>       *)
>               echo $"Usage: $0 {start|stop|restart|status}"
>               exit 1
> esac
> 
> # Remove the log files after they have been used. If you prefer, you may
> # implement some form of rotation here, but then you should not put the
> # files in /tmp/.
> # Removal disabled for testing
> #rm -f $bblog
> #rm -f $errlog
> 
> # Nobody seems to be looking at this exit code anyway.
> exit 0




> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems