Hi,

Our system setup:

Heartbeat 3.0.3
DRBD (manages the file system; it is one of the resources managed by the CRM)
Red Hat Linux
Pacemaker

We have built an application on top of Linux-HA that lets users configure
a cluster by giving the IP addresses of the nodes and perform operations
such as Restart System, Change Host Names, Resolve Split-Brain, etc.
In our application, we ran into a problem when we run "heartbeat restart"
for some operation and the user then does "Restart System", which
internally runs the command "shutdown -r now". I believe this is due to the
heartbeat LSB script; I have explained the scenario below.

Problem:

In the heartbeat LSB script, restart neither removes nor touches the
heartbeat lock file.

On "heartbeat start", the LSB script starts heartbeat and touches the
/var/lock/subsys/heartbeat lock file.

On "heartbeat stop", the LSB script stops heartbeat and removes the lock
file at /var/lock/subsys/heartbeat.

On "heartbeat restart", the LSB script stops heartbeat and starts it again,
but it DOES NOT remove or touch the lock file.

We call "heartbeat restart" instead of "heartbeat start" from our script
because we are not sure whether heartbeat is already running. When
"heartbeat restart" is called while heartbeat is NOT running, the LSB
script tries to stop it, finds nothing to stop, and then just starts
heartbeat. After starting, however, the lock file is not touched (because
the restart branch of the LSB script never touches it). So heartbeat is now
running on the system (this can be verified by looking for the heartbeat
process or with the "heartbeat status" command) but there is no
/var/lock/subsys/heartbeat lock file. This lock file is used by the Red Hat
init scripts at shutdown to decide which services they have to stop
("shutdown -r now"). When we run "shutdown -r now", the system assumes
heartbeat is not running (because there is no lock file) and does not stop
it properly. When the node comes back up, heartbeat is started but its state
is not correct (because it was not stopped properly).
As a result, this node is identified as Primary even though the former
Secondary node has become Primary in the meantime, and this causes
split-brain.
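
The inconsistent state described above (heartbeat running, lock file
missing) can be detected before a reboot. The following is only a minimal
sketch: lock_state is a hypothetical helper, not part of heartbeat, and the
status command is passed in as an argument because its exit code is an
assumption that should be checked on the installed version.

```shell
#!/bin/sh
# lock_state STATUS_CMD LOCKFILE
# Hypothetical helper: compares what the status command reports with the
# presence of the subsys lock file and prints one of four states.
# Assumption: STATUS_CMD exits 0 when the service is running, non-zero otherwise.
lock_state() {
    status_cmd=$1
    lockfile=$2

    if $status_cmd >/dev/null 2>&1; then
        running=yes
    else
        running=no
    fi

    if [ -e "$lockfile" ]; then
        locked=yes
    else
        locked=no
    fi

    case "$running/$locked" in
        yes/yes) echo "ok-running" ;;
        no/no)   echo "ok-stopped" ;;
        yes/no)  echo "running-without-lock" ;;  # the case from this report
        no/yes)  echo "stale-lock" ;;
    esac
}
```

It could be called as
lock_state "/etc/init.d/heartbeat status" /var/lock/subsys/heartbeat
and a result of "running-without-lock" would mean that a following
"shutdown -r now" will not stop heartbeat cleanly.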

So I believe "heartbeat restart" should do exactly what "heartbeat stop"
followed by "heartbeat start" does, which is not the case now.
Can you please let me know whether my understanding is correct and whether
this is a bug in the heartbeat LSB script? Thanks for looking into it.
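
Until the script itself changes, a caller-side workaround might be to avoid
"restart" and run "stop" followed by "start", since those two branches do
maintain the lock file. A minimal sketch, where safe_restart and its
optional wait argument are hypothetical names introduced for illustration:

```shell
#!/bin/sh
# safe_restart INIT_SCRIPT [WAIT_SECONDS]
# Hypothetical wrapper: runs "stop" then "start" so that the lock file
# handling of those two branches is used instead of the restart branch.
safe_restart() {
    init=$1
    wait_secs=${2:-0}

    # stop may fail when the service is not running; that is not an error here
    "$init" stop || true

    # optional pause, e.g. to give the peer time to take over resources
    sleep "$wait_secs"

    # the exit status of start becomes the exit status of the wrapper
    "$init" start
}
```

For example, safe_restart /etc/init.d/heartbeat 10 would stop heartbeat,
wait 10 seconds, and start it again. The wait value is an assumption; the
restart branch of the script waits for deadtime plus 10 seconds.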

I have given the relevant code from the heartbeat LSB script below:

File: /etc/init.d/heartbeat

  start)
        RunStartStop pre-start
        StartHA
        RC=$?
        echo
        if
          [ $RC -eq 0 ]
        then
          [ ! -d $LOCKDIR ] && mkdir -p $LOCKDIR
          touch $LOCKDIR/$SUBSYS
        fi
        RunStartStop post-start $RC
        ;;

  stop)
        RunStartStop "pre-stop"
        StopHA
        RC=$?
        echo
        if
          [ $RC -eq 0 ]
        then
          rm -f $LOCKDIR/$SUBSYS
        fi
        RunStartStop post-stop $RC
        ;;

  restart)
        sleeptime=`ha_parameter deadtime`
        StopHA
        echo
        echo -n "Waiting to allow resource takeover to complete:"
        sleep $sleeptime
        sleep 10 # allow resource takeover to complete (hopefully).
        echo_success
        echo
        StartHA
        echo
        ;;
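
For comparison, here is one way the restart logic could keep the lock file
in step with start) and stop). This is only a sketch and not a tested patch
against the real script: restart_with_lock and EXTRA_WAIT are hypothetical
names, and StopHA, StartHA and ha_parameter are replaced by stand-ins so
the fragment runs on its own.

```shell
#!/bin/sh
# Stand-ins for names the real init script already defines (assumptions,
# present only so this sketch is self-contained).
LOCKDIR=${LOCKDIR:-/var/lock/subsys}
SUBSYS=heartbeat
EXTRA_WAIT=${EXTRA_WAIT:-10}
StopHA()       { return 0; }
StartHA()      { return 0; }
ha_parameter() { echo 0; }

# restart_with_lock: same sequence as the restart) branch, plus the lock
# file handling that start) and stop) perform.
restart_with_lock() {
    sleeptime=`ha_parameter deadtime`

    StopHA
    RC=$?
    if [ $RC -eq 0 ]; then
        rm -f "$LOCKDIR/$SUBSYS"            # as in stop)
    fi

    sleep "$sleeptime"
    sleep "$EXTRA_WAIT"                     # allow resource takeover to complete

    StartHA
    RC=$?
    if [ $RC -eq 0 ]; then
        [ ! -d "$LOCKDIR" ] && mkdir -p "$LOCKDIR"
        touch "$LOCKDIR/$SUBSYS"            # as in start)
    fi
    return $RC
}
```

With this shape, a restart issued while heartbeat is not running still ends
with the lock file present whenever the start succeeds.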
