hi,

I've created a short script for our MegaRAID controller:

===================
#!/bin/bash
# Flush and load controller configuration

PATH="/usr/sbin:/usr/bin:/usr/sfw/bin:/opt/SUNWcluster/bin:/usr/cluster/bin:/opt/MegaRAID/CLI:/opt/csw/sbin:/opt/csw/bin"
MEGA="/opt/MegaRAID/CLI/MegaCli"
CONFIG="/etc/megaraid/cfg/megaraid-config.conf"
CNTRL="-a0"

case "$1" in

        import)
        # restore the saved config; the disks reappear in format(1M)
                $MEGA -CfgForeign -Scan $CNTRL
                $MEGA -CfgForeign -Clear $CNTRL
                $MEGA -CfgClr $CNTRL
                sleep 3
                $MEGA -CfgRestore -f $CONFIG $CNTRL
                # wait for the controller to settle
                sleep 10
        ;;
        
        clear)
        # clear the config; all disks disappear from format(1M)
                $MEGA -CfgClr $CNTRL
                sleep 2
        ;;

        *)
                echo "Usage: $0 import|clear" >&2
                exit 1
        ;;
esac

exit 0
==================
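
Since I first suspected the $1 handling, here is how the case dispatch can be
checked on its own, with MegaCli stubbed out as echo (the stub and the
function wrapper are my addition, not part of the real script):

```shell
#!/bin/sh
# Sketch: same case dispatch as above, but MegaCli is replaced by
# "echo MegaCli" so nothing touches the controller. The printed lines
# show exactly which command each argument would run.
mega_dispatch() {
    MEGA="echo MegaCli"
    CNTRL="-a0"
    case "$1" in
        import) $MEGA -CfgRestore -f /etc/megaraid/cfg/megaraid-config.conf $CNTRL ;;
        clear)  $MEGA -CfgClr $CNTRL ;;
        *)      echo "Usage: import|clear" >&2; return 1 ;;
    esac
}

mega_dispatch import
mega_dispatch clear
```

Both arguments dispatch to the expected MegaCli invocation, so the case
statement itself is not the problem.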

Create the RG:

# clrg create -n iscsihead-m,iscsihead-s megaraid-switch-rg

Create the RS:

# clrs create -g megaraid-switch-rg -t SUNW.gds -p \
Start_command="/root/bin/megaraid-config import" -p \
Stop_command="/root/bin/megaraid-config clear" -p \
Probe_command=/bin/true -p Network_aware=false -p Log_level=ERR \
megaraid-switch-rs


# clrg online -M megaraid-switch-rg
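
While the group comes online, I watch the resource with the standard status
commands and the syslog:

# clrg status megaraid-switch-rg
# clrs status megaraid-switch-rs
# tail -f /var/adm/messages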

Adding the disks (48 of them) takes up to ~30 seconds, but for some
unknown reason the cluster stops and restarts the resource:

iscsihead-m -> master node
iscsihead-s -> failover node

=========================================
[...]

Jan  7 13:33:03 iscsihead-s Cluster.RGM.global.rgmd: [ID 515159
daemon.notice] method <gds_svc_start> completed successfully for
resource <megaraid-switch-rs>, resource group <megaraid-switch-rg>, node
<iscsihead-s>, time used: 3% of timeout <300 seconds>
Jan  7 13:33:03 iscsihead-s Cluster.RGM.global.rgmd: [ID 443746
daemon.notice] resource megaraid-switch-rs state on node iscsihead-s
change to R_ONLINE_UNMON
Jan  7 13:33:03 iscsihead-s Cluster.RGM.global.rgmd: [ID 784560
daemon.notice] resource megaraid-switch-rs status on node iscsihead-s
change to R_FM_ONLINE
Jan  7 13:33:03 iscsihead-s Cluster.RGM.global.rgmd: [ID 922363
daemon.notice] resource megaraid-switch-rs status msg on node
iscsihead-s change to <>
Jan  7 13:33:03 iscsihead-s Cluster.RGM.global.rgmd: [ID 224900
daemon.notice] launching method <gds_monitor_start> for resource
<megaraid-switch-rs>, resource group <megaraid-switch-rg>, node
<iscsihead-s>, timeout <300> seconds
Jan  7 13:33:03 iscsihead-s Cluster.RGM.global.rgmd: [ID 515159
daemon.notice] method <gds_monitor_start> completed successfully for
resource <megaraid-switch-rs>, resource group <megaraid-switch-rg>, node
<iscsihead-s>, time used: 0% of timeout <300 seconds>
Jan  7 13:33:03 iscsihead-s Cluster.RGM.global.rgmd: [ID 443746
daemon.notice] resource megaraid-switch-rs state on node iscsihead-s
change to R_ONLINE
Jan  7 13:33:03 iscsihead-s Cluster.RGM.global.rgmd: [ID 529407
daemon.notice] resource group megaraid-switch-rg state on node
iscsihead-s change to RG_ONLINE
Jan  7 13:33:07 iscsihead-s Cluster.PMF.pmfd: [ID 887656 daemon.notice]
Process: tag="megaraid-switch-rg,megaraid-switch-rs,0.svc",
cmd="/bin/ksh -c /root/bin/megaraid-config import", Failed to stay up.
Jan  7 13:33:07 iscsihead-s Cluster.RGM.global.rgmd: [ID 784560
daemon.notice] resource megaraid-switch-rs status on node iscsihead-s
change to R_FM_FAULTED
Jan  7 13:33:07 iscsihead-s Cluster.RGM.global.rgmd: [ID 922363
daemon.notice] resource megaraid-switch-rs status msg on node
iscsihead-s change to <Service daemon not running.>
Jan  7 13:33:07 iscsihead-s
SC[,SUNW.gds:6,megaraid-switch-rg,megaraid-switch-rs,gds_probe]: [ID
423137 daemon.error] A resource restart attempt on resource
megaraid-switch-rs in resource group megaraid-switch-rg has been blocked
because the number of restarts within the past Retry_interval (370
seconds) would exceed Retry_count (2)
Jan  7 13:33:07 iscsihead-s
SC[,SUNW.gds:6,megaraid-switch-rg,megaraid-switch-rs,gds_probe]: [ID
874133 daemon.notice] Issuing a failover request because the application
exited.
Jan  7 13:33:07 iscsihead-s Cluster.RGM.global.rgmd: [ID 494478
daemon.notice] resource megaraid-switch-rs in resource group
megaraid-switch-rg has requested failover of the resource group on
iscsihead-s.
Jan  7 13:33:07 iscsihead-s Cluster.RGM.global.rgmd: [ID 423291
daemon.error] RGM isn't failing resource group <megaraid-switch-rg> off
of node <iscsihead-s>, because there are no other current or potential
masters
Jan  7 13:33:07 iscsihead-s Cluster.RGM.global.rgmd: [ID 702911
daemon.error] Resource <megaraid-switch-rs> of Resource Group
<megaraid-switch-rg> failed pingpong check on node <iscsihead-m>.  The
resource group will not be mastered by that node.
Jan  7 13:33:07 iscsihead-s
SC[,SUNW.gds:6,megaraid-switch-rg,megaraid-switch-rs,gds_probe]: [ID
969827 daemon.error] Failover attempt has failed.
Jan  7 13:33:07 iscsihead-s
SC[,SUNW.gds:6,megaraid-switch-rg,megaraid-switch-rs,gds_probe]: [ID
670283 daemon.notice] Issuing a resource restart request because the
application exited.
Jan  7 13:33:07 iscsihead-s Cluster.RGM.global.rgmd: [ID 494478
daemon.notice] resource megaraid-switch-rs in resource group
megaraid-switch-rg has requested restart of the resource on iscsihead-s.
Jan  7 13:33:07 iscsihead-s Cluster.RGM.global.rgmd: [ID 471587
daemon.notice] Resource <megaraid-switch-rs> is restarting too often on
<iscsihead-s>. Sleeping for <15> seconds.

====================================

I also thought that maybe reading $1 in my case statement was the problem,
so I split the script into two files so that I don't need $1, but the
result is the same: first the script starts and some disks are added, then,
while the MegaRAID CLI is still running (and adding more disks), the
cluster calls the stop command. ...
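
One thing I'm now wondering about (just a guess from the "Failed to stay
up" message in the log): does PMF expect the process tree started by
Start_command to stay resident? My import script exits as soon as the
import is done. An untested sketch of a wrapper that would leave a process
alive for PMF to monitor:

===================
#!/bin/sh
# hypothetical wrapper: run the one-shot import, then stay resident
/root/bin/megaraid-config import || exit 1
while :; do sleep 3600; done
===================

If that is the issue, Start_command would point at this wrapper instead of
the import script, and Stop_command would additionally have to kill it.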


method <gds_svc_start> for resource <megaraid-switch-rs>, resource group
<megaraid-switch-rg>, node <iscsihead-s>, timeout <300> seconds
Jan  7 14:33:32 iscsihead-s Cluster.RGM.global.rgmd: [ID 784560
daemon.notice] resource megaraid-switch-rs status on node iscsihead-s
change to R_FM_UNKNOWN
Jan  7 14:33:32 iscsihead-s Cluster.RGM.global.rgmd: [ID 922363
daemon.notice] resource megaraid-switch-rs status msg on node
iscsihead-s change to <Starting>

Executing the script by hand works fine:

# clrg offline megaraid-switch-rg
root(iscsihead-s):~# /bin/ksh -c /root/bin/megaraid-import
~ 20 seconds later ...
root(iscsihead-s):~# echo $?
0


where is my error? 

cu denny


_______________________________________________
ha-clusters-discuss mailing list
ha-clusters-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/ha-clusters-discuss
