Hi Lalith et al,

You should also look at /var/adm/messages on all nodes to see what 
specifically is going on with your resource.

You can increase the level of debug information by changing the 
following line in /etc/syslog.conf on all nodes:

*.err;kern.debug;daemon.notice;mail.crit        /var/adm/messages

to

*.err;kern.debug;daemon.debug;mail.crit        /var/adm/messages

Afterwards, restart syslog by running:

# svcadm restart system-log
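
To check that the new level is really active, you can for example log a 
test message at daemon.debug and see whether it shows up in the file:

# svcs system-log
# logger -p daemon.debug "debug level test"
# tail -1 /var/adm/messages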

Further, in your <agent directory>/etc/config you can change

DEBUG=

to

DEBUG=ALL
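
If you do not want to edit the file on every node by hand, a one-liner 
along these lines should do it (the /opt/SUNCsccron path is taken from 
your log output below - adjust it to your agent directory):

# perl -pi -e 's/^DEBUG=.*/DEBUG=ALL/' /opt/SUNCsccron/etc/config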

Since you use the GDS, check that the start command really leaves a 
process running for PMF to monitor. If the only PIDs left behind are 
child processes, have a look at the Child_mon_level property.
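
For example, something like the following would restrict PMF to 
monitoring only the first level of forked children (the resource name is 
taken from your output below; see the r_properties(5) man page for the 
exact semantics of each level, and note that depending on the property's 
tunability you may have to disable the resource first):

# clresource set -p Child_mon_level=1 SUNCsccron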

If the start method does not leave a process behind, you need to disable 
the PMF action script; see
http://src.opensolaris.org/source/xref/ha-utilities/GDS-template/SUNCscxxx/bin/functions#222
http://src.opensolaris.org/source/xref/ha-utilities/GDS-template/SUNCscxxx/bin/functions#326
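
To see whether PMF is still holding a tag for your resource - and 
therefore keeps restarting it - you can list the registered tags; if I 
remember the format correctly, GDS tags look like 
<RG-name>,<resource-name>,0.svc:

# pmfadm -L | grep SUNCsccron

If a stale tag turns out to be the problem, pmfadm -s <tag> takes it 
(and the processes under it) out of PMF's control.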

If you see "Failed to saty up" messages from pmf, then it would be an 
indication. Of course it is hard to tell without seeing your code and 
knowing what you want to achieve.
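
A quick way to check for those messages on each node:

# grep -i "failed to stay up" /var/adm/messages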

The success messages in the log files you are looking at only indicate 
that the methods returned 0 (= success) - but this does not mean they 
ran as desired.
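
So after the next clrg online I would verify by hand that the resource 
really did its job, e.g. that it reports online and that root's crontab 
looks the way your start method should have left it:

# clresource status SUNCsccron
# crontab -l | head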

Greets
       Thorsten

Lalith Suresh wrote:
> Hi all,
> 
> I'm almost done with the coding part as far as HA-Cron is concerned, and 
> I've tested the start, stop, probe and validate scripts individually. 
> They seemed to work fine. I then ran Make_Package to build the package 
> and installed it on my single-node cluster. But whenever I start the RG 
> containing HA-Cron, it goes online for an instant, then goes offline. I 
> have no clue what's going on. Please help. Here are some messages 
> that'll help.
> 
> The first time I run this, I don't get any errors:
> 
> bash-3.2# clrg online test
> 
> About 3 seconds later:
> 
> bash-3.2# cluster status
> 
> === Cluster Nodes ===
> 
> --- Node Status ---
> 
> Node Name                                       Status
> ---------                                       ------
> irule                                           Online
> 
> 
> === Cluster Transport Paths ===
> 
> Endpoint1               Endpoint2               Status
> ---------               ---------               ------
> 
> 
> === Cluster Quorum ===
> 
> --- Quorum Votes Summary ---
> 
>             Needed   Present   Possible
>             ------   -------   --------
>             1        1         1
> 
> 
> --- Quorum Votes by Node ---
> 
> Node Name       Present       Possible       Status
> ---------       -------       --------       ------
> irule           1             1              Online
> 
> 
> === Cluster Device Groups ===
> 
> --- Device Group Status ---
> 
> Device Group Name     Primary     Secondary     Status
> -----------------     -------     ---------     ------
> 
> 
> --- Spare, Inactive, and In Transition Nodes ---
> 
> Device Group Name   Spare Nodes   Inactive Nodes   In Transition Nodes
> -----------------   -----------   --------------   --------------------
> 
> 
> --- Multi-owner Device Group Status ---
> 
> Device Group Name           Node Name           Status
> -----------------           ---------           ------
> 
> === Cluster Resource Groups ===
> 
> Group Name       Node Name       Suspended      State
> ----------       ---------       ---------      -----
> test             irule           No             Offline
> 
> 
> === Cluster Resources ===
> 
> Resource Name       Node Name      State        Status Message
> -------------       ---------      -----        --------------
> node                irule          Offline      Offline - 
> LogicalHostname offline.
> 
> SUNCsccron          irule          Offline      Offline
> 
> 
> === Cluster DID Devices ===
> 
> Device Instance               Node              Status
> ---------------               ----              ------
> /dev/did/rdsk/d1              irule             Ok
> 
> 
> === Zone Clusters ===
> 
> --- Zone Cluster Status ---
> 
> Name    Node Name    Zone HostName    Status    Zone Status
> 
> 
> Here's the log file for that run:
> 
> /var/cluster/logs/DS/test/SUNCsccron/start_stop_log.txt
> 
> 01/31/2009 16:47:38 irule STOP-INFO> Stop succeeded 
> [/opt/SUNCsccron/bin/control_cron -R SUNCsccron -G test -C 
> /var/spool/cron/crontabs/xyz stop].
> 01/31/2009 16:47:38 irule STOP-INFO> Successfully stopped the application
> 01/31/2009 16:47:38 irule --INFO> Validate has been executed 
> [/opt/SUNCsccron/bin/control_cron -R SUNCsccron -G test -C 
> /var/spool/cron/crontabs/xyz validate exited with status 0]
> 01/31/2009 16:47:38 irule START-INFO> Start succeeded. 
> [/opt/SUNCsccron/bin/control_cron -R SUNCsccron -G test -C 
> /var/spool/cron/crontabs/xyz start]
> 01/31/2009 16:47:38 irule STOP-INFO> Stop succeeded 
> [/opt/SUNCsccron/bin/control_cron -R SUNCsccron -G test -C 
> /var/spool/cron/crontabs/xyz stop].
> 01/31/2009 16:47:38 irule STOP-INFO> Successfully stopped the application
> 01/31/2009 16:47:39 irule --INFO> Validate has been executed 
> [/opt/SUNCsccron/bin/control_cron -R SUNCsccron -G test -C 
> /var/spool/cron/crontabs/xyz validate exited with status 0]
> 01/31/2009 16:47:39 irule START-INFO> Start succeeded. 
> [/opt/SUNCsccron/bin/control_cron -R SUNCsccron -G test -C 
> /var/spool/cron/crontabs/xyz start]
> 01/31/2009 16:47:39 irule STOP-INFO> Stop succeeded 
> [/opt/SUNCsccron/bin/control_cron -R SUNCsccron -G test -C 
> /var/spool/cron/crontabs/xyz stop].
> 01/31/2009 16:47:39 irule STOP-INFO> Successfully stopped the application
> 
> But when I run it again, I get this:
> 
> bash-3.2# clrg online test
> clrg:  (C748634) Resource group test failed to start on chosen node and 
> might fail over to other node(s)
> clrg:  (C135343) No primary node could be found for resource group test; 
> it remains offline
> 
> 
> And another weird thing is that even though the log file says it 
> successfully stopped the application, it isn't producing the expected 
> results either. (Once HA-Cron stops, it's supposed to restore the root 
> crontab file to the way it was before the RG was started, using a 
> backup. The backup is removed, but the original crontab isn't being 
> restored.)
> 
> 
> -- 
> Lalith Suresh
> Department of Computer Engineering
> Malaviya National Institute of Technology, Jaipur
> +91-9982190365, lalithsuresh.wordpress.com

