Hi all,

Over the weekend we twice lost power to the primary CAS in our HA pair.
Looking at the event logs, it appears that failover occurred properly;
however, the service was non-operational after the second power outage.  The
CASes were rebooted yesterday in my absence in an attempt to return them to
service, so unfortunately I did not get to see what state they were in
before the reboot.

Anyhow, my question to the group: what do you consider best practices for
monitoring the CAS and CAM servers in an HA environment?  Right now I'm only
doing a general SNMP poll from our NMS for system and interface up/down
state.  I would like to know if it makes sense to monitor service and system
stats on each of the servers as well, and which services are critical to the
operation of the system -- in short, how to determine if the system is
operational without first hearing it from end-users.
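
To make the "is it operational" question concrete, here is a minimal sketch
of the kind of service-level check I have in mind, beyond interface up/down.
The hostnames and the choice of port 443 are placeholders I made up for
illustration, not a list of the ports that are actually critical; that list
is exactly what I'm asking about.

```python
import socket

# Hypothetical hosts and ports; substitute the real CAS/CAM addresses and
# whichever ports turn out to matter for the service.
CHECKS = [
    ("cas1.example.edu", 443),
    ("cas2.example.edu", 443),
    ("cam1.example.edu", 443),
]


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def run_checks(checks, probe=port_is_open):
    """Return the (host, port) pairs for which the probe failed."""
    return [(host, port) for host, port in checks if not probe(host, port)]


if __name__ == "__main__":
    for host, port in run_checks(CHECKS):
        print(f"DOWN: {host}:{port}")
```

A successful TCP connect only shows that something is listening, not that
logins work, so I'd treat this as a floor rather than a full health check.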

I'm going to enable traps on the CAM, point them to our NMS, and use the
provided MIBs to correlate events.  Unfortunately, there's nothing similar
on the CASes (I did enable snmpd on the CASes from the CLI).  I was also
thinking of using Swatch or LogWatch to monitor syslog output from the CAM
(again, there is no syslog export equivalent on the CAS).
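
For the log-watching piece, a rough Swatch-style sketch of what I mean is
below.  It is not Swatch itself, and the patterns are guesses on my part; I
have not confirmed what strings the CAM actually writes to syslog, so treat
them as placeholders.

```python
import re
import sys

# Hypothetical patterns; the real ones depend on what the CAM actually logs.
PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"failover", r"heartbeat", r"\bfail(ed|ure)\b")
]


def matching_lines(lines, patterns=PATTERNS):
    """Return the lines (newline stripped) that match any of the patterns."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(p.search(line) for p in patterns)
    ]


if __name__ == "__main__":
    # Usage: python watch.py /path/to/collected/syslog/file
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for hit in matching_lines(fh):
            print(hit)
```

This scans a file once; Swatch would tail the file and fire an action per
match, which is what I'd want for alerting.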

Anyone doing anything in addition to this?

Thanks!

--
Dave Stempien, Network Security Engineer
University of Rochester Medical Center
Information Systems Division
585-784-2427