Hi all,

Over the weekend we twice lost power to the primary CAS in our HA pair. Judging by the event logs, failover occurred properly both times; however, the service was non-operational after the second power outage. The CASes were rebooted yesterday, in my absence, in an attempt to return them to service, so I did not get to see what state they were in before the reboot.
Anyhow, my question to the group: what do you consider best practice for monitoring the CAS and CAM servers in an HA environment?

Right now I'm only doing a general SNMP poll from our NMS for system and interface up/down state. I would like to know whether it makes sense to monitor service and system stats on each of the servers as well, and which services are critical to the operation of the system. In short, how do I determine that the system is operational without first hearing otherwise from end-users?

I'm going to enable traps on the CAM, point them at our NMS, and use the provided MIBs to correlate events. Unfortunately, there's nothing similar on the CASes. (I did enable snmpd on the CASes from the CLI.) I was also thinking of using Swatch or LogWatch to monitor syslog output from the CAM (again, there is no syslog export equivalent on the CAS).

Is anyone doing anything in addition to this?

Thanks!

--
Dave Stempien, Network Security Engineer
University of Rochester Medical Center
Information Systems Division
585-784-2427
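
P.S. To make the "is it operational?" question concrete, here is a minimal sketch of the kind of service-level check I have in mind, to run alongside the SNMP up/down poll. It only tests whether a TCP port accepts connections. The hostnames and the choice of port 443 are placeholders I made up for illustration, not a statement of which services are actually critical on the CAM or CAS; that is exactly the part I am asking about.

```python
import socket

# Hypothetical targets: placeholder hostnames and ports, not real servers.
# Replace with the actual CAM/CAS addresses and the services that matter.
TARGETS = [
    ("cam1.example.edu", 443),
    ("cas1.example.edu", 443),
]


def tcp_service_up(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def failed_targets(targets, timeout=3.0):
    """Return the (host, port) pairs that did not accept a connection."""
    return [(h, p) for h, p in targets if not tcp_service_up(h, p, timeout)]
```

Calling `failed_targets(TARGETS)` from cron or from the NMS and alerting on a non-empty result would at least catch the case where the box answers SNMP but a service is not listening. It would not prove that the service is actually working end to end, so it would supplement the traps and log watching rather than replace them.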
