On Sat, Dec 28, 2013 at 12:42:28AM PST, Jefferson Ogata spake thusly:

> Is it possible that it's a coincidence of log rotation after patching? In
> certain circumstances I've had library replacement or subsequent prelink
> activity on libraries lead to a crash of some services during log rotation.
> This hasn't happened to me with pacemaker/cman/corosync, but it might
> conceivably explain why it only happens to you once in a while.
I just caught the cluster in the middle of crashing again and noticed it had a
system load of 9, although it isn't clear why. A backup was running, but after
the cluster failed over the backup continued and the load dropped to very
nearly zero, so it doesn't seem like the backup was causing the issue. The
system's performance was noticeably impacted, though; I've never seen this
situation before.

One thing I really need to learn more about is how the cluster knows when
something has failed and it needs to fail over. I first set up a Linux-HA
firewall cluster back around 2001, using simple heartbeat and some scripts to
pass around IP addresses and start/stop the firewall. It would ping its
upstream gateway and communicate with its partner over a serial cable. If the
active node couldn't ping its upstream gateway, it killed the local heartbeat
and the partner took over; if the active node wasn't sending heartbeats, the
passive node took over. Once working it stayed working, and it was much, much
simpler than the current arrangement (a rough sketch of that old config is
below).

I have no idea how the current system actually communicates or what the
criteria for failover really are (my notes so far on that are also below).
What are the chances that the box gets overloaded, drops a packet, and the
partner takes over? What if I had an IP conflict with another box on the
network and one of my VIP addresses didn't behave as expected? What would any
of these look like in the logs? One of my biggest difficulties in diagnosing
this is that the logs are huge and noisy: it is hard to tell what is normal,
what is an error, and which failing check actually caused the failover.

> You might take a look at the pacct data in /var/account/ for the time of
> the crash; it should indicate exit status for the dying process as well as
> what other processes were started around the same time.

Process accounting wasn't running, but /var/log/audit/audit.log is, which has
much the same info. Which dying process are we talking about here? I haven't
been able to identify any processes that died. (The commands I plan to use on
the logs and accounting data next time are below as well.)

> Yes, you're supposed to switch to cman. Not sure if it's related to your
> problem, tho.

I suspect the cman issue is unrelated, so I'm not going to mess with it until
I get the current issue figured out.

I've had two more crashes since I started this thread: one around 3am and one
just this afternoon around 1pm. Again a backup was running, but after the
cluster failed over the backup kept running and the load returned to normal
(practically zero).
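For reference, the old 2001 setup looked roughly like this. This is from
memory, so the node names, addresses, and the firewall resource script are
placeholders rather than my actual config:

  # /etc/ha.d/ha.cf -- two-node heartbeat v1 sketch
  serial /dev/ttyS0    # heartbeats over the crossover serial cable
  bcast eth1           # second heartbeat path on a dedicated NIC
  keepalive 2          # send a heartbeat every 2 seconds
  deadtime 30          # declare the peer dead after 30 seconds of silence
  ping 192.0.2.1       # upstream gateway used as the reachability check
  respawn hacluster /usr/lib/heartbeat/ipfail  # fail over if the ping target is lost
  auto_failback off
  node fw1
  node fw2

  # /etc/ha.d/haresources -- preferred node, VIP, service to start/stop
  fw1 192.0.2.10 firewall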
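As for the current stack, my understanding so far (which may well be wrong) is
that corosync passes a token around the ring and declares a node lost when the
token doesn't come back within its timeout, and pacemaker then reacts to the
membership change. If that's right, a badly overloaded box could in principle
miss the deadline and get failed away from, which would fit the load of 9 I
saw. The totem knobs and a couple of one-shot status commands, with example
values only, not recommendations:

  # /etc/corosync/corosync.conf -- totem timing (illustrative values)
  totem {
      version: 2
      token: 5000                              # ms of token silence before a node is declared lost
      token_retransmits_before_loss_const: 10  # retransmits attempted before giving up
  }

  corosync-cfgtool -s   # ring status as corosync sees it
  crm_mon -1            # membership and resource state from pacemaker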
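And the commands I intend to try next time, for cutting through the log noise
and capturing process accounting data. Daemon names are taken from my
pacemaker logs and paths are per CentOS/RHEL, so adjust to taste:

  # First pass at separating errors from routine chatter:
  grep -E 'corosync|crmd|pengine|stonith|lrmd' /var/log/messages \
      | grep -iE 'error|warn|fail'

  # Enable process accounting so the next crash records what exited and how:
  yum install psacct
  chkconfig psacct on
  service psacct start
  lastcomm              # afterwards: commands, users, and status flags

  # Pull today's window out of the audit log in readable form:
  ausearch -ts today -i | less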
-- 
Tracy Reed
