On Sat, Dec 28, 2013 at 12:42:28AM PST, Jefferson Ogata spake thusly:
> Is it possible that it's a coincidence of log rotation after patching? In
> certain circumstances i've had library replacement or subsequent prelink
> activity on libraries lead to a crash of some services during log rotation.
> This hasn't happened to me with pacemaker/cman/corosync, but it might
> conceivably explain why it only happens to you once in a while.

I just caught the cluster in the middle of crashing again and noticed it had a
system load of 9, although it isn't clear why. A backup was running, but after
the cluster failed over the backup continued and the load dropped to very
nearly zero, so it doesn't seem like the backup was causing the issue. The
system's performance was noticeably impacted, though. I've never seen this
situation before.

One thing I really need to learn more about is how the cluster knows when
something has failed and it needs to fail over. I first set up a linux-ha
firewall cluster back around 2001, and we used simple heartbeat and some
scripts to pass around IP addresses and start/stop the firewall. It would ping
its upstream gateway and communicate with its partner via a serial cable. If
the active node couldn't ping its upstream, it killed the local heartbeat and
the partner took over. If the active node wasn't sending heartbeats, the
passive node took over. Once working it stayed working, and it was much, much
simpler than the current arrangement.
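Reconstructing from memory, the active node's check loop was roughly the
sketch below (untested Python, just to illustrate the idea; the gateway
address and the heartbeat init script path are placeholders, not our real
values):

    #!/usr/bin/env python3
    # Sketch of the old active-node check: ping the upstream gateway,
    # and after enough consecutive failures stop the local heartbeat
    # so the passive partner stops hearing us and takes over.
    import subprocess
    import time

    GATEWAY = "192.0.2.1"      # upstream gateway (placeholder)
    INTERVAL = 5               # seconds between checks
    MAX_FAILURES = 3           # consecutive failures tolerated

    def gateway_up():
        # Single ping with a 2 second timeout; True on a reply.
        return subprocess.call(
            ["ping", "-c", "1", "-W", "2", GATEWAY],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL) == 0

    failures = 0
    while True:
        failures = 0 if gateway_up() else failures + 1
        if failures >= MAX_FAILURES:
            # Upstream unreachable: stop heartbeating so the
            # partner takes over.
            subprocess.call(["/etc/init.d/heartbeat", "stop"])
            break
        time.sleep(INTERVAL)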

I have no idea how the current system actually communicates or what the
criteria for failover really are.

What are the chances that the box gets overloaded, drops a heartbeat packet,
and the partner takes over?
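From the little reading I've done so far, my understanding (which could be
wrong) is that corosync only declares the peer dead after its totem token
times out, with retransmits along the way, so a single dropped packet
shouldn't be fatal unless the box stays wedged past the timeout. The knobs
seem to live in the totem section of corosync.conf, something like this
(values are illustrative examples, not from our config):

    totem {
        version: 2
        # Milliseconds to wait for the token before declaring the
        # ring failed and reforming the membership.
        token: 5000
        # Token retransmits attempted before a node is considered lost.
        token_retransmits_before_loss_const: 10
    }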

What if I had an IP conflict with another box on the network and one of my
VIPs didn't behave as expected?
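One thing I can probably rule out myself is a duplicate IP: arp-ping the VIP
and see how many distinct MAC addresses answer. A rough sketch (needs root;
the VIP and interface are placeholders):

    #!/usr/bin/env python3
    # Arp-ping a VIP and count the distinct MACs that reply. More than
    # one answering MAC would suggest an IP conflict.
    import re
    import subprocess

    VIP = "192.0.2.10"   # placeholder VIP
    IFACE = "eth0"       # placeholder interface

    p = subprocess.Popen(["arping", "-c", "3", "-I", IFACE, VIP],
                         stdout=subprocess.PIPE)
    out = p.communicate()[0].decode()
    # iputils arping prints the replying MAC in square brackets.
    macs = set(m.lower() for m in re.findall(r"\[([0-9a-fA-F:]+)\]", out))
    print("replying MACs:", ", ".join(sorted(macs)) or "none")
    if len(macs) > 1:
        print("possible IP conflict!")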

What would any of these look like in the logs? One of my biggest difficulties
in diagnosing this is that the logs are huge and noisy. It is hard to tell
what is normal, what is an error, and which check actually failed and caused
the failover.
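In the meantime the best idea I have is to filter the logs down to just the
cluster daemons at warning-or-worse severity, along these lines (the log path
and the daemon/severity keywords are guesses for a RHEL-ish box, nothing
authoritative):

    #!/usr/bin/env python3
    # Crude syslog filter: keep only lines mentioning the cluster
    # daemons together with an error/warning severity keyword.
    import re
    import sys

    LOG = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    DAEMONS = re.compile(r"corosync|pacemakerd|crmd|pengine|stonith|lrmd")
    SEVERITY = re.compile(r"error|crit|warn", re.IGNORECASE)

    with open(LOG, errors="replace") as f:
        for line in f:
            if DAEMONS.search(line) and SEVERITY.search(line):
                sys.stdout.write(line)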

> You might take a look at the pacct data in /var/account/ for the time of the
> crash; it should indicate exit status for the dying process as well as what
> other processes were started around the same time.

Process accounting wasn't running, but /var/log/audit/audit.log is, and it has
the same info. What dying process are we talking about here? I haven't been
able to identify any processes that died.

> Yes, you're supposed to switch to cman. Not sure if it's related to your
> problem, tho.

I suspect the cman issue is unrelated, so I'm not going to mess with it until
I get the current issue figured out. I've had two more crashes since I started
this thread: one around 3am and one just this afternoon around 1pm. In both
cases a backup was running, but as before it kept running after the failover
and the load returned to normal (practically zero).

-- 
Tracy Reed

