Hi Folks,

I just finished constructing a 2-node HA cluster, with this basic configuration:
- Debian Lenny
- Xen xen-3.0-x86_32p (Debian package: 2.6.26-2-xen-686)
- Disk stack:
-- md-based RAID: RAID1 for /boot and Dom0's / and swap; RAID6 for LVM-based volumes for DomUs
-- LVM
-- DRBD 8.0.14 (current package for Debian Lenny)
- Heartbeat (current package for Debian Lenny - looks like heartbeat 2.1.3), running w/ crm on
- Several Debian Lenny DomUs (all PVs) - one of which is semi-production, the others experimental

I pretty much got everything working 3 days ago, and all seems to be working, EXCEPT that 1 or 2 times per day the system crashes - and the crash of one node seems to take down the 2nd node (not exactly what one wants in a failover environment). The nodes both auto-restart, and the production server comes back up, but still... I'd like to track down what's happening.

A few data points:

- for a couple of crashes, one of the RAID6 arrays started resyncing on reboot -- the LVM and DRBD volumes above it came up during the resync, but things were a lot slower; after the most recent crash (about an hour ago), the RAID array was fine after reboot
- I've been running sar to track load, and there does not seem to have been any noticeable change in system load leading up to the crashes (on either Dom0 or the production DomU)
- unfortunately, there was not much in my logs that looked helpful; here's what I've reconstructed

This is my backup machine, and it seems to have crashed first - these are the Dom0 syslog entries that bracket the crash and reboot:

Jun  8 12:10:01 server3 /USR/SBIN/CRON[25253]: (root) CMD (if [ -x /usr/bin/vnstat ] && [ `ls /var/lib/vnstat/ | wc -l` -ge 1 ]; then /usr/bin/vnstat -u; fi)
Jun  8 12:11:42 server3 smartd[4216]: Device: /dev/sda, SMART Usage Attribute: 194 Temperature_Celsius changed from 116 to 117
Jun  8 12:11:42 server3 smartd[4216]: Device: /dev/sdb, SMART Usage Attribute: 194 Temperature_Celsius changed from 121 to 122
Jun  8 12:11:42 server3 smartd[4216]: Device: /dev/sdc, SMART Usage Attribute: 194 Temperature_Celsius changed from 118 to 119

------ crash seems to have happened here ----------

Jun  8 12:15:46 server3 kernel: imklog 3.18.6, log source = /proc/kmsg started.

These are the syslog entries from Dom0 on the "production server":

12:14:11-21 <lots of kernel messages re. DRBD losing connection, changing device states>
12:14:40 <bunches of messages from heartbeat, crmd, cib - ending with the next three lines>

Jun  8 12:14:41 server2 cib: [4676]: info: cib_ccm_msg_callback: LOST: server3
Jun  8 12:14:41 server2 cib: [4676]: info: cib_ccm_msg_callback: PEER: server2
Jun  8 12:14:41 server2 crmd: [4680]: info: do_election_count_vote: Updated voted hash for server2 to vote

------ looks like the primary node crashed here --------

Jun  8 12:16:14 server2 kernel: imklog 3.18.6, log source = /proc/kmsg started.

So..... it looks like something happened on my backup node; heartbeat noticed it properly on the primary node, but instead of simply continuing along, the primary crashed and restarted. At that point everything came back up, but.....

So... several questions for the group:

1. Any thoughts on why the crash of one node led to the other node crashing?
1.a. Anything I might look at to glean more details (though the logs seem sort of sparse)?
1.b. Any kind of logging and/or diagnostics I should turn on to capture more details the next time around?
2. Not quite a heartbeat question, but any thoughts on diagnostics I can turn on to try to capture the original crash event?
3. Right now, my production server (DomU) will normally run on one server, then come up on the backup server if the primary server fails. But... as soon as the primary server comes back up, the DomU migrates back - and in these events, the timing is such that it only partially comes up on the backup server before the migration back starts. Somehow this doesn't seem that healthy. So....

3.a.
How do I set things up so that, after a primary-node crash, the DomU comes up on the backup machine and stays there?
3.b. As above, but in reverse: if the backup node fails and the primary node comes back up, the DomU goes back to the primary. I.e., the desired state is: run where you are, migrate on a crash, and stay there unless that node crashes or you're told to migrate.

Thanks very much,

Miles Fidelman

--
In theory, there is no difference between theory and practice. In practice, there is. .... Yogi Berra

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
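P.S. Re: 3.a - while digging around I've seen references to resource stickiness as the knob for "stay where you are after a failover." I'm guessing at something like the CIB fragment below (the property and id names are my guesses from the Heartbeat 2.x docs - I haven't verified them against my heartbeat 2.1.3 install), and would appreciate confirmation that this is the right mechanism:

```xml
<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <attributes>
      <!-- intent: resources stay on their current node after a failover,
           and only move if that node fails or an admin moves them -->
      <nvpair id="opt-default-stickiness"
              name="default_resource_stickiness"
              value="INFINITY"/>
    </attributes>
  </cluster_property_set>
</crm_config>
```

If I'm reading the docs right, this would also cover 3.b, since the DomU would then stick to whichever node it last failed over to rather than auto-migrating back.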
