Hi Folks,

I just finished constructing a 2-node HA cluster, with this basic configuration:
- Debian Lenny
- Xen xen-3.0-x86_32p (Debian package: 2.6.26-2-xen-686)
- Disk stack:
-- md-based RAID: RAID1 for /boot and Dom0's / and swap; RAID6 for LVM-based volumes for DomUs
-- LVM
-- DRBD 8.0.14 (current package for Debian Lenny)
- Heartbeat (current package for Debian Lenny - looks like heartbeat 2.1.3), running w/ crm on
- Several Debian Lenny DomUs (all PVs) - one of which is semi-production, the others experimental

I pretty much got everything working 3 days ago, and all seems to be working, EXCEPT that 1 or 2 times per day the system crashes - and the crash of one node seems to take down the 2nd node (not exactly what one wants in a failover environment). The nodes both auto-restart, and the production server comes back up, but still... I'd like to track down what's happening.

A few data points:

- for a couple of crashes, one of the RAID6 arrays started resyncing on reboot -- the LVM and DRBD volumes above it came up during the resync, but things were a lot slower; after the most recent crash (about an hour ago), the RAID array was fine after reboot
- I've been running sar to track load, and there does not seem to have been any noticeable change in system load leading up to the crashes (on either Dom0 or the production DomU)
- unfortunately, there was not much in my logs that looked helpful; here's what I've reconstructed

This is my backup machine, and it seems to have crashed first - these are the Dom0 syslog entries that bracket the crash and reboot:

Jun  8 12:10:01 server3 /USR/SBIN/CRON[25253]: (root) CMD (if [ -x /usr/bin/vnstat ] && [ `ls /var/lib/vnstat/ | wc -l` -ge 1 ]; then /usr/bin/vnstat -u; fi)
Jun  8 12:11:42 server3 smartd[4216]: Device: /dev/sda, SMART Usage Attribute: 194 Temperature_Celsius changed from 116 to 117
Jun  8 12:11:42 server3 smartd[4216]: Device: /dev/sdb, SMART Usage Attribute: 194 Temperature_Celsius changed from 121 to 122
Jun  8 12:11:42 server3 smartd[4216]: Device: /dev/sdc, SMART Usage Attribute: 194 Temperature_Celsius changed from 118 to 119

------ crash seems to have happened here ----------

Jun  8 12:15:46 server3 kernel: imklog 3.18.6, log source = /proc/kmsg started.

These are the syslog entries from Dom0 on the "production server":

12:14:11-21 <lots of kernel messages re. DRBD losing connection, changing device states>
12:14:40 <bunches of messages from heartbeat, crmd, cib - ending with the next three lines>

Jun  8 12:14:41 server2 cib: [4676]: info: cib_ccm_msg_callback: LOST: server3
Jun  8 12:14:41 server2 cib: [4676]: info: cib_ccm_msg_callback: PEER: server2
Jun  8 12:14:41 server2 crmd: [4680]: info: do_election_count_vote: Updated voted hash for server2 to vote

------ looks like the primary node crashed here --------

Jun  8 12:16:14 server2 kernel: imklog 3.18.6, log source = /proc/kmsg started.

So..... it looks like something happened on my backup node; heartbeat noticed it properly on the primary node, but instead of simply continuing along, the primary crashed and restarted. At that point everything came back up, but.....

So... several questions for the group:

1. Any thoughts on why the crash of one node led to the other node crashing?
1.a. Anything I might look at to glean more details (though the logs seem sort of sparse)?
1.b. Any kind of logging and/or diagnostics I should turn on to capture more details the next time around?
2. Not quite a heartbeat question, but any thoughts on diagnostics I can turn on to try to capture the original crash event?
3. Right now, my production server (DomU) will normally run on one server, then come up on the backup server if the primary server fails. But... as soon as the primary server comes back up, the DomU migrates back - and in these events, the timing is such that it only partially comes up on the backup server before the migration back starts. Somehow this doesn't seem that healthy. So....

3.a.
How do I set things up so that, after a primary-node crash, the DomU comes up on the backup machine and stays there?
3.b. As above, but in reverse: if the backup node fails and the primary node comes back up, the DomU goes back to the primary. I.e., the desired state is: run where you are, migrate on a crash, and stay there unless that node crashes or you're told to migrate.

Thanks very much,

Miles Fidelman

--
In theory, there is no difference between theory and practice. In practice, there is. .... Yogi Berra

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
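P.S. Re: 3.a - while digging around I've seen references to resource stickiness as the knob for "stay where you are after a failover." I'm guessing at something like the CIB fragment below (the property and id names are my guesses from the Heartbeat 2.x docs - I haven't verified them against my heartbeat 2.1.3 install), and would appreciate confirmation that this is the right mechanism:

```xml
<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <attributes>
      <!-- intent: resources stay on their current node after a failover,
           and only move if that node fails or an admin moves them -->
      <nvpair id="opt-default-stickiness"
              name="default_resource_stickiness"
              value="INFINITY"/>
    </attributes>
  </cluster_property_set>
</crm_config>
```

If I'm reading the docs right, this would also cover 3.b, since the DomU would then stick to whichever node it last failed over to rather than auto-migrating back.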
