On Tue, May 11, 2010 at 10:35 PM, Mike Sweetser <[email protected]> wrote:
> Hello,
>
> I've set up a DRBD and Heartbeat configuration communicating over an
> Internet connection, rather than internal. The servers are running CentOS
> 5.4, with DRBD 8.3.2 and Heartbeat 3.0.3, out of the CentOS repository.
>
> I start seeing these in the ha-log.
>
> ERROR: Message hist queue is filling up (500 messages in queue)
>
> Then I see a bunch of these:
>
> WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 20 ms (> 10 ms) (GSource: 0x1c3025c0)
>
> And finally:
>
> May 08 05:33:19 mysql1 heartbeat: [5536]: CRIT: Killing pid 5533 with SIGTERM
> May 08 05:33:19 mysql1 heartbeat: [5536]: CRIT: Killing pid 5537 with SIGTERM
> May 08 05:33:19 mysql1 heartbeat: [5536]: CRIT: Killing pid 5538 with SIGTERM
> May 08 05:33:19 mysql1 heartbeat: [5536]: CRIT: Killing pid 5539 with SIGTERM
> May 08 05:33:19 mysql1 heartbeat: [5536]: CRIT: Killing pid 5540 with SIGTERM
> May 08 05:33:19 mysql1 heartbeat: [5536]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
>
> At this point, Heartbeat on MySQL1 is dead, but because it died, it didn't
> let the resources go, and DRBD is still mounted on the first server, meaning
> the backup can't take over.
>
> DRBD has continued running, and the latency between servers is very low
> (9ms).
>
> Here's my ha.cf:
>
> logfacility local0
> debug 1
> debugfile /var/log/ha-debug
> logfile /var/log/ha-log
> node mysql1
> node mysql2
> keepalive 2
> deadtime 60
> initdead 120
> warntime 15
> udpport 694
> ucast eth1 66.165.231.34
> ucast eth1 67.218.128.19
> auto_failback on
> crm yes
>
> Here's my CRM config:
>
> node $id="23b44f0c-55fb-4b21-bf2e-81c15f28816d" mysql2
> node $id="96c549d6-3e8c-4f7a-a644-cdc08dd99e41" mysql1
> primitive drbd heartbeat:drbddisk \
>         params 1="mysql" \
>         op monitor interval="30s" timeout="30s"
> primitive fs ocf:heartbeat:Filesystem \
>         params fstype="ext3" directory="/mnt/mysql" device="/dev/drbd1" \
>         op monitor interval="30s" timeout="40s"
> primitive mysql ocf:heartbeat:mysql \
>         params binary="/usr/bin/mysqld_safe" datadir="/mnt/mysql" \
>         op monitor interval="30s" timeout="40s"
> group mysql-group drbd fs mysql
> location group-master mysql-group \
>         rule $id="group-master-rule" 100: #uname eq mysql1
> property $id="cib-bootstrap-options" \
>         dc-version="1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7" \
>         cluster-infrastructure="Heartbeat" \
>         stonith-enabled="false" \
>         last-lrm-refresh="1268950841"
>
> What am I missing?

A reliable (and fast) internet connection combined with very aggressive timeouts in ha.cf
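
To put numbers on the timeouts under discussion, here is a rough sketch that works out how many keepalive intervals fit inside each timer from the posted ha.cf. It assumes the bare numbers are seconds (Heartbeat's default unit when no suffix is given); the helper names parse_timers and missed_beats are made up for this illustration and are not part of Heartbeat.

```python
# Timer lines copied from the ha.cf quoted in this thread.
# Assumption: values without a unit suffix are seconds.
HA_CF = """\
keepalive 2
deadtime 60
initdead 120
warntime 15
"""


def parse_timers(text):
    """Return {directive: seconds} for simple 'name value' lines."""
    timers = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            timers[parts[0]] = int(parts[1])
    return timers


def missed_beats(timers):
    """Count the whole keepalive intervals that fit into each other timer."""
    keepalive = timers["keepalive"]
    return {
        name: value // keepalive
        for name, value in timers.items()
        if name != "keepalive"
    }


timers = parse_timers(HA_CF)
beats = missed_beats(timers)
for name, count in beats.items():
    print(f"{name}: {timers[name]}s = {count} keepalive intervals")
```

This only does the arithmetic on the values already posted; it says nothing about what the right values are for a given link.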
