On Wed, May 12, 2010 at 3:16 AM, Lars Ellenberg <[email protected]> wrote:
> On Tue, May 11, 2010 at 01:35:17PM -0700, Mike Sweetser wrote:
> > Hello,
> >
> > I've set up a DRBD and Heartbeat configuration communicating over an
> > Internet connection, rather than internal. The servers are running
> > CentOS 5.4, with DRBD 8.3.2 and Heartbeat 3.0.3, out of the CentOS
> > repository.
> >
> > I start seeing these in the ha-log:
> >
> > ERROR: Message hist queue is filling up (500 messages in queue)
> >
> > Then I see a bunch of these:
> >
> > WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took
> > too long to execute: 20 ms (> 10 ms) (GSource: 0x1c3025c0)
> >
> > And finally:
>
> What is before this?
> Below is "MCP dead" (Master Control Process)...
> it should log why it died.
> Or there should be some core file below
> find /var/lib/heartbeat/cores/
> Or both.

This is what comes right before it:

May 11 17:38:33 mysql1 crmd: [904]: notice: run_graph: Transition 1 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-2440.bz2): Complete
May 11 17:38:33 mysql1 crmd: [904]: info: te_graph_trigger: Transition 1 is now complete
May 11 17:38:33 mysql1 crmd: [904]: info: notify_crmd: Transition 1 status: done - <null>
May 11 17:38:33 mysql1 crmd: [904]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
May 11 17:38:33 mysql1 crmd: [904]: info: do_state_transition: Starting PEngine Recheck Timer
May 11 17:38:33 mysql1 pengine: [23821]: info: process_pe_message: Transition 1: PEngine Input stored in: /var/lib/pengine/pe-input-2440.bz2
May 11 17:45:28 mysql1 cib: [900]: info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min

That's what appears before all those messages. Right before that, it actually said it had lost the connection to the other server, but it came back right away.

> > May 08 05:33:19 mysql1 heartbeat: [5536]: CRIT: Killing pid 5533 with SIGTERM
> > May 08 05:33:19 mysql1 heartbeat: [5536]: CRIT: Killing pid 5537 with SIGTERM
> > May 08 05:33:19 mysql1 heartbeat: [5536]: CRIT: Killing pid 5538 with SIGTERM
> > May 08 05:33:19 mysql1 heartbeat: [5536]: CRIT: Killing pid 5539 with SIGTERM
> > May 08 05:33:19 mysql1 heartbeat: [5536]: CRIT: Killing pid 5540 with SIGTERM
> > May 08 05:33:19 mysql1 heartbeat: [5536]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
> >
> > logfacility local0
> > debug 1
> > debugfile /var/log/ha-debug
> > logfile /var/log/ha-log
>
> maybe you should use logd?

How would that affect Heartbeat crashing?

> > node mysql1
> > node mysql2
> > keepalive 2
> > deadtime 60
> > initdead 120
> > warntime 15
> > udpport 694
> > ucast eth1 66.165.231.34
> > ucast eth1 67.218.128.19
>
> You should add an additional link.
> Really.

What kind of link should be added? This is the first time I've done an external setup like this - previous setups were all on internal networks with directly connected machines.

> > auto_failback on
> > crm yes
>
> Are you short on memory, or under memory pressure?

The servers each have 16 GB.

> Are UDP packets dropped?

I'm not seeing packets drop - the DRBD setup seems fine over the same connection.

> Packet loss somewhere?
> Message corruption?
> Firewalled in one direction?

As far as I can tell there's no packet loss or corruption, and there's no firewall between the two.
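A few follow-up sketches on the points above, in case they're useful.

Re packet loss: something like the following should show whether the heartbeat traffic is actually arriving, and whether the kernel is dropping UDP on the receiving side (port and interface per the ha.cf above):

  # watch heartbeat packets on the wire (udpport 694, link eth1)
  tcpdump -ni eth1 udp port 694

  # kernel UDP counters; "packet receive errors" climbing over time
  # would point at drops on the receiving side
  netstat -su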
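Re the additional link: if I follow the suggestion, it means a second, independent ucast path in ha.cf, so the nodes can still exchange heartbeats when the public route misbehaves. A rough sketch - the interface and addresses are made up for illustration (e.g. a VPN tunnel between the two machines), not our real setup:

  # hypothetical second heartbeat path over a private/VPN interface
  ucast tun0 10.10.0.1
  ucast tun0 10.10.0.2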
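Re logd: as far as I understand it, switching to logd just means adding the line below to ha.cf, so heartbeat hands log writes off to the logging daemon instead of risking a block on disk I/O itself:

  use_logd yes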
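Re the core file: if one does turn up under /var/lib/heartbeat/cores/, I assume a backtrace would come from something like this - the binary path is my guess for the CentOS package, and the core filename is only an example:

  find /var/lib/heartbeat/cores/ -type f
  gdb /usr/lib64/heartbeat/heartbeat /var/lib/heartbeat/cores/root/core.1234
  # then "bt" at the gdb prompt for the backtrace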
Mike Sweetser
