On 24/05/12 13:54, Florian Haas wrote:
> On Thu, May 24, 2012 at 1:53 PM, Matthew Bloch <[email protected]> wrote:
>> Hmm, thanks. Unrelated to any of this, the v3a kernel (Debian 2.6.32-4)
>> crashed pretty badly 48 hours ago. Since it was rebooted there have
>> been no "PingAck not received" messages.
>
> Sure, if you have kernel-induced network problems on one of your
> nodes, that would definitely explain the issues you're seeing. But you
> insisted from the start that there were no network issues. :)

No indeed, nothing external that we could detect after hours of layer 2
tracing, and no messages that would indicate a malfunction on either of
the hosts. But this network problem was only visible via DRBD's
messages, and now that it's gone it's hard to reason about it any
further (not that I miss it). As I said, I couldn't see any symptoms
via ICMP- or TCP-based tests between the hosts.
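
For what it's worth, the tests were roughly this sort of thing - the
addresses here are illustrative, not our real replication IPs:

    # sustained ICMP between the replication interfaces,
    # watching for loss or latency spikes
    ping -i 0.2 -c 3000 192.168.10.2

    # TCP throughput check with iperf, against "iperf -s" on the peer
    iperf -c 192.168.10.2 -t 300 -i 10

Neither showed any loss or throughput problems.
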
>> We are preparing to jump to a 2.6.32 sourced from CentOS because this
>> Debian kernel seems to crash with one bug or another every few months.
>
> That would seem like an odd thing to do. FWIW, we've been running
> happily on squeeze kernels for months.

Then you've not hit the "scheduler divide by zero" bug, or the "I/O
frozen for 120s for no reason" bug, or the "CPU#x stuck for 9999999s"
bug? These are all things that are filed vaguely on the Red Hat bug
trackers, as far as I know, and usually closed a few kernel versions
later with "well, I haven't seen it for a few kernel versions, so it's
probably OK"!

These are relatively rare bugs, except for some of our customers, for
whom they're not rare at all; those we haul up to e.g. whatever wheezy
has. Except in this case they broke the bridging code in 3.2.0, which
is going to cause a virtualising customer some problems :-)

>> The reason we're using external meta-devices is for backup: without the
>> metadata at the end, the underlying disk image represents exactly what
>> the VMs see. We can then snapshot this and take a reasonably consistent
>> backup without bothering DRBD. We later verify this backup by booting it
>> back up, disconnected, and taking a snapshot of the VNC console!
>
> You can always do that from a device with metadata as well. kpartx is
> your friend.

Sure, but we don't pay any penalty for doing it externally either. It's
all on LVM and proper battery-backed RAID.
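
To make the backup step concrete, it's roughly the following (VG/LV
names, sizes and the backup host are made up for illustration; the real
script has more error handling):

    # snapshot the LV that backs the DRBD device
    lvcreate --snapshot --size 20G --name vm1-backup /dev/vg0/vm1

    # with external metadata the snapshot is byte-for-byte what the VM
    # sees, so it can be copied off without touching DRBD at all
    dd if=/dev/vg0/vm1-backup bs=4M | ssh backuphost 'cat > vm1.img'

    # (with internal metadata you'd run kpartx -av on the snapshot
    # first, as Florian says, to map the partitions inside it)

    lvremove -f /dev/vg0/vm1-backup
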
>> The reason I picked protocol B is because LVM snapshots kill the local
>> DRBD performance if we snapshot the LVM device underlying the DRBD
>> Primary. If we snapshot the Secondary and use protocol B, so that we
>> aren't dependent on local write speeds, my working theory was that the
>> performance hit wouldn't be as noticeable, and the customer seemed to
>> concur (previously we were using C).
>
> That's a fair point, but realistically, how long does it take you to
> take the backup off your snapshot?

10-60 minutes per system. Long enough that the I/O-sensitive VMs
notice. And the customer has customers who are up 24 hours a day, so
there is no reliable "quiet time" when we can reduce their I/O
bandwidth and not have it commented on.
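
For reference, the resource definition now looks roughly like this
(hostnames, device paths and addresses are illustrative):

    resource vm1 {
      protocol B;   # was C; B so the Primary isn't held up by the
                    # Secondary's local write latency during snapshots
      on host-a {
        device    /dev/drbd1;
        disk      /dev/vg0/vm1;
        meta-disk /dev/vg0/vm1-meta[0];  # external metadata, off the data LV
        address   192.168.10.1:7789;
      }
      on host-b {
        device    /dev/drbd1;
        disk      /dev/vg0/vm1;
        meta-disk /dev/vg0/vm1-meta[0];
        address   192.168.10.2:7789;
      }
    }
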
> And does this normally coincide with the DRBD device getting hammered,
> which is pretty much the only situation in which a downstream client
> would likely feel any disruption?

The DRBDs don't really get hammered at any one time - the backups
happen directly from LVs on the host, and go over the main (not
replication) interface. So the host system's I/O is stressed, sure.

Previously the disconnects happened several times a day, not just when
the backups ran - this is a separate issue from the one I asked about,
while still being relevant to the list.

Arguably a customer running a heavily interactive system to very remote
destinations shouldn't be using such a complex I/O stack, and should use
dedicated hardware. This is a pragmatic, expensive, unambitious argument
:-) But drbd+LVM has worked very well for them for 18 months, and the
peace of mind of being able to start their customers' VMs in one of two
places makes diagnosing this properly worth the effort.

--
Matthew