On 18/05/12 15:04, Lars Ellenberg wrote:
On Wed, May 16, 2012 at 09:11:05PM +0100, Matthew Bloch wrote:
I'm trying to understand a symptom for a client who uses drbd to run
sets of virtual machines between three pairs of servers (v1a/v1b,
v2a/v2b, v3a/v3b), and I wanted to understand a bit better how DRBD I/O
is buffered depending on what mode is chosen, and buffer settings.

Firstly, it surprised me that even in replication mode "A", the system
still seemed limited by the bandwidth between nodes.  I found this
out when the customer's bonded interface had flipped over to its 100Mb
backup connection, and suddenly they had I/O problems.  While I was
investigating this and running tests, I noticed that switching to mode A
didn't help, even when measuring short transfers that I'd expect would
fit into reasonable-sized buffers.  What kind of buffer size can I
expect from an "auto-tuned" DRBD?  It seems important to be able to
cover bursts without leaning on the network, so I'd like to know whether
that's possible with some special tuning.

Uhm, well,
we have invented the DRBD Proxy specifically for that purpose.

That's useful to know - so the kernel buffering, however it's configured, isn't really set up to handle longer delays? I don't think that's my problem, though: the ICMP ping time between the servers is <1ms and doesn't drop out even while DRBD reports it hasn't seen its own pings. It's gigabit ethernet all the way, on a private LAN.
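For reference, this is the sort of tuning I'd been looking at (a sketch - the resource name and size are illustrative; sndbuf-size 0 is the auto-tuned default I was asking about):

```
resource r0 {
  protocol A;           # asynchronous replication
  net {
    sndbuf-size 1M;     # TCP send buffer; 0 (the default) lets it auto-tune
  }
}
```

With protocol A, my understanding is that the burst a writer can absorb before blocking on the network is roughly bounded by this send buffer.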

The other problem is the "PingAck not received" messages that have been
littering the logs of the v3a/v3b servers for the last couple of weeks,
e.g. this has been happening every few hours for one DRBD or another:

May 14 08:21:45 v3b kernel: [661127.869500] block drbd10: PingAck did
not arrive in time.

Increase ping timeout?

I did that (it's now at 3s, up from 0.5s) but I still get reconnections.
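(For anyone following along: ping-timeout is given in tenths of a second in the net section, so 3s looks like this - resource name illustrative:)

```
resource r0 {
  net {
    ping-int     10;    # seconds between DRBD keepalive pings
    ping-timeout 30;    # tenths of a second: 30 = 3s; default is 5 = 0.5s
  }
}
```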

I set up two pairs of VMs to write 1MB to the DRBD device every second and time it. On the problematic machines I saw many writes that took more than 10s, and a couple of those corresponded with DRBD reconnections. On the normal machines, only two of the writes took more than 0.1s!
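The probe was essentially the following (a sketch: on the real hosts the target would be the DRBD device, e.g. a /dev/drbdN; here a scratch file stands in so the script runs anywhere):

```shell
#!/bin/sh
# Timed-write probe: once a second, write 1 MiB with fsync and report
# how long the write took.  TARGET is illustrative.
TARGET="${1:-/tmp/drbd_write_probe.dat}"

for i in 1 2 3; do
    start=$(date +%s.%N)
    # Write 1 MiB and force it to stable storage, as the VM test did.
    dd if=/dev/zero of="$TARGET" bs=1M count=1 conv=fsync 2>/dev/null
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" -v n="$i" \
        'BEGIN { printf "write %d took %.3fs\n", n, e - s }'
    sleep 1
done
```

On a healthy gigabit link each write should complete well under 0.1s; multi-second outliers are what pointed me at the bad pair.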

So I'm still hunting for what might be going wrong: the software versions are the same, the drbd links aren't hitting the ceiling, and they're doing no more I/O than the "good" pairs. Next I think I'll take some packet dumps to see whether anything odd is going on at the TCP layer.

If nobody else on the list has seen this sort of behaviour, and Linbit have a day rate :-), please get in touch privately - I'd rather get you guys to fix this for our customer.

Best wishes,

--
Matthew Bloch                             Bytemark Hosting
                                http://www.bytemark.co.uk/
                                  tel: +44 (0) 1904 890890
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user