On Sat, 8 Jan 2011, Steve Thompson wrote:
CentOS 5.5, x86_64, drbd 8.3.8, Dell PE2900 servers w/16GB memory. The
replication link is a dual GbE bonded pair (point-to-point, no switches)
in balance-rr mode with MTU=9000. Using tcp_reordering=127.
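For reference, on CentOS 5 that setup corresponds to configuration along these lines (the device names, addresses, and miimon value are placeholders, not my actual files):

```shell
# /etc/modprobe.conf -- load the bonding driver in round-robin mode
alias bond0 bonding
options bond0 mode=balance-rr miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0 -- point-to-point replication link
DEVICE=bond0
BOOTPROTO=static
IPADDR=192.168.100.1    # placeholder; the real link is a private subnet
NETMASK=255.255.255.0
MTU=9000
ONBOOT=yes

# /etc/sysctl.conf -- tolerate the heavy packet reordering that
# balance-rr striping produces on a TCP stream
net.ipv4.tcp_reordering = 127
```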
I reported that a resync failed and restarted every minute or so for a
couple of weeks. I have found the cause, but am not sure of the solution.
I'll try to keep it short.
First, I swapped cables, cards, systems, etc., to be sure of the
integrity of the hardware. All hardware checks out OK.
Second, I was using a data-integrity-alg of sha1 or crc32c (I tried both).
Only when this was removed from the configuration was I able to get a full
resync to complete. There is an ext3 file system on the drbd volume, but
it is quiet; this is a non-production test system.
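For context, the option in question lives in the net section of drbd.conf; a rough sketch (the resource name "r0" is a placeholder):

```shell
# drbd.conf fragment -- resource name is a placeholder
resource r0 {
  net {
    # tried both sha1 and crc32c here; a full resync only
    # completed after this line was removed entirely
    data-integrity-alg sha1;
  }
}
```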
After this, a verify pass showed several out-of-sync blocks. I disconnected
and reconnected, then re-ran the verify pass: more out-of-sync blocks, but
in different places. Rinse and repeat; verify was never clean, and the
out-of-sync blocks were never in the same place twice.
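The cycle I was repeating, roughly (again, "r0" is a placeholder resource name; out-of-sync counts show up in the kernel log):

```shell
# run an online verify pass; out-of-sync blocks are reported via dmesg
drbdadm verify r0

# disconnect and reconnect so the flagged blocks get resynced
drbdadm disconnect r0
drbdadm connect r0

# then verify again -- each pass reported new out-of-sync blocks
# in different locations than the previous pass
drbdadm verify r0
```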
I changed the MTU to 1500. No difference; still couldn't get a clean verify.
I changed tcp_reordering to 3. No difference (and no difference in
performance, either).
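For the record, those were runtime changes, along the lines of:

```shell
# drop the bond to a standard MTU -- made no difference to the verify results
ifconfig bond0 mtu 1500

# back tcp_reordering off to the kernel default of 3 -- also no difference
sysctl -w net.ipv4.tcp_reordering=3
```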
Finally, I shut down half of the bonded pair on each system, so I'm using
effectively a single GbE link with MTU=9000 and tcp_reordering=127. Wow,
now everything is working fine; syncs are clean, verifies are clean,
violins are playing.
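By "shut down half of the bonded pair" I mean detaching one slave from the bond on each node, roughly (eth1 is a placeholder for the actual slave interface):

```shell
# detach one slave so the bond runs over a single GbE link;
# with only one physical path there is no round-robin striping
# and hence no packet reordering on the replication stream
ifenslave -d bond0 eth1
```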
My question is: WTF? I'd really like to get the bonded pair working
again, for redundancy and performance, but it very quickly falls apart in
that configuration. I'd appreciate any insight that anyone can give.
Steve
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user