On Sat, 8 Jan 2011, Steve Thompson wrote:
CentOS 5.5, x86_64, drbd 8.3.8, Dell PE2900 servers w/16GB memory. The
replication link is a dual GbE bonded pair (point-to-point, no switches)
in balance-rr mode with MTU=9000. Using tcp_reordering=127.
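For reference, on CentOS 5 that setup corresponds to configuration along these lines (the device names, addresses, and miimon value are placeholders, not my actual files):

```shell
# /etc/modprobe.conf -- load the bonding driver in round-robin mode
alias bond0 bonding
options bond0 mode=balance-rr miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0 -- point-to-point replication link
DEVICE=bond0
BOOTPROTO=static
IPADDR=192.168.100.1    # placeholder; the real link is a private subnet
NETMASK=255.255.255.0
MTU=9000
ONBOOT=yes

# /etc/sysctl.conf -- tolerate the heavy packet reordering that
# balance-rr striping produces on a TCP stream
net.ipv4.tcp_reordering = 127
```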
I reported that a resync failed and restarted every minute or so for a
couple of weeks. I have found the cause, but am not sure of the solution.
I'll try to keep it short.
First, I swapped cables, cards, systems, etc., to be sure of the
integrity of the hardware. All hardware checks out OK.
Second, I was using a data-integrity-alg of sha1 or crc32c (I tried both).
Only when this was removed from the configuration was I able to get a full
resync to complete. There is an ext3 file system on the drbd volume, but
it is quiet; this is a non-production test system.
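For context, the option in question lives in the net section of drbd.conf; a rough sketch (the resource name "r0" is a placeholder):

```shell
# drbd.conf fragment -- resource name is a placeholder
resource r0 {
  net {
    # tried both sha1 and crc32c here; a full resync only
    # completed after this line was removed entirely
    data-integrity-alg sha1;
  }
}
```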
After this, a verify pass showed several out-of-sync blocks. I disconnected
and reconnected, then re-ran the verify pass: more out-of-sync blocks, but
in different places. Rinse and repeat; verify was never clean, and the
out-of-sync blocks were never in the same place twice.
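The cycle I was repeating, roughly (again, "r0" is a placeholder resource name; out-of-sync counts show up in the kernel log):

```shell
# run an online verify pass; out-of-sync blocks are reported via dmesg
drbdadm verify r0

# disconnect and reconnect so the flagged blocks get resynced
drbdadm disconnect r0
drbdadm connect r0

# then verify again -- each pass reported new out-of-sync blocks
# in different locations than the previous pass
drbdadm verify r0
```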
I changed the MTU to 1500. No difference; still couldn't get a clean verify.
I changed tcp_reordering to 3. No difference (and no difference in
performance, either).
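For the record, those were runtime changes, along the lines of:

```shell
# drop the bond to a standard MTU -- made no difference to the verify results
ifconfig bond0 mtu 1500

# back tcp_reordering off to the kernel default of 3 -- also no difference
sysctl -w net.ipv4.tcp_reordering=3
```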
Finally, I shut down half of the bonded pair on each system, so I'm using
effectively a single GbE link with MTU=9000 and tcp_reordering=127. Wow,
now everything is working fine; syncs are clean, verifies are clean,
violins are playing.
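By "shut down half of the bonded pair" I mean detaching one slave from the bond on each node, roughly (eth1 is a placeholder for the actual slave interface):

```shell
# detach one slave so the bond runs over a single GbE link;
# with only one physical path there is no round-robin striping
# and hence no packet reordering on the replication stream
ifenslave -d bond0 eth1
```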
My question is: WTF? I'd really like to get the bonded pair working
again, for redundancy and performance, but it very quickly falls apart in
that configuration. I'd appreciate any insight that anyone can give.
Steve
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user