Marco Barbero wrote:

I'm experiencing a nasty kernel soft lockup on one cluster.  Have to
say I have tons of clusters using the same config and all are working fine

i recently experienced cluster lockup issues because my ethernet adapters (4 per server), which were bonded as "bond0", inadvertently had some offload options enabled which "appear to be incompatible" with intra-cluster communications, i.e.:

tx-checksumming was on; and
scatter-gather was on; and
TSO was on; and
generic segmentation offload was on.

i noted that these options can cause problems on both bonded and non-bonded intf's. i verified the error using wireshark: i saw tcp and udp checksum errors coming from one server with these options set.
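to check whether an intf currently has any of these four options on, something like the following might help. it is only a sketch: `offloads_on` is a made-up helper name, and it assumes the feature labels that `ethtool -k` prints (tx-checksumming, scatter-gather, tcp-segmentation-offload, generic-segmentation-offload), which may differ slightly between ethtool versions.

```shell
# Sketch: read `ethtool -k <intf>` output on stdin and print which of the
# four offloads mentioned above are currently on.
# Assumption: the labels below match what your ethtool version prints.
offloads_on() {
    awk -F': *' '
        $1 ~ /^[ \t]*(tx-checksumming|scatter-gather|tcp-segmentation-offload|generic-segmentation-offload)$/ && $2 ~ /^on/ {
            gsub(/^[ \t]+/, "", $1)   # drop any leading indentation
            print $1
        }'
}
```

typical use would be `ethtool -k eth0 | offloads_on`; no output means none of the four is on for that intf.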

if you do not bond, you can reset these with:

ethtool -K eth"x" tx off sg off tso off gso off

if you bond, you'll probably be unable to reset these with ethtool on the bonded intf itself; instead, run the same command on each of the eth"x" intf's (even tho they are slaves) and the bonded intf will reset automagically. i verified the fix using wireshark: the server with the checksum errors is now behaving nicely.
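for the bonded case, the per-slave reset can be scripted roughly as below. this is a sketch under assumptions: `print_offload_cmds` is a hypothetical helper, and it relies on the bonding driver listing its slaves in /sys/class/net/<bond>/bonding/slaves. it only prints the ethtool commands so they can be reviewed before anything is changed.

```shell
# Sketch: print (not run) the ethtool command for every slave of a bond.
# Assumption: the slaves are listed, space-separated, in the bonding
# driver's sysfs file; a different file can be passed as the 2nd argument.
print_offload_cmds() {
    bond="${1:-bond0}"
    slaves_file="${2:-/sys/class/net/$bond/bonding/slaves}"
    if [ ! -r "$slaves_file" ]; then
        echo "cannot read $slaves_file" >&2
        return 1
    fi
    for slave in $(cat "$slaves_file"); do
        echo "ethtool -K $slave tx off sg off tso off gso off"
    done
}
```

once the printed commands look right, `print_offload_cmds bond0 | sh` (as root) would apply them. the settings are not persistent across reboots, so they would also need to go into whatever network config your distro uses.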

hth
yvette hirth
