On 01/23/2015 02:39 PM, Bill Wichser wrote:
We had a strange event last night.  Our IB fabric started demonstrating some 
odd routing behavior over IB.

Host A could ping both B and C, yet B and C could not ping one another.  This 
was only at the IP layer.  ibping tests all worked fine.  A few runs of 
ibdiagnet produced all the switches and hosts we expected to find.
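
The symptom above (A reaches both B and C, while B and C cannot reach each other at the IP layer) can be picked out mechanically from a table of ping results. Below is a minimal sketch in Python, assuming the results have already been collected into a dict keyed by (source, destination); collecting them (for example by running ping from each host) is not shown. The function name, the dict layout, and the host names are hypothetical and only for illustration.

```python
from itertools import combinations


def broken_pairs(hosts, reach):
    """Return host pairs that fail to ping each other although a third
    host can ping both of them.

    hosts: iterable of host names.
    reach: dict mapping (src, dst) -> bool, True if src could ping dst.
           Missing entries are treated as "not reachable".
    Returns a list of (x, y, witnesses) tuples, sorted by host name.
    """
    ordered = sorted(hosts)
    results = []
    for x, y in combinations(ordered, 2):
        # The pair is healthy only if ping works in both directions.
        if reach.get((x, y), False) and reach.get((y, x), False):
            continue
        # A witness is a third host that reaches both members of the pair,
        # which suggests the pair itself is up at the IP layer.
        witnesses = [
            w for w in ordered
            if w not in (x, y)
            and reach.get((w, x), False)
            and reach.get((w, y), False)
        ]
        if witnesses:
            results.append((x, y, witnesses))
    return results


if __name__ == "__main__":
    # Hypothetical data mirroring the report: A sees B and C, B and C
    # do not see each other.
    sample = {
        ("A", "B"): True, ("B", "A"): True,
        ("A", "C"): True, ("C", "A"): True,
        ("B", "C"): False, ("C", "B"): False,
    }
    for x, y, witnesses in broken_pairs(["A", "B", "C"], sample):
        print(f"{x} <-> {y} broken, but reachable from: {', '.join(witnesses)}")
```

Running such a check periodically would at least show whether the set of broken pairs moves around over time, as it did during the host reboots described below.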

As we rebooted the hosts without connectivity, they came up fine, but then host A 
could reach neither one.  After a number of host reboots we soon realized that 
we were playing whack-a-mole as the problem resurfaced sometimes on the 
original hosts and sometimes on a new host.

In the end we rebooted every Mellanox switch.  The big core switch.  The half 
rack switch.  The top of rack switches.  And sure enough, everything came back 
fine without any more reboots.

At this point all I know is that the server running our master opensm rebooted 
and it took a few hours before these problems started, first indicated by stale 
filesystem errors across the GPFS mounts.

Obviously, rebooting every dang switch is not the correct answer here. But at 
this point I don't have a better solution if it occurs again.  Or even an 
answer as to WHY it happened in the first place.  It just seems that the IPoIB 
layer was at fault here somehow in that routing was not correct across the 
entire IB network.

If anyone has any insights, I'd be most appreciative.  It's clear we do not 
understand this aspect of the IB stack and how this layer works.

Thanks,
Bill

This reminds me of when we upgraded to SL-6.6 (approximately the same as 
CentOS-6.6 and RHEL-6.6).

The new kernel we got could not handle our IPoIB storage traffic, which 
broke down within a few hours.

As far as I have heard, Red Hat is trying to fix this. Here is a link to a 
message indicating this, which I got from NSC in Linkoping:
https://www.mail-archive.com/[email protected]/msg22511.html

Best wishes,
-- Lennart Karlsson, UPPMAX, Uppsala University, Sweden

_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
