On 01/23/2015 02:39 PM, Bill Wichser wrote:
We had a strange event last night. Our IB fabric started showing some
odd routing behavior.
Host A could ping both B and C, yet B and C could not ping one another. This
was only at the IP layer. ibping tests all worked fine. A few runs of
ibdiagnet produced all the switches and hosts we expected to find.
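The symptom described above (A reaches B and C, but B and C cannot reach each other) can be picked out mechanically once pairwise IP-level ping results have been collected. The following is only a rough sketch: the function name, the host labels, and the sample results are hypothetical, and gathering the results (for example by pinging the IPoIB addresses from each host) is assumed to happen elsewhere.

```python
def find_partial_partitions(hosts, reachable):
    """Return (a, b, c) triples where a reaches both b and c,
    but b and c cannot reach each other in either direction.

    `reachable` is a set of (source, destination) pairs whose ping succeeded.
    """
    broken = []
    for a in hosts:
        others = [h for h in hosts if h != a]
        for i, b in enumerate(others):
            for c in others[i + 1:]:
                a_sees_both = (a, b) in reachable and (a, c) in reachable
                b_c_cut = (b, c) not in reachable and (c, b) not in reachable
                if a_sees_both and b_c_cut:
                    broken.append((a, b, c))
    return broken


# Hypothetical results mirroring the situation described above.
hosts = ["A", "B", "C"]
reachable = {("A", "B"), ("B", "A"), ("A", "C"), ("C", "A")}
print(find_partial_partitions(hosts, reachable))
```

A non-empty result while ibping still succeeds between the same hosts would suggest the problem sits in the IPoIB layer rather than in the underlying fabric, which matches what was observed here.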
As we rebooted the hosts that had lost connectivity, they came up fine, but then host A
could reach neither one. After a number of host reboots we soon realized that
we were playing whack-a-mole as the problem resurfaced sometimes on the
original hosts and sometimes on a new host.
In the end we rebooted every Mellanox switch. The big core switch. The half
rack switch. The top of rack switches. And sure enough, everything came back
fine without any more reboots.
At this point all I know is that the server running our master opensm rebooted
and it took a few hours before these problems started, first indicated by stale
filesystem errors across the GPFS mounts.
Obviously, rebooting every dang switch is not the correct answer here. But at
this point I don't have a better solution if it occurs again. Or even an
answer as to WHY it happened in the first place. It just seems that the IPoIB
layer was at fault here somehow in that routing was not correct across the
entire IB network.
If anyone has any insights, I'd be most appreciative. It's clear we do not
understand this aspect of the IB stack and how this layer works.
Thanks,
Bill
This reminds me of when we upgraded to SL-6.6 (approximately the same as
CentOS-6.6 and RHEL-6.6).
The new kernel we got could not handle the IPoIB we use for storage traffic,
which broke down within a few hours.
As far as I have heard, Red Hat is working on a fix. Here is a link to a
message indicating this, which I got from NSC in Linkoping:
https://www.mail-archive.com/[email protected]/msg22511.html
Best wishes,
-- Lennart Karlsson, UPPMAX, Uppsala University, Sweden
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing