I hope this is the right place to ask for help with this problem. I have an
IB fabric running on a Cisco SFS switch with a 7000D as the subnet manager.
The whole thing has been running great for well over a year, but today I
noticed that after any node is rebooted, its IB link never comes up fully
(the port stays in the Initializing state). This has happened on 4 hosts so
far. Here is what I see on one of them:
[r...@compute-2-7 ~]# ibstat
CA 'mthca0'
	CA type: MT25204
	Number of ports: 1
	Firmware version: 1.2.917
	Hardware version: 20
	Node GUID: 0x0005ad00000c0990
	System image GUID: 0x0005ad000100d050
	Port 1:
		State: Initializing
		Physical state: LinkUp
		Rate: 20
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x02510a68
		Port GUID: 0x0005ad00000c0991
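
To check for this symptom on several nodes without reading each ibstat dump
by eye, the output can be parsed mechanically. What follows is only a minimal
Python sketch under stated assumptions, not a tested tool: the names
parse_ibstat and stuck_ports are my own, and it assumes the "Key: value"
layout shown above. It flags ports whose physical link is up but which are
still in the Initializing state with LID 0, which is what a port looks like
when no subnet manager has configured it yet.

```python
import re

# Sample text copied from the ibstat output on the non-working node.
SAMPLE = """\
CA 'mthca0'
CA type: MT25204
Number of ports: 1
Firmware version: 1.2.917
Hardware version: 20
Node GUID: 0x0005ad00000c0990
System image GUID: 0x0005ad000100d050
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 20
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510a68
Port GUID: 0x0005ad00000c0991
"""


def parse_ibstat(text):
    """Return one dict per port, keyed by the labels ibstat prints."""
    ports = []
    ca = None
    port = None
    for raw in text.splitlines():
        line = raw.strip()  # indentation is not needed for parsing
        m = re.match(r"CA '(.+)'$", line)
        if m:
            ca = m.group(1)
            port = None
            continue
        m = re.match(r"Port (\d+):$", line)
        if m:
            port = {"CA": ca, "Port": int(m.group(1))}
            ports.append(port)
            continue
        # Only lines after a "Port N:" header belong to a port.
        if port is not None and ":" in line:
            key, _, value = line.partition(":")
            port[key.strip()] = value.strip()
    return ports


def stuck_ports(ports):
    """Ports that are physically up but not yet configured by an SM."""
    return [
        p for p in ports
        if p.get("Physical state") == "LinkUp"
        and p.get("State") == "Initializing"
        and p.get("Base lid") == "0"
    ]


if __name__ == "__main__":
    for p in stuck_ports(parse_ibstat(SAMPLE)):
        print(p["CA"], "port", p["Port"], "is waiting for a subnet manager")
```

On a real node, the text passed to parse_ibstat would come from running
ibstat and capturing its output, rather than from the SAMPLE string.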
I don't know much about subnet managers, since ours is in hardware and we've
never had to configure anything on it, but I can log in to the device and it
isn't showing any errors. On a node that hasn't been rebooted recently and is
still working, I can see what appears to be a working subnet manager:
[r...@compute-2-10 ~]# sminfo
sminfo: sm lid 2 sm guid 0x5ad00001df2a0, activity count 2146213408 priority 10 state 3 SMINFO_MASTER
The same command on a non-working node shows this:
[r...@compute-2-7 ~]# sminfo
sminfo: sm lid 0 sm guid 0x0, activity count 0 priority 0 state 2 SMINFO_STANDBY
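
The difference between the two sminfo results can also be checked in a
script. This is a rough Python sketch, assuming the one-line format that
sminfo printed above. The names parse_sminfo and sees_master_sm are
hypothetical helpers of mine, not part of any IB tool. It treats a node as
healthy only when the reported SM LID is nonzero and the state name is
SMINFO_MASTER.

```python
import re

# Both strings are copied from the sminfo output shown above.
WORKING = ("sminfo: sm lid 2 sm guid 0x5ad00001df2a0, activity count "
           "2146213408 priority 10\nstate 3 SMINFO_MASTER")
BROKEN = ("sminfo: sm lid 0 sm guid 0x0, activity count 0 priority 0 "
          "state 2 SMINFO_STANDBY")

SMINFO_RE = re.compile(
    r"sminfo: sm lid (?P<lid>\d+) sm guid (?P<guid>0x[0-9a-fA-F]+), "
    r"activity count (?P<activity>\d+) priority (?P<priority>\d+) "
    r"state (?P<state>\d+) (?P<state_name>\w+)"
)


def parse_sminfo(text):
    """Return the sminfo fields as a dict, or None if the text does not match."""
    flat = " ".join(text.split())  # undo any line wrapping in the pasted output
    m = SMINFO_RE.search(flat)
    if m is None:
        return None
    info = m.groupdict()
    for key in ("lid", "activity", "priority", "state"):
        info[key] = int(info[key])
    return info


def sees_master_sm(text):
    """True when the output reports a master SM with a nonzero LID."""
    info = parse_sminfo(text)
    return bool(info) and info["lid"] != 0 and info["state_name"] == "SMINFO_MASTER"


if __name__ == "__main__":
    print("working node sees master SM:", sees_master_sm(WORKING))
    print("rebooted node sees master SM:", sees_master_sm(BROKEN))
```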
So far I have reseated all the cables involved on both ends and moved the
cables on the switch end to new ports. None of that has made a difference,
even after reboots. I am hoping to find a node that I can take offline
tomorrow so I can actually test the cables, but since this seems to happen to
any host that reboots, it doesn't appear to be a cabling problem.
Can anybody suggest where I should go from here? Is there anything I can do
from a working or non-working host to diagnose the problem? Should I try
rebooting the subnet manager switch? Will that affect the rest of the fabric?
Thanks,
Mike Robbert