Hi Sasha,

I am hitting a problem where the user-level MAD library seems to be timing out, leaving the ports stuck in the "INIT" state because the subnet has no "Master" SM available. The system is still in this state, so any suggestions on what other debug info I could collect, or clues to what the problem might be, would be much appreciated :)

I have 3 machines (OFED 1.1 stack, OpenSM v2.0.5) connected by an IB switch; 2 of them are running OpenSM. I was running several tests that involved pulling IB cables, but without touching the IB connections between the master SM and the IB switch and without rebooting the IB switch (i.e. no SM migration should have been occurring).

Everything was working fine until, at one point, I pulled the cable connecting the lower-priority (standby) SM to the IB switch. For some reason, this started causing problems on the higher-priority master SM. The higher-priority SM now thinks it is in Standby state, and the lower-priority SM's MAD packets are timing out. This is odd because I would not expect any effect on the higher-priority SM (its IB connections were not touched), and I am not sure why MAD packets are timing out on the lower-priority SM. Rebooting the lower-priority SM and replugging its IB cables into different ports on the IB switch didn't help.

Lower-priority SM (packets time out):

[EMAIL PROTECTED] ~]# sminfo -d -e -P 1
ibwarn: [26764] smp_query: attr 21 mod 0 route DR path [0]
ibwarn: [26764] mad_rpc: data offs 64 sz 64
mad data
0000 0000 0000 0000 fe80 0000 0000 0000 0003 0001 0251 0a6a 0000 0000 0103 0302
1252 0011 4040 0008 0804 ff40 0000 0000 0000 2012 1088 0000 0000 0000 0000 0000
ibwarn: [26764] smp_query: attr 32 mod 0 route Lid 1
ibwarn: [26764] _do_madrpc: retry 1 (timeout 1000 ms)
ibwarn: [26764] _do_madrpc: retry 2 (timeout 1000 ms)
ibwarn: [26764] _do_madrpc: timeout after 3 retries, 3000 ms
sminfo: iberror: [pid 26764] main: failed: query

Higher-priority SM (thinks it is Standby now):

[EMAIL PROTECTED] log]# sminfo -d -e -P 1
ibwarn: [2487] smp_query: attr 21 mod 0 route DR path [0]
ibwarn: [2487] mad_rpc: data offs 64 sz 64
mad data
0000 0000 0000 0000 fe80 0000 0000 0000 0002 0003 0251 0a6a 0000 0000 0103 0302
1252 0011 4040 0008 0804 ff40 0000 0000 0000 2012 1088 0000 0000 0000 0000 0000
ibwarn: [2487] smp_query: attr 32 mod 0 route Lid 3
ibwarn: [2487] mad_rpc: data offs 64 sz 64
mad data
0050 4501 4a3a 0001 0000 0000 0000 0000 0000 020e 0200 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
sminfo: sm lid 3 sm guid 0x5045014a3a0001, activity count 526 priority 0 state 2 SMINFO_STANDBY

One more data point: each machine has 2 HCA ports, port 1 and port 2, and port 1 is connected to a different subnet than port 2. During all these steps, the port 2 subnet has stayed fine and working OK. The problem described above is seen on the port 1 subnet only.

Thanks!
Lan
