[ofa-general] OpenSM "stuck" - user level MAD library seems to be timing out

lbt Thu, 02 Aug 2007 12:33:43 -0700

Hi Sasha,

I am hitting a problem where the user-level MAD library seems to be timing
out, leaving the ports stuck in "INIT" state because the subnet has no
"Master" SM available. The system is still in this state, so any
suggestions on what other debug info I could collect, or clues to
what the problem might be, would be much appreciated :)


I have 3 machines (OFED 1.1 stack, OpenSM v2.0.5) connected by an IB
switch; 2 of them are running OpenSM. I was running several tests that
involved pulling IB cables, but without touching the IB connection between
the Master SM and the IB switch, and without rebooting the IB switch (i.e.
no SM migration should be occurring). Everything was working fine until, at
one point, I pulled the IB cable of the lower priority (standby) SM at the
IB switch. For some reason, this started causing problems on the higher
priority Master SM. The higher priority SM now thinks it is in Standby
state, and the lower priority SM's MAD packets are timing out. This is odd
because I would not expect any effect on the higher priority SM (its IB
connections were not affected), and I am not sure why MAD packets are
timing out on the lower priority SM. Rebooting the lower priority SM and
replugging IB cables into different ports on the IB switch didn't help.

Lower priority SM: (packets time out)
[EMAIL PROTECTED] ~]# sminfo -d -e -P 1
ibwarn: [26764] smp_query: attr 21 mod 0 route DR path [0]
ibwarn: [26764] mad_rpc: data offs 64 sz 64
mad data
0000 0000 0000 0000 fe80 0000 0000 0000
0003 0001 0251 0a6a 0000 0000 0103 0302
1252 0011 4040 0008 0804 ff40 0000 0000
0000 2012 1088 0000 0000 0000 0000 0000
ibwarn: [26764] smp_query: attr 32 mod 0 route Lid 1
ibwarn: [26764] _do_madrpc: retry 1 (timeout 1000 ms)
ibwarn: [26764] _do_madrpc: retry 2 (timeout 1000 ms)
ibwarn: [26764] _do_madrpc: timeout after 3 retries, 3000 ms
sminfo: iberror: [pid 26764] main: failed: query
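
For anyone reading the hex dump, here is a minimal sketch that decodes the
first "mad data" block (attr 21, i.e. PortInfo) shown above. The byte
offsets are an assumption taken from the PortInfo attribute layout in the
InfiniBand specification, and the helper names are hypothetical; this is
not code from sminfo or libibmad.

```python
import struct

# First "mad data" block from the lower priority SM (attr 21 = PortInfo).
PORTINFO_DUMP = """
0000 0000 0000 0000 fe80 0000 0000 0000
0003 0001 0251 0a6a 0000 0000 0103 0302
1252 0011 4040 0008 0804 ff40 0000 0000
0000 2012 1088 0000 0000 0000 0000 0000
"""

# State encodings as defined for PortInfo:PortState / PortPhysicalState.
PORT_STATES = {1: "Down", 2: "Init", 3: "Armed", 4: "Active"}
PHYS_STATES = {1: "Sleep", 2: "Polling", 3: "Disabled",
               4: "PortConfigurationTraining", 5: "LinkUp",
               6: "LinkErrorRecovery"}


def parse_dump(text):
    """Convert whitespace-separated 16-bit hex words into raw bytes."""
    return bytes.fromhex("".join(text.split()))


def decode_portinfo(raw):
    """Decode a few PortInfo fields (offsets assumed from the IB spec)."""
    # Bytes 8-15: GID prefix, 16-17: LID, 18-19: Master SM LID (big-endian).
    gid_prefix, lid, sm_lid = struct.unpack_from(">QHH", raw, 8)
    return {
        "gid_prefix": hex(gid_prefix),
        "lid": lid,
        "sm_lid": sm_lid,
        # Low nibble of byte 32 is the logical port state.
        "port_state": PORT_STATES.get(raw[32] & 0x0F, "Unknown"),
        # High nibble of byte 33 is the physical port state.
        "phys_state": PHYS_STATES.get(raw[33] >> 4, "Unknown"),
    }


portinfo = decode_portinfo(parse_dump(PORTINFO_DUMP))
print(portinfo)
```

Under those assumptions the dump decodes to LID 3, SM LID 1, logical state
Init and physical state LinkUp, which is consistent with the follow-up
query being routed to "Lid 1" and with the port being stuck in INIT.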

Higher priority SM: (thinks it's Standby now)
[EMAIL PROTECTED] log]# sminfo -d -e -P 1
ibwarn: [2487] smp_query: attr 21 mod 0 route DR path [0]
ibwarn: [2487] mad_rpc: data offs 64 sz 64
mad data
0000 0000 0000 0000 fe80 0000 0000 0000
0002 0003 0251 0a6a 0000 0000 0103 0302
1252 0011 4040 0008 0804 ff40 0000 0000
0000 2012 1088 0000 0000 0000 0000 0000
ibwarn: [2487] smp_query: attr 32 mod 0 route Lid 3
ibwarn: [2487] mad_rpc: data offs 64 sz 64
mad data
0050 4501 4a3a 0001 0000 0000 0000 0000
0000 020e 0200 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
sminfo: sm lid 3 sm guid 0x5045014a3a0001, activity count 526 priority 0
state 2 SMINFO_STANDBY
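
As a cross-check, the second "mad data" block above (attr 32, i.e. SMInfo)
can be decoded by hand. The sketch below assumes the SMInfo layout from the
InfiniBand specification (GUID, SM_Key, ActCount, then a byte holding
Priority and SMState); the function name is hypothetical and this is not
the sminfo source.

```python
import struct

# Second "mad data" block from the higher priority SM (attr 32 = SMInfo).
SMINFO_DUMP = """
0050 4501 4a3a 0001 0000 0000 0000 0000
0000 020e 0200 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
"""

# SMState encodings as defined for the SMInfo attribute.
SM_STATES = {0: "NotActive", 1: "Discovering", 2: "Standby", 3: "Master"}


def decode_sminfo(raw):
    """Decode SMInfo fields (layout assumed from the IB spec)."""
    # Bytes 0-7: GUID, 8-15: SM_Key, 16-19: ActCount, 20: Priority|SMState.
    guid, sm_key, act_count, prio_state = struct.unpack_from(">QQIB", raw, 0)
    return {
        "guid": hex(guid),
        "sm_key": sm_key,
        "activity_count": act_count,
        "priority": prio_state >> 4,      # high nibble
        "state": prio_state & 0x0F,       # low nibble
        "state_name": SM_STATES.get(prio_state & 0x0F, "Unknown"),
    }


raw = bytes.fromhex("".join(SMINFO_DUMP.split()))
sminfo = decode_sminfo(raw)
print(sminfo)
```

With that layout the decoded values agree with the summary line printed by
sminfo: guid 0x5045014a3a0001, activity count 526, priority 0, state 2
(Standby).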

Just another data point: each machine has 2 HCA ports, port 1 and port 2.
Port 1 is connected to a different subnet than port 2. During all these
steps, the port 2 subnet remained fine and working OK. The problem
described above was seen on the port 1 subnet only.

Thanks!
Lan

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
