I've been able to duplicate a nasty problem that took
down our system last week. I believe it's caused by
IPoIB ARP multicast traffic. The 4036 switch subscribes
to the IPoIB multicast group whether the SM is running
or disabled. When the switch goes belly up, it can't be
programmed by the SM:

ERR 3113: MAD completed in error (IB_TIMEOUT): SubnSet(SwitchInfo),
attr_mod 0x0, TID 0x22322

smpquery works (nodeinfo/switchinfo), but perfquery fails with:
ibwarn: [26748] _do_madrpc: recv failed: Connection timed out

I can log into the switch, but wouldn't know what to look for there.

Voltaire support case number 
 US Case 00018085: Re: NASA Issue [[ ref:00D38IO.5008BPc6T:ref]]

Maybe someone on the Voltaire side can help.

I'm working the issue now.
Wed Jul 21 00:34:14 PDT 2010



