Hi Hal,

Hal Rosenstock wrote:


I am sending only requests with

        rpc.mgtclass = IB_PERFORMANCE_CLASS;
        rpc.method = IB_MAD_METHOD_GET;

once every second.


Does perfquery work reliably with the same node(s) you are having
trouble with ?

Does your app follow what perfquery does ?

Yes, perfquery works fine, and my implementation follows a similar approach. Here is the output. I think the difference is in the load: I am sending 4 GS requests per second; some succeed and some fail with a timeout (110) or a receive failure.

# perfquery
# Port counters: Lid 393 port 1
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrors:....................0
LinkRecovers:....................0
LinkDowned:......................0
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................0
XmtDiscards:.....................0
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtData:.........................65899728
RcvData:.........................65899656
XmtPkts:.........................915274
RcvPkts:.........................915273


In general, there are a few possibilities (which can cause this). SM
traffic is VL15 whereas GS traffic is on a data VL (usually VL0 in most
subnets).

Some possibilities are:
1. Timeout/retry being hit for some GS traffic (GS request or response
lost/corrupted)

Yes, this is also happening. Sometimes I get corrupt data back.


Is there an error indicated ?

For such packets I get a umad_status of 110.


That's ETIMEDOUT. You need to handle the errors (and not treat the
receive as a valid packet). Are you doing that ?

Yes, I am catching this error.
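
A minimal sketch of the kind of handling discussed here: retry only on ETIMEDOUT, and treat the response buffer as valid only when the status is 0. This is not the actual application code. The `query_fn` callback, `query_with_retry`, and the `fake_query` stand-in are hypothetical names for illustration; a real callback would wrap the send/receive pair and return the status reported for the received MAD.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback: performs one request/response exchange.
 * Returns 0 on success or a positive errno-style status (e.g. ETIMEDOUT). */
typedef int (*query_fn)(void *ctx, unsigned char *resp, size_t len);

/* Retry only on ETIMEDOUT, up to max_tries attempts.
 * Returns 0 on success, otherwise the last status seen.
 * The contents of resp are meaningful only when 0 is returned. */
static int query_with_retry(query_fn fn, void *ctx, unsigned char *resp,
                            size_t len, int max_tries, int *tries_used)
{
    int status = EINVAL;
    int i;

    assert(fn != NULL);
    if (tries_used)
        *tries_used = 0;
    if (max_tries < 1)
        return EINVAL;

    for (i = 0; i < max_tries; i++) {
        status = fn(ctx, resp, len);
        if (tries_used)
            *tries_used = i + 1;
        if (status != ETIMEDOUT)
            break;              /* success or a non-retryable error */
    }
    return status;
}

/* Demo stand-in for a real exchange (assumption, for illustration only):
 * times out fail_first times, then succeeds and fills the first byte. */
struct fake_port { int fail_first; int calls; };

static int fake_query(void *ctx, unsigned char *resp, size_t len)
{
    struct fake_port *p = ctx;

    if (p->calls++ < p->fail_first)
        return ETIMEDOUT;
    if (len > 0)
        resp[0] = 0xAB;
    return 0;
}
```

Retrying hides the symptom but not the cause, so counting how many attempts each request needed is still useful when looking for the source of the timeouts.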


The underlying question is why are you getting the timeout relatively
frequently so I recommend checking all the error counters along the
path.

# Checking Ca: nodeguid 0x00144fa5e9ce001c
Node check lid 392:  OK
Error check on lid 392 (HCA-1) port all:  OK
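
The same kind of check can be scripted against perfquery-style output for each port along the path. A minimal sketch, assuming the `Name:....value` line layout shown in the perfquery output earlier in this message; `parse_counter_line` and `is_nonzero_error` are hypothetical helper names, and the list of non-error fields is taken only from that output.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse one "Name:.....value" line.
 * Returns 1 and fills name/value on success, 0 for comments or bad lines. */
static int parse_counter_line(const char *line, char *name, size_t name_sz,
                              unsigned long long *value)
{
    const char *colon, *p;
    char *end;
    size_t n;

    assert(line != NULL && name != NULL && value != NULL);
    if (line[0] == '#')
        return 0;                       /* header/comment line */
    colon = strchr(line, ':');
    if (!colon)
        return 0;
    n = (size_t)(colon - line);
    if (n == 0 || n >= name_sz)
        return 0;
    p = colon + 1;
    while (*p == '.')                   /* skip the dot padding */
        p++;
    *value = strtoull(p, &end, 0);      /* base 0 accepts 0x... and decimal */
    if (end == p)
        return 0;
    memcpy(name, line, n);
    name[n] = '\0';
    return 1;
}

/* Returns 1 if the field is a non-zero counter that is not a selector
 * or a plain traffic counter. */
static int is_nonzero_error(const char *name, unsigned long long value)
{
    static const char *const ignore[] = {
        "PortSelect", "CounterSelect", "XmtData", "RcvData",
        "XmtPkts", "RcvPkts"
    };
    size_t i;

    if (value == 0)
        return 0;
    for (i = 0; i < sizeof ignore / sizeof ignore[0]; i++)
        if (strcmp(name, ignore[i]) == 0)
            return 0;
    return 1;
}
```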


Are you sure the request gets to the responder ? Does the responder
respond and it doesn't make it back ?

Yes. As I said, it is not a 100% failure; the failure rate is 30% to 40%. But why?
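
To narrow this down, it may help to tally outcomes per status so that timeouts can be separated from other receive failures. A minimal sketch; `struct gs_stats` and its helpers are hypothetical names, and the status passed in is assumed to be 0 on success or an errno-style value such as ETIMEDOUT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Running tally of request outcomes. */
struct gs_stats {
    unsigned long ok;
    unsigned long timed_out;
    unsigned long other;
};

/* Record one outcome: 0 is success, ETIMEDOUT a timeout, anything else "other". */
static void gs_stats_record(struct gs_stats *s, int status)
{
    assert(s != NULL);
    if (status == 0)
        s->ok++;
    else if (status == ETIMEDOUT)
        s->timed_out++;
    else
        s->other++;
}

/* Failure percentage over everything recorded so far (0 if nothing recorded). */
static double gs_stats_failure_pct(const struct gs_stats *s)
{
    unsigned long total;

    assert(s != NULL);
    total = s->ok + s->timed_out + s->other;
    if (total == 0)
        return 0.0;
    return 100.0 * (double)(s->timed_out + s->other) / (double)total;
}
```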


-- Hal

