Hi Sumit,

On Thu, 2008-06-26 at 12:07 +0530, Sumit Gaur - Sun Microsystem wrote:
> Hi Hal,
>
> Hal Rosenstock wrote:
> >
> >>>>
> >>>>I am sending only request for
> >>>>
> >>>>   rpc.mgtclass = IB_PERFORMANCE_CLASS;
> >>>>   rpc.method = IB_MAD_METHOD_GET;
> >>>>
> >>>>at every one second.
> >
> >
> > Does perfquery work reliably with the same node(s) you are having
> > trouble with ?
> >
> > Does your app follow what perfquery does ?
>
> Yes, perfquery works fine and I am following similar way of implementation.
> Here is the output. I think difference is there in Load. I am sending 4 GS
> request per second basis and some got passed and some got timeout(110) or
> recv failed.

Can you elaborate on the multiple sends ? Are they outstanding
concurrently ? Are they to the same destination or different ones ? Are
they from a single or multiple threads ?

> # perfquery
> # Port counters: Lid 393 port 1
> PortSelect:......................1
> CounterSelect:...................0x0000
> SymbolErrors:....................0
> LinkRecovers:....................0
> LinkDowned:......................0
> RcvErrors:.......................0
> RcvRemotePhysErrors:.............0
> RcvSwRelayErrors:................0
> XmtDiscards:.....................0
> XmtConstraintErrors:.............0
> RcvConstraintErrors:.............0
> LinkIntegrityErrors:.............0
> ExcBufOverrunErrors:.............0
> VL15Dropped:.....................0
> XmtData:.........................65899728
> RcvData:.........................65899656
> XmtPkts:.........................915274
> RcvPkts:.........................915273
>
> >
> >>>>>In general, there are a few possibilities (which can cause this). SM
> >>>>>traffic is VL15 whereas GS traffic is on a data VL (usually VL0 in most
> >>>>>subnets).
> >>>>>
> >>>>>Some possibilities are:
> >>>>>1. Timeout/retry being hit for some GS traffic (GS request or response
> >>>>>lost/corrupted)
> >>>>
> >>>>Yes, this is also happening, Sometimes I am getting corrupt data back,
> >>>
> >>>
> >>>Is there an error indicated ?
> >>
> >>For such packets I am getting umad_status as 110.
> >
> >
> > That's ETIMEDOUT. You need to handle the errors (and not treat the
> > receive as a valid packet). Are you doing that ?
>
> yes, I am catching this error.

OK but you had said the received packet was corrupted. Maybe a nit, but
with timeout and other errors, the receive packet is invalid rather than
corrupted (an app shouldn't be looking at the response in the error
cases).

> > The underlying question is why are you getting the timeout relatively
> > frequently so I recommend checking all the error counters along the
> > path.
>
> # Checking Ca: nodeguid 0x00144fa5e9ce001c
> Node check lid 392: OK
> Error check on lid 392 (HCA-1) port all: OK

Is that the requester or responder ? It's not the entire path. Maybe the
simplest thing is: what does ibchecknet or ibcheckerrors say ?

In any case, based on your comments above about perfquery working
reliably, I'm skeptical whether this is the issue but it's best to rule
it out.

> > Are you sure the request gets to the responder ? Does the responder
> > respond and it doesn't make it back ?
>
> yes As I told It is not 100% failure, It is 30% to 40% failure. But Why ?

I don't know enough about what is different about your app yet to say
more right now.

-- Hal

> > -- Hal
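
For illustration only, the point about not treating a receive as valid in the error cases could be sketched in C roughly as below. This is not taken from perfquery or from the application under discussion. The names gs_result, poll_stats, classify_response and account are hypothetical, and the sketch assumes the status value is whatever umad_status() reports for the received buffer in a libibumad application.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical outcome of one GS request/response exchange. */
enum gs_result { GS_OK, GS_TIMEOUT, GS_ERROR };

/* Hypothetical running totals, to see how often requests really fail. */
struct poll_stats {
    unsigned ok;
    unsigned timeouts;
    unsigned errors;
};

/*
 * recv_rc: return value of the receive call (negative: the call itself failed).
 * status:  per-MAD status; in a libibumad app this is assumed to come from
 *          umad_status() on the received buffer.
 * Only GS_OK means the response buffer may be parsed.
 */
static enum gs_result classify_response(int recv_rc, int status)
{
    if (recv_rc < 0)
        return GS_ERROR;        /* "recv failed": nothing valid was received */
    if (status == ETIMEDOUT)
        return GS_TIMEOUT;      /* 110 on Linux: no response arrived in time */
    if (status != 0)
        return GS_ERROR;        /* any other non-zero status is also invalid */
    return GS_OK;
}

/* Record the outcome; returns 1 only when the caller may read the response. */
static int account(struct poll_stats *s, enum gs_result r)
{
    assert(s != NULL);
    switch (r) {
    case GS_OK:      s->ok++;       return 1;
    case GS_TIMEOUT: s->timeouts++; return 0;
    default:         s->errors++;   return 0;
    }
}
```

A caller would look at the returned counters only when account() returns 1, and would retry or log otherwise. Keeping timeouts separate from other errors is a design choice made here so that a failure rate like the 30% to 40% mentioned above can be broken down by cause.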
