On Fri, 2008-06-27 at 17:01 +0530, Sumit Gaur - Sun Microsystem wrote:
> Find my answers below:-
>
> Hal Rosenstock wrote:
> > Hi Sumit,
> >
> > On Thu, 2008-06-26 at 12:07 +0530, Sumit Gaur - Sun Microsystem wrote:
> >
> >>Hi Hal,
> >>
> >>Hal Rosenstock wrote:
> >>
> >>>>>>I am sending only request for
> >>>>>>
> >>>>>>    rpc.mgtclass = IB_PERFORMANCE_CLASS;
> >>>>>>    rpc.method = IB_MAD_METHOD_GET;
> >>>>>>
> >>>>>>at every one second.
> >>>
> >>>Does perfquery work reliably with the same node(s) you are having
> >>>trouble with ?
> >>>
> >>>Does your app follow what perfquery does ?
> >>
> >>Yes, perfquery works fine and I am following similar way of
> >>implementation. Here is the output. I think difference is there in
> >>Load. I am sending 4 GS request per second basis and some got passed
> >>and some got timeout(110) or recv failed.
> >
> > Can you elaborate on the multiple sends ? Are they outstanding
> > concurrently ? Are they to the same destination or different ones ? Are
> > they from a single or multiple threads ?
>
> No they are sending sequentially(mutex enabled) no concurrency but
> timeout for umad_recv is 100ms.

Can you try increasing that to see if there is some threshold where it
works more reliably ? Does it work better at say 200 msec (as you said
your rate was 4/sec) ? The default timeout used in the diags is 1 sec.

BTW, this could explain the timeouts but I'm not sure about the other
errors you mentioned.

> Yes they are for same destination. They all are from single threads.
> I still point out same I configure for SMP and no failure.
>
> >
> >># perfquery
> >># Port counters: Lid 393 port 1
> >>PortSelect:......................1
> >>CounterSelect:...................0x0000
> >>SymbolErrors:....................0
> >>LinkRecovers:....................0
> >>LinkDowned:......................0
> >>RcvErrors:.......................0
> >>RcvRemotePhysErrors:.............0
> >>RcvSwRelayErrors:................0
> >>XmtDiscards:.....................0
> >>XmtConstraintErrors:.............0
> >>RcvConstraintErrors:.............0
> >>LinkIntegrityErrors:.............0
> >>ExcBufOverrunErrors:.............0
> >>VL15Dropped:.....................0
> >>XmtData:.........................65899728
> >>RcvData:.........................65899656
> >>XmtPkts:.........................915274
> >>RcvPkts:.........................915273
> >>
> >
> > OK but you had said the received packet was corrupted. Maybe a nit, but
> > with timeout and other errors, the receive packet is invalid rather than
> > corrupted (an app shouldn't be looking at the response in the error
> > cases).
> >
> >>>The underlying question is why are you getting the timeout relatively
> >>>frequently so I recommend checking all the error counters along the
> >>>path.
> >>
> >># Checking Ca: nodeguid 0x00144fa5e9ce001c
> >>Node check lid 392: OK
> >>Error check on lid 392 (HCA-1) port all: OK
> >
> > Is that the requester or responder ? It's not the entire path. Maybe the
> > simplest thing is: what does ibchecknet or ibcheckerrors say ?
> >
>
> I am using lid 106
>
> [EMAIL PROTECTED] tmp]# ibchecknet
> #warn: counter SymbolErrors = 65535 (threshold 10)
> Error check on lid 106 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 65535 (threshold 10)
> #warn: counter LinkRecovers = 18 (threshold 10)
> Error check on lid 4 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 65535 (threshold 10)
> #warn: counter LinkRecovers = 19 (threshold 10)
> Error check on lid 9 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 65535 (threshold 10)
> #warn: counter LinkRecovers = 26 (threshold 10)
> #warn: counter LinkDowned = 13 (threshold 10)
> #warn: counter RcvErrors = 27 (threshold 10)
> Error check on lid 10 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 65535 (threshold 10)
> #warn: counter LinkRecovers = 255 (threshold 10)
> Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 1968 (threshold 10)
> Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 1967 (threshold 10)
> Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port 15: FAILED
> # Checked Switch: nodeguid 0x00144f0000a61390 with failure
> #warn: counter SymbolErrors = 65535 (threshold 10)
> Error check on lid 7 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 65535 (threshold 10)
> #warn: counter LinkDowned = 12 (threshold 10)
> Error check on lid 3 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 65535 (threshold 10)
> #warn: counter LinkRecovers = 15 (threshold 10)
> #warn: counter LinkDowned = 12 (threshold 10)
> Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 65535 (threshold 10)
> #warn: counter LinkRecovers = 15 (threshold 10)
> #warn: counter LinkDowned = 12 (threshold 10)
> Error check on lid 11 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 65535 (threshold 10)
> Error check on lid 6 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 65535 (threshold 10)
> #warn: counter LinkRecovers = 255 (threshold 10)
> #warn: counter RcvErrors = 445 (threshold 10)
> Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: Logical link state is Initialize
> Port check lid 12 port 15: FAILED
> # Checked Switch: nodeguid 0x00144f0000a61397 with failure
> #warn: Logical link state is Initialize
> Port check lid 12 port 14: FAILED
> #warn: Logical link state is Initialize
> Port check lid 12 port 13: FAILED
> #warn: counter LinkRecovers = 11 (threshold 10)
> Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port 13: FAILED
>
> # Checking Ca: nodeguid 0x00144fa5e9ce001c
>
> # Checking Ca: nodeguid 0x00144fa5e9ce000c
>
> # Checking Ca: nodeguid 0x00144fa5e9ce0014
>
> # Checking Ca: nodeguid 0x00144fa5e9ce0004
>
> ## Summary: 29 nodes checked, 0 bad nodes found
> ##          359 ports checked, 3 bad ports found
> ##          2 ports have errors beyond threshold
>
> > In any case, based on your comments above about perfquery working
> > reliably, I'm skeptical whether this is the issue but it's best to rule
> > it out.
> [EMAIL PROTECTED] tmp]# ibcheckerrors
> #warn: counter SymbolErrors = 65535 (threshold 10)
> Error check on lid 106 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 65535 (threshold 10)
> #warn: counter LinkRecovers = 18 (threshold 10)
> Error check on lid 4 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 65535 (threshold 10)
> #warn: counter LinkRecovers = 19 (threshold 10)
> Error check on lid 9 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 65535 (threshold 10)
> #warn: counter LinkRecovers = 26 (threshold 10)
> #warn: counter LinkDowned = 13 (threshold 10)
> #warn: counter RcvErrors = 27 (threshold 10)
> Error check on lid 10 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 65535 (threshold 10)
> #warn: counter LinkRecovers = 255 (threshold 10)
> Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 2081 (threshold 10)
> Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 2080 (threshold 10)
> Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port 15: FAILED
> # Checked Switch: nodeguid 0x00144f0000a61390 with failure
> #warn: counter SymbolErrors = 65535 (threshold 10)
> Error check on lid 7 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 65535 (threshold 10)
> #warn: counter LinkDowned = 12 (threshold 10)
> Error check on lid 3 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 65535 (threshold 10)
> #warn: counter LinkRecovers = 15 (threshold 10)
> #warn: counter LinkDowned = 12 (threshold 10)
> Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 65535 (threshold 10)
> #warn: counter LinkRecovers = 15 (threshold 10)
> #warn: counter LinkDowned = 12 (threshold 10)
> Error check on lid 11 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 65535 (threshold 10)
> Error check on lid 6 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter SymbolErrors = 65535 (threshold 10)
> #warn: counter LinkRecovers = 255 (threshold 10)
> #warn: counter RcvErrors = 445 (threshold 10)
> Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
> #warn: counter LinkRecovers = 11 (threshold 10)
> Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port 13: FAILED
> # Checked Switch: nodeguid 0x00144f0000a61397 with failure
>
> ## Summary: 29 nodes checked, 0 bad nodes found
> ##          359 ports checked, 2 ports have errors beyond threshold

Looks like there are some issues here to debug in your subnet. It might
help to clear the counters and see what is actively going on to isolate
these issues. This could factor into those other errors you are seeing.

-- Hal

> >>>Are you sure the request gets to the responder ? Does the responder
> >>>respond and it doesn't make it back ?
> >>
> >>yes As I told It is not 100% failure, It is 30% to 40% failure. But Why ?
> >
> > I don't know enough about what is different about your app yet to say
> > more right now.
> >
> > -- Hal
> >
> >>>-- Hal
> >>>
> >
>
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
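
For reference, the timeout experiment suggested above (raise the umad_recv timeout from 100 ms and look for a threshold where the failures stop) could be scripted roughly as follows. This is only a sketch: `sweep_timeouts` and `make_simulated_query` are hypothetical names, and the latencies fed to the simulated query are made-up values for illustration, not measurements from this fabric. In a real test the `query` callable would wrap the application's own send/receive path and return True when a valid response arrives within the given timeout.

```python
import itertools
from typing import Callable, Dict, Iterable, Sequence


def sweep_timeouts(query: Callable[[int], bool],
                   timeouts_ms: Iterable[int],
                   attempts: int = 20) -> Dict[int, float]:
    """Return the observed failure rate for each receive timeout (in ms).

    `query(timeout_ms)` is a caller-supplied function (hypothetical here)
    that issues one request and returns True if a response was received
    before the timeout expired.
    """
    rates = {}
    for timeout_ms in timeouts_ms:
        failures = sum(1 for _ in range(attempts) if not query(timeout_ms))
        rates[timeout_ms] = failures / attempts
    return rates


def make_simulated_query(latencies_ms: Sequence[int]) -> Callable[[int], bool]:
    """Stand-in for a real query, for demonstration only.

    Each call takes the next latency from the (invented) list and reports
    success if that latency fits inside the timeout.
    """
    latency = itertools.cycle(latencies_ms)

    def query(timeout_ms: int) -> bool:
        return next(latency) <= timeout_ms

    return query


if __name__ == "__main__":
    # Invented response latencies; replace the simulated query with the
    # application's real request path before drawing any conclusions.
    simulated = make_simulated_query([50, 80, 150, 300, 90])
    for timeout_ms, rate in sweep_timeouts(simulated, [100, 200, 1000], 5).items():
        print(f"timeout {timeout_ms} ms: {rate:.0%} failed")
```

If the failure rate falls as the timeout grows, the timeouts are most likely just late responses; if it stays flat, the cause is elsewhere.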

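
As a footnote to the advice about seeing what is actively going on: the two runs quoted above already hint at it, since SymbolErrors on lid 5 went from 1968 (ibchecknet) to 2081 (ibcheckerrors), while most other counters sit at 65535 or 255 and cannot show further activity until they are cleared. Below is a rough sketch of comparing two such snapshots. The function names are hypothetical, the parser only understands the "#warn: counter ..." and "Error check on lid ..." lines in the format shown in this thread, and the counter widths are my understanding of the PortCounters attribute, so treat them as an assumption.

```python
import re

WARN_RE = re.compile(r"#warn: counter (\w+) = (\d+) \(threshold \d+\)")
CHECK_RE = re.compile(r"Error check on lid (\d+) \(.*\) port (\w+):")

# Assumed maximum values (16-bit and 8-bit fields); a counter at its maximum
# is pegged and cannot show new errors until it is cleared.
COUNTER_MAX = {"SymbolErrors": 65535, "RcvErrors": 65535,
               "LinkRecovers": 255, "LinkDowned": 255}


def parse_counters(text):
    """Map (lid, port) to {counter: value} from ibcheckerrors-style output."""
    counters, pending = {}, {}
    for line in text.splitlines():
        warn = WARN_RE.search(line)
        if warn:
            # Warnings are printed before the "Error check" line they belong to.
            pending[warn.group(1)] = int(warn.group(2))
            continue
        check = CHECK_RE.search(line)
        if check:
            counters[(int(check.group(1)), check.group(2))] = pending
            pending = {}
    return counters


def diff_counters(before, after):
    """Report counters that grew between two snapshots, or are saturated."""
    report = {}
    for (lid, port), values in after.items():
        old = before.get((lid, port), {})
        for name, value in values.items():
            if value >= COUNTER_MAX.get(name, float("inf")):
                report[(lid, port, name)] = "saturated"
            elif name in old and value > old[name]:
                report[(lid, port, name)] = value - old[name]
    return report


# Excerpts from the ibchecknet and ibcheckerrors output quoted in this thread.
FIRST_RUN = """\
#warn: counter SymbolErrors = 65535 (threshold 10)
#warn: counter LinkRecovers = 255 (threshold 10)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 1968 (threshold 10)
Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 1967 (threshold 10)
Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port 15: FAILED
"""
SECOND_RUN = FIRST_RUN.replace("1968", "2081").replace("1967", "2080")

if __name__ == "__main__":
    changes = diff_counters(parse_counters(FIRST_RUN), parse_counters(SECOND_RUN))
    for key, change in sorted(changes.items()):
        print(key, change)
```

On these excerpts the only counter that visibly moves is SymbolErrors on lid 5 (port 15), which would make that link a reasonable first place to look once the counters have been cleared.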