Please find my answers below:

Hal Rosenstock wrote:
Hi Sumit,

On Thu, 2008-06-26 at 12:07 +0530, Sumit Gaur - Sun Microsystem wrote:

Hi Hal,

Hal Rosenstock wrote:

I am only sending requests for

        rpc.mgtclass = IB_PERFORMANCE_CLASS;
        rpc.method = IB_MAD_METHOD_GET;

once every second.


Does perfquery work reliably with the same node(s) you are having
trouble with ?

Does your app follow what perfquery does ?

Yes, perfquery works fine, and my implementation follows a similar approach. Here is the output. I think the difference is the load: I am sending 4 GS requests per second; some succeed and some fail with a timeout (110) or a failed recv.


Can you elaborate on the multiple sends ? Are they outstanding
concurrently ? Are they to the same destination or different ones ? Are
they from a single or multiple threads ?
No, they are sent sequentially (protected by a mutex), so there is no concurrency, but the timeout for umad_recv is 100 ms. Yes, they all go to the same destination, and they all come from a single thread. I would point out again that I use the same configuration for SMPs and see no failures there.


# perfquery
# Port counters: Lid 393 port 1
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrors:....................0
LinkRecovers:....................0
LinkDowned:......................0
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................0
XmtDiscards:.....................0
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtData:.........................65899728
RcvData:.........................65899656
XmtPkts:.........................915274
RcvPkts:.........................915273






OK but you had said the received packet was corrupted. Maybe a nit, but
with timeout and other errors, the receive packet is invalid rather than
corrupted (an app shouldn't be looking at the response in the error
cases).
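
To illustrate the point about not looking at the response in the error cases, here is a minimal Python sketch. It is not the application's code: `send_and_recv` is a hypothetical stand-in for the umad send/receive pair, the 100 ms default mirrors the umad_recv timeout mentioned above, and 110 is the timeout status reported in this thread. The loop discards the buffer whenever the status is non-zero and tallies the outcomes so the failure rate can be measured.

```python
from collections import Counter


def poll(send_and_recv, requests, timeout_ms=100):
    """Issue `requests` sequential queries and keep only valid responses.

    send_and_recv is a hypothetical callable returning (status, payload);
    status 0 means the payload is valid, anything else means it is not.
    """
    outcomes = Counter()
    responses = []
    for _ in range(requests):
        status, payload = send_and_recv(timeout_ms)
        outcomes[status] += 1
        if status != 0:
            # Timeout or receive failure: the buffer contents are invalid,
            # so they are dropped without being parsed.
            continue
        responses.append(payload)
    return responses, outcomes


def make_fake_transport(statuses):
    """Build a fake transport that replays a fixed list of statuses."""
    replay = iter(statuses)

    def send_and_recv(timeout_ms):
        status = next(replay)
        return status, (b"counters" if status == 0 else b"stale bytes")

    return send_and_recv


responses, outcomes = poll(make_fake_transport([0, 110, 0, 110, 0]), requests=5)
print(len(responses), dict(outcomes))
```

Keeping a per-status tally like this also makes it easy to state the failure rate precisely instead of estimating it.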


The underlying question is why are you getting the timeout relatively
frequently so I recommend checking all the error counters along the
path.

# Checking Ca: nodeguid 0x00144fa5e9ce001c
Node check lid 392:  OK
Error check on lid 392 (HCA-1) port all:  OK


Is that the requester or responder ? It's not the entire path. Maybe the
simplest thing is: what does ibchecknet or ibcheckerrors say ?


I am using lid 106

[EMAIL PROTECTED] tmp]# ibchecknet
#warn: counter SymbolErrors = 65535     (threshold 10)
Error check on lid 106 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 18        (threshold 10)
Error check on lid 4 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 19        (threshold 10)
Error check on lid 9 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 26        (threshold 10)
#warn: counter LinkDowned = 13  (threshold 10)
#warn: counter RcvErrors = 27   (threshold 10)
Error check on lid 10 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 255       (threshold 10)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 1968      (threshold 10)
Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 1967      (threshold 10)
Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port 15: FAILED
# Checked Switch: nodeguid 0x00144f0000a61390 with failure
#warn: counter SymbolErrors = 65535     (threshold 10)
Error check on lid 7 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkDowned = 12  (threshold 10)
Error check on lid 3 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 15        (threshold 10)
#warn: counter LinkDowned = 12  (threshold 10)
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 15        (threshold 10)
#warn: counter LinkDowned = 12  (threshold 10)
Error check on lid 11 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
Error check on lid 6 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 255       (threshold 10)
#warn: counter RcvErrors = 445  (threshold 10)
Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: Logical link state is Initialize
Port check lid 12 port 15:  FAILED
# Checked Switch: nodeguid 0x00144f0000a61397 with failure
#warn: Logical link state is Initialize
Port check lid 12 port 14:  FAILED
#warn: Logical link state is Initialize
Port check lid 12 port 13:  FAILED
#warn: counter LinkRecovers = 11        (threshold 10)
Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port 13: FAILED

# Checking Ca: nodeguid 0x00144fa5e9ce001c

# Checking Ca: nodeguid 0x00144fa5e9ce000c

# Checking Ca: nodeguid 0x00144fa5e9ce0014

# Checking Ca: nodeguid 0x00144fa5e9ce0004

## Summary: 29 nodes checked, 0 bad nodes found
##          359 ports checked, 3 bad ports found
##          2 ports have errors beyond threshold

In any case, based on your comments above about perfquery working
reliably, I'm skeptical whether this is the issue but it's best to rule
it out.
[EMAIL PROTECTED] tmp]# ibcheckerrors
#warn: counter SymbolErrors = 65535     (threshold 10)
Error check on lid 106 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 18        (threshold 10)
Error check on lid 4 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 19        (threshold 10)
Error check on lid 9 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 26        (threshold 10)
#warn: counter LinkDowned = 13  (threshold 10)
#warn: counter RcvErrors = 27   (threshold 10)
Error check on lid 10 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 255       (threshold 10)
Error check on lid 1 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 2081      (threshold 10)
Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 2080      (threshold 10)
Error check on lid 5 (MT47396 Infiniscale-III Mellanox Technologies) port 15: FAILED
# Checked Switch: nodeguid 0x00144f0000a61390 with failure
#warn: counter SymbolErrors = 65535     (threshold 10)
Error check on lid 7 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkDowned = 12  (threshold 10)
Error check on lid 3 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 15        (threshold 10)
#warn: counter LinkDowned = 12  (threshold 10)
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 15        (threshold 10)
#warn: counter LinkDowned = 12  (threshold 10)
Error check on lid 11 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
Error check on lid 6 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter SymbolErrors = 65535     (threshold 10)
#warn: counter LinkRecovers = 255       (threshold 10)
#warn: counter RcvErrors = 445  (threshold 10)
Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED
#warn: counter LinkRecovers = 11        (threshold 10)
Error check on lid 12 (MT47396 Infiniscale-III Mellanox Technologies) port 13: FAILED
# Checked Switch: nodeguid 0x00144f0000a61397 with failure

## Summary: 29 nodes checked, 0 bad nodes found
##          359 ports checked, 2 ports have errors beyond threshold
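
Comparing the two runs above shows which counters are still moving. Below is a small Python sketch of that comparison, with a few values copied by hand from the ibchecknet and ibcheckerrors output; the dictionary layout and the function name are hypothetical.

```python
def counter_deltas(before, after):
    """Return {key: increase} for counters that grew between two runs."""
    return {key: after[key] - before[key]
            for key in before.keys() & after.keys()
            if after[key] > before[key]}


# Keys are (lid, port, counter); values are taken from the output above.
ibchecknet_run = {
    (5, "all", "SymbolErrors"): 1968,
    (5, "15", "SymbolErrors"): 1967,
    (106, "all", "SymbolErrors"): 65535,
    (12, "13", "LinkRecovers"): 11,
}
ibcheckerrors_run = {
    (5, "all", "SymbolErrors"): 2081,
    (5, "15", "SymbolErrors"): 2080,
    (106, "all", "SymbolErrors"): 65535,
    (12, "13", "LinkRecovers"): 11,
}

for key, delta in sorted(counter_deltas(ibchecknet_run, ibcheckerrors_run).items()):
    print(key, delta)
```

In this data, lid 5 port 15 gained 113 symbol errors between the two runs, which points at a link that is accumulating errors right now. A counter pinned at 65535 cannot show further growth, so it says nothing either way until it is reset.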




Are you sure the request gets to the responder ? Does the responder
respond and it doesn't make it back ?

Yes. As I said, it is not a 100% failure; about 30% to 40% of requests fail. But why?


I don't know enough about what is different about your app yet to say
more right now.

-- Hal






_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
