> The "send completion errors" indicates the packet hasn't been sent out
> to the wire. It seems the retries you have added induced a little bit
> delay for the packet to be sent out successfully, which might indicates
> some flow control or other issues in the device transport layer?

Actually, for RC a send completion error can occur if an ACK is not received for the message. It would be useful to know what the status of the first failed send is, though.

> Do you have any suggestions on how to debug this problem? How can we
> hack the mthca/ipoib code to narrow down the root cause of the problem?
> From the behavior it looks like the local resource temp unavailable, but
> it could be something else.

I definitely think we want to understand what the problem is. For example, does it go away if you increase the RNR retry count but not the ACK timeout retry count? When the problem occurs, is the receive SRQ empty (or is it only happening with ehca's non-SRQ IPoIB/cm)?

 - R.
