[ ... IB spec stuff about last WQE reached events ... ]

> The IPoIB-CM implementation takes the approach of posting another WR
> that completes on the same CQ and waiting for this WR to return as a
> WC.  IPoIB first puts the QP into the error state, then waits for the
> last WQE event in the async event handler and posts a drain WR; the QP
> resources are released when the last CQEs are generated.  However,
> this works for ConnectX but not for ehca.
>
> The ehca implementation follows Section 11-5.2.5: the last WQE reached
> event is generated when the QP gets into the Error state and there are
> no more WQEs on the RQ.  So these QP resources are never released,
> which causes a QP resource leak; no QPs can be released at all.  When
> the maximum number of QPs is reached (the default is 128 for non-SRQ,
> 4K for SRQ), no new connections can be built, and nodes become
> unreachable.
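Just to make sure we're talking about the same thing, the drain trick
described above boils down to roughly the following.  This is only a
minimal sketch against the kernel verbs API; start_drain(),
handle_drain_wc() and DRAIN_WRID are made-up names for illustration,
not the actual ipoib_cm code:

        #include <rdma/ib_verbs.h>

        #define DRAIN_WRID 0xdeadbeefULL  /* arbitrary sentinel, illustration only */

        /* Move the RX QP to the error state and post a marker send WR. */
        static int start_drain(struct ib_qp *qp)
        {
                struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
                struct ib_send_wr wr = { .wr_id = DRAIN_WRID, .opcode = IB_WR_SEND };
                struct ib_send_wr *bad_wr;
                int ret;

                ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
                if (ret)
                        return ret;

                /*
                 * The QP is already in the error state, so this WR must
                 * not be executed; it must complete on the CQ with
                 * IB_WC_WR_FLUSH_ERR.
                 */
                return ib_post_send(qp, &wr, &bad_wr);
        }

        /* In the CQ polling loop: the flushed marker means the QP is drained. */
        static void handle_drain_wc(struct ib_wc *wc)
        {
                if (wc->wr_id == DRAIN_WRID) {
                        /* wc->status should be IB_WC_WR_FLUSH_ERR; all
                         * earlier WRs on this QP have been flushed, so
                         * it is now safe to destroy the QP and free its
                         * state. */
                }
        }

The whole scheme depends on that flush completion actually showing up,
which brings me to the question below.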
I don't understand what the problem is for ehca.  Once the QP is in the
error state, posting a WR to the send queue should complete immediately
with a flush error status, and that completion should trigger IPoIB to
clean up the QP.  What goes wrong with ehca?

> LAST WQE reached event for RX QP0
> post last WR for QP0
> poll_cq
> below only applies to Mellanox, ehca won't see
> last WQ in SRQ
> ----------------
> see last WR for QP0

So you're saying that the send request doesn't complete for ehca?  That
seems like it must be a bug somewhere in the ehca
driver/firmware/hardware.  This has nothing to do with SRQ or last WQE
reached events -- it is the basic requirement that send requests posted
when a QP is in the error state complete with a flush error status.

> Since non-SRQ doesn't handle the async event, it never releases QPs;
> 128 connections will run out soon even in a two-node cluster by
> repeating the above steps.  (This is another bug, I will submit a fix.)

Yes, if non-SRQ doesn't free QPs, then this is another bug.

> 2. If node-1 fails to send a DREQ to the remote for any reason, e.g.
> node-1 shuts down, then the RX QP on node-2 will be put on the error
> list after around 21 minutes
> (IPOIB_CM_RX_TIMEOUT + IPOIB_CM_RX_DELAY = 5 * 256 * HZ):
>
> #define IPOIB_CM_RX_TIMEOUT (2 * 256 * HZ)
> #define IPOIB_CM_RX_DELAY   (3 * 256 * HZ)
>
> The timer seems too long for releasing stale QP resources; we could
> run out of QPs in a large cluster even with mthca/mlx4.

It is a long timeout, but how often does this case happen?  When a node
crashes?

> 1. Is it a MUST to put the QP into the error state before posting the
> last WR?  If it's a MUST, why?

Yes, it's a must because we don't want the send executed, we want it to
complete with an error status.

> 2. The last WQE event is only generated once for each QP, even if
> IPoIB sets the QP into the error state and the CI surfaces a Local
> Work Queue Catastrophic Error on the same QP at the same time -- is
> that right?

Umm, a local work queue catastrophic error means something went wrong in
the driver/firmware/hardware -- a consumer shouldn't be able to cause
this type of event.  Finding out why this catastrophic error happens
should help debug things.

 - R.

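P.S.  One way to take IPoIB out of the picture entirely: a tiny
userspace test along the lines below (a rough, untested sketch using
libibverbs; error checking mostly omitted) creates an RC QP, moves it
straight to the error state, posts a zero-length signaled send and
polls for the completion.  On a device that follows the spec you should
see exactly one completion with status IBV_WC_WR_FLUSH_ERR; if the poll
loop never returns a completion on ehca, that is exactly the problem.

        #include <stdio.h>
        #include <infiniband/verbs.h>

        int main(void)
        {
                struct ibv_device **dev_list = ibv_get_device_list(NULL);
                if (!dev_list || !dev_list[0])
                        return 1;

                struct ibv_context *ctx = ibv_open_device(dev_list[0]);
                struct ibv_pd *pd = ibv_alloc_pd(ctx);
                struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

                struct ibv_qp_init_attr init_attr = {
                        .send_cq = cq,
                        .recv_cq = cq,
                        .cap     = { .max_send_wr = 1, .max_recv_wr = 1,
                                     .max_send_sge = 1, .max_recv_sge = 1 },
                        .qp_type = IBV_QPT_RC,
                };
                struct ibv_qp *qp = ibv_create_qp(pd, &init_attr);

                /* Any state -> Error is a legal transition. */
                struct ibv_qp_attr attr = { .qp_state = IBV_QPS_ERR };
                ibv_modify_qp(qp, &attr, IBV_QP_STATE);

                /* Zero-length send posted while the QP is in the error
                 * state; it must not execute, only flush back through
                 * the CQ. */
                struct ibv_send_wr wr = {
                        .wr_id      = 0x1234,
                        .opcode     = IBV_WR_SEND,
                        .send_flags = IBV_SEND_SIGNALED,
                };
                struct ibv_send_wr *bad_wr;
                if (ibv_post_send(qp, &wr, &bad_wr))
                        perror("ibv_post_send");

                struct ibv_wc wc;
                int n;
                do {
                        n = ibv_poll_cq(cq, 1, &wc);
                } while (n == 0);       /* spin until the flush completion arrives */

                if (n < 0)
                        fprintf(stderr, "ibv_poll_cq failed\n");
                else
                        printf("wr_id 0x%llx status %d (%s)\n",
                               (unsigned long long) wc.wr_id, wc.status,
                               ibv_wc_status_str(wc.status));

                ibv_destroy_qp(qp);
                ibv_destroy_cq(cq);
                ibv_dealloc_pd(pd);
                ibv_close_device(ctx);
                ibv_free_device_list(dev_list);
                return 0;
        }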