On 6/23/2014 12:31 PM, Chuck Lever wrote:
On Jun 23, 2014, at 1:25 PM, Hefty, Sean <[email protected]> wrote:

For the record, with both mlx4 and cxgb4, we see FRMRs left valid
after a FAST_REG_MR is flushed during a connection loss. More study
needed, obviously.
Is the bug that this type of WR completes in error, but actually exposed the 
memory region?
We haven’t checked if the MR is exposed; hadn’t thought of that!

I don't think this is a bug. It is a race where the HW is in the process of fast-registering the memory at the time the QP is moved out of RTS, causing all pending work requests to get FLUSHED. I looked at both the IBTA IB and IETF iWARP Verbs specs, and neither states explicitly what FLUSHED status means. They both say "at the time the QP was moved to ERROR the work request was not complete". That doesn't indicate that the work request was canceled or didn't actually complete. At least that's how I read it. Regardless, the Chelsio hardware behaves this way. And apparently the mlx4 hardware does too.

Anyway, for cxgb4 at least, the FRMR can be left in the valid state. The correct procedure when a fast-reg WR completes as FLUSHED is to dereg the MR if you want to ensure the region is invalidated.
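
To make that procedure concrete, here is a minimal sketch of the bookkeeping a consumer could do. It is only a model: every type and function name below is a hypothetical stand-in, not the kernel verbs API. The assumption is that a flushed or failed fast-reg (or local-invalidate) WR leaves the FRMR in an unknown state, and that the only safe recovery is to dereg the MR (in a kernel consumer, presumably ib_dereg_mr() followed by a fresh allocation).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the kernel's ib_verbs.h types. */
enum wr_kind   { WR_FAST_REG, WR_LOCAL_INV, WR_OTHER };
enum wc_result { WC_OK, WC_FLUSHED, WC_OTHER_ERR };

/* What the consumer is entitled to believe about the FRMR. */
enum frmr_state { FRMR_INVALID, FRMR_VALID, FRMR_UNKNOWN };

struct frmr {
	enum frmr_state state;
	unsigned int dereg_count;	/* times we fell back to dereg */
};

/* Update the consumer's view of the FRMR from one work completion. */
static void frmr_on_completion(struct frmr *mr, enum wr_kind kind,
			       enum wc_result res)
{
	assert(mr != NULL);

	if (kind == WR_OTHER)
		return;		/* WR did not touch the FRMR */

	if (res == WC_OK) {
		mr->state = (kind == WR_FAST_REG) ? FRMR_VALID : FRMR_INVALID;
		return;
	}

	/*
	 * Flushed or failed: the hardware may or may not have acted on
	 * the WR before the QP left RTS, so assume nothing.
	 */
	mr->state = FRMR_UNKNOWN;
}

/*
 * Recover an FRMR whose state is unknown. Returns true if a dereg was
 * needed. Assumption: a real consumer would dereg and reallocate the
 * MR here; this model only records that it happened.
 */
static bool frmr_recover(struct frmr *mr)
{
	assert(mr != NULL);

	if (mr->state != FRMR_UNKNOWN)
		return false;

	mr->dereg_count++;
	mr->state = FRMR_INVALID;
	return true;
}
```

The point of the separate FRMR_UNKNOWN state is that the consumer never tries a LOCAL_INVALIDATE on an MR whose last fast-reg was flushed; it goes straight to the dereg path.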

What we do know is that a subsequent LOCAL_INVALIDATE using the rkey that
should work (if FAST_REG_MR had indeed never been done) fails in some cases.
With mlx4, the LINV completes with IB_WC_MW_BIND_ERR. Steve can provide
more detail about the exact failure mode with cxgb4.

cxgb4 completes with IB_WC_LOC_ACCESS_ERR.

Steve.