Roger Spellman wrote:
Hello,
I have a Mellanox MT25204, running the latest firmware.  A few days ago,
I was getting Catastrophic errors from the firmware.  I found the
following in the Release Notes for RHEL-5:

Hardware testing for the Mellanox MT25204 has revealed that an internal error occurs under certain high-load conditions. When the ib_mthca driver reports a catastrophic error on this hardware, it is usually related to an insufficient completion queue depth relative to the number of outstanding work requests generated by the user application.

Increasing my CQ size did indeed solve the problem.  So, I wanted to
understand why.  I think the reason may be a bug in the mthca code that
comes with OFED.
My code creates a CQ of size 2072, an SQ of size 2056, and an RQ of
size 16.  As you can see, CQ = SQ + RQ.  So, I should never overflow my
CQ.

The driver raises each of these to the next power of two.  So, we get a
CQ of size 4096, an SQ of size 4096, and an RQ of size 16.

As you can see, CQ < SQ + RQ, so it is possible to overflow the CQ.
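
The arithmetic above can be checked with a small sketch. The helper names round_up_pow2() and cq_can_overflow() are hypothetical and only mimic the rounding described in this message; they are not taken from the mthca source, which may round slightly differently.

```c
#include <stdint.h>

/* Round n up to the next power of two. Illustrative helper only:
 * valid for 1 <= n <= 2^31, and not the driver's actual code. */
uint32_t round_up_pow2(uint32_t n)
{
    uint32_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Returns 1 if, after each queue is rounded up independently, the CQ
 * is smaller than the combined SQ and RQ depth, so that a full set of
 * outstanding work requests could overflow it. Returns 0 otherwise. */
int cq_can_overflow(uint32_t cq, uint32_t sq, uint32_t rq)
{
    return round_up_pow2(cq) < round_up_pow2(sq) + round_up_pow2(rq);
}
```

With the sizes from this message, cq_can_overflow(2072, 2056, 16) compares 4096 against 4096 + 16 = 4112 and reports that an overflow is possible.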

I don't think that this should cause the Firmware to generate a
Catastrophic error (sounds like a bug in the firmware, if you ask me).

The CQ's size is increased in the function mthca_create_cq() in the file
mthca_provider.c.  The SQ and RQ sizes are increased in the function
mthca_alloc_qp_common() in the file mthca_qp.c if and only if the
function mthca_is_memfree() returns TRUE; this function returns TRUE
when MTHCA_FLAG_MEMFREE is set in dev->mthca_flags, which it is for the
latest firmware release.

As I said, doubling the CQ size solves the problem.  However, it
would be better if the mthca driver did not create the problem in the
first place.  If a QP is being created such that CQ >= SQ + RQ, then
that relationship should be maintained.  Do others agree with me?

The driver cannot really ensure this because the CQ might be used for more than one QP.

But this issue still raises a question in my mind: how _should_ an application handle this condition? I.e., if the app is required to ensure the CQ is big enough, how does it deal with the case where the driver allocates a bigger QP than requested? By resizing the CQ after creating the QP and discovering via a query that the QP is too big for the CQs?

Steve.
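
One possible application-side answer to the question above, as a rough sketch only: add up the capacities each QP actually received and compare the total with the CQ depth. The struct qp_caps and both function names are hypothetical. With libibverbs the actual values would presumably come from the cap field that ibv_create_qp() fills in, or from ibv_query_qp(), and the resize itself would be a call to ibv_resize_cq(). The sketch assumes every work request generates a completion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the capacities a QP actually received,
 * which may be larger than what the application requested. */
struct qp_caps {
    uint32_t max_send_wr;
    uint32_t max_recv_wr;
};

/* Worst-case number of completions that can be outstanding on a CQ
 * shared by n QPs, assuming every work request is signaled. */
uint64_t cq_entries_needed(const struct qp_caps *qps, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)qps[i].max_send_wr + qps[i].max_recv_wr;
    return total;
}

/* Returns the depth the CQ should be resized to, or 0 if the current
 * depth already covers the worst case. */
uint64_t cq_resize_target(uint64_t cq_depth,
                          const struct qp_caps *qps, size_t n)
{
    uint64_t need = cq_entries_needed(qps, n);
    return need > cq_depth ? need : 0;
}
```

For the QP discussed in this thread (actual SQ 4096, RQ 16) on a CQ of depth 4096, cq_resize_target() returns 4112. An alternative that needs no resize is for the application to never post more work requests than it originally asked for, regardless of what the driver allocated.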



_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general