George Bosilca wrote:
Eugene,
This error indicates that somehow we're accessing the QP while the QP is in
"down" state. As the asynchronous thread is the one that sees this error, I
wonder whether it is looking up information about a QP that has already been
destroyed by the main thread (as this only occurs in MPI_Finalize).
Can you look in the syslog to see if there is any additional info related to
this issue there?
Not much. A one-liner like this:
Dec 27 21:49:36 burl-ct-x4150-11 hermon: [ID 492207 kern.info] hermon1: EQE local access violation
On Dec 30, 2010, at 20:43, Eugene Loh <eugene....@oracle.com> wrote:
I was running a bunch of np=4 test programs over two nodes. Occasionally,
*one* of the codes would see an IBV_EVENT_QP_ACCESS_ERR during MPI_Finalize().
I traced the code and ran another program that mimicked the particular MPI
calls made by that program. This other program, too, would occasionally
trigger this error. I never saw the problem with other tests. The failure rate
ranged from failures on consecutive runs (I saw this once) to roughly one in
hundreds of runs (more typical) to even rarer -- I've had 1000s of consecutive
runs with no problems. (The tests run a few seconds apiece.) The traffic pattern is
sends from non-zero ranks to rank 0, with root-0 gathers, and lots of
Allgathers. The largest messages are 1000 bytes. It appears the problem is
always seen on rank 3.
Now, I wouldn't mind someone telling me, based on that little information, what
the problem is here, but I guess I don't expect that. What I am asking is what
IBV_EVENT_QP_ACCESS_ERR means. Again, it's seen during MPI_Finalize. The
async thread is seeing this. What is this error trying to tell me?