George Bosilca wrote:

Eugene,

This error indicates that somehow we're accessing a QP while it is in the
"down" state. Since the asynchronous thread is the one that sees the error, I
wonder if it is looking up information about a QP that has already been destroyed by the
main thread (as this only occurs in MPI_Finalize).

Can you look in the syslog to see if there is any additional info related to 
this issue there?

Not much.  A one-liner like this:

Dec 27 21:49:36 burl-ct-x4150-11 hermon: [ID 492207 kern.info] hermon1: EQE local access violation

On Dec 30, 2010, at 20:43, Eugene Loh <eugene....@oracle.com> wrote:
I was running a bunch of np=4 test programs over two nodes.  Occasionally, 
*one* of the codes would see an IBV_EVENT_QP_ACCESS_ERR during MPI_Finalize().  
I traced the code and ran another program that mimicked the particular MPI 
calls made by that program.  This other program, too, would occasionally 
trigger this error.  I never saw the problem with other tests.  The failure
rate ranged from back-to-back runs (I saw this once) to roughly one in hundreds
of runs (more typically) to even less often -- I've had thousands of consecutive
runs with no problems.  (The tests run a few seconds apiece.)  The traffic
pattern is sends from non-zero ranks to rank 0, with root-0 gathers, and lots of
Allgathers.  The largest messages are 1000 bytes.  The problem, when it
appears, is always on rank 3.

Now, I wouldn't mind someone telling me, based on that little information, what 
the problem is here, but I guess I don't expect that.  What I am asking is what 
IBV_EVENT_QP_ACCESS_ERR means.  Again, it's seen during MPI_Finalize.  The 
async thread is seeing this.  What is this error trying to tell me?
