I'd guess the same thing as George - a race condition in the shutdown of the 
async thread...?  I haven't looked at that code in a long time, so I don't 
remember how it tried to defend against the race condition. 
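
For illustration only, here is a minimal C sketch of the usual defense against 
this kind of race: ask the async thread to stop, join it, and only then tear 
down what it was looking at. This is not the actual Open MPI openib code; 
fake_qp_t, async_ctx_t and shutdown_demo are hypothetical names, and the "QP" 
is just a flag standing in for a real queue pair.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a queue pair; NOT the real struct ibv_qp. */
typedef struct {
    atomic_int alive;         /* 1 while the resource may still be touched */
} fake_qp_t;

/* Hypothetical state shared between the main thread and the async thread. */
typedef struct {
    fake_qp_t *qp;
    atomic_int stop;          /* set by the main thread to request shutdown */
    atomic_int polls;         /* how many times the async thread looked at qp */
    atomic_int bad_accesses;  /* how many times it saw a destroyed qp */
} async_ctx_t;

/* Async thread: keeps inspecting the QP until it is told to stop. */
static void *async_thread(void *arg)
{
    async_ctx_t *ctx = arg;
    while (!atomic_load(&ctx->stop)) {
        if (!atomic_load(&ctx->qp->alive)) {
            atomic_fetch_add(&ctx->bad_accesses, 1);
        }
        atomic_fetch_add(&ctx->polls, 1);
    }
    return NULL;
}

/* Returns the number of accesses to a destroyed QP, or -1 on thread errors. */
int shutdown_demo(void)
{
    fake_qp_t qp;
    async_ctx_t ctx;
    pthread_t tid;

    atomic_init(&qp.alive, 1);
    ctx.qp = &qp;
    atomic_init(&ctx.stop, 0);
    atomic_init(&ctx.polls, 0);
    atomic_init(&ctx.bad_accesses, 0);

    if (pthread_create(&tid, NULL, async_thread, &ctx) != 0) {
        return -1;
    }

    /* Let the async thread run at least once before shutting down. */
    while (atomic_load(&ctx.polls) == 0) {
        sched_yield();
    }

    /* Shutdown order is the point: stop, join, and only then destroy. */
    atomic_store(&ctx.stop, 1);
    if (pthread_join(tid, NULL) != 0) {
        return -1;
    }
    assert(atomic_load(&ctx.polls) > 0);
    atomic_store(&qp.alive, 0);   /* "destroy" the QP after the join */

    return atomic_load(&ctx.bad_accesses);
}
```

With this ordering, shutdown_demo() should return 0, because the thread has 
exited before the QP is marked destroyed. Moving the "destroy" line above the 
join gives the thread a window to see a dead QP, which is the kind of race 
George and I are guessing at.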

Sent from my PDA. No type good. 

On Jan 3, 2011, at 2:31 PM, "Eugene Loh" <eugene....@oracle.com> wrote:

> George Bosilca wrote:
> 
>> Eugene,
>> 
>> This error indicates that somehow we're accessing the QP while the QP is in 
>> "down" state. As the asynchronous thread is the one that sees this error, I 
>> wonder if it is looking for some information about a QP that has been 
>> destroyed by the main thread (as this only occurs in MPI_Finalize).
>> 
>> Can you look in the syslog to see if there is any additional info related to 
>> this issue there?
>> 
> Not much.  A one-liner like this:
> 
> Dec 27 21:49:36 burl-ct-x4150-11 hermon: [ID 492207 kern.info] hermon1: EQE 
> local access violation
> 
>> On Dec 30, 2010, at 20:43, Eugene Loh <eugene....@oracle.com> wrote:
>> 
>>> I was running a bunch of np=4 test programs over two nodes.  Occasionally, 
>>> *one* of the codes would see an IBV_EVENT_QP_ACCESS_ERR during 
>>> MPI_Finalize().  I traced the code and ran another program that mimicked 
>>> the particular MPI calls made by that program.  This other program, too, 
>>> would occasionally trigger this error.  I never saw the problem with other 
>>> tests.  Rate of incidence could go from consecutive runs (I saw this once) 
>>> to 1:100s (more typically) to even less frequently -- I've had 1000s of 
>>> consecutive runs with no problems.  (The tests run a few seconds apiece.)  
>>> The traffic pattern is sends from non-zero ranks to rank 0, with root-0 
>>> gathers, and lots of Allgathers.  The largest messages are 1000 bytes.  It 
>>> appears the problem is always seen on rank 3.
>>> 
>>> Now, I wouldn't mind someone telling me, based on that little information, 
>>> what the problem is here, but I guess I don't expect that.  What I am 
>>> asking is what IBV_EVENT_QP_ACCESS_ERR means.  Again, it's seen during 
>>> MPI_Finalize.  The async thread is seeing this.  What is this error trying 
>>> to tell me?
>>>   
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
