On Fri, 9 Mar 2012, Jeffrey Squyres wrote:

On Mar 9, 2012, at 1:14 PM, George Bosilca wrote:

The hang occurs because there is nothing on the lru to deregister and 
ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts the 
request on its rdma pending list and continues. If any message comes in, the 
rdma pending list is progressed. If not, it hangs indefinitely!
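
To make the sequence above concrete, here is a toy model in Python. It is not 
Open MPI code: ToyPml, try_register, start_rdma and on_incoming_message are 
hypothetical names, and try_register only stands in for a call such as 
ibv_reg_mr or GNI_MemRegister failing when nothing can be deregistered.

```python
from collections import deque


class ToyPml:
    """Toy model of the pending-list behaviour; all names are hypothetical."""

    def __init__(self, free_registrations):
        self.free_registrations = free_registrations
        self.rdma_pending = deque()
        self.completed = []

    def try_register(self):
        # Stand-in for a registration call that fails when no
        # registrations are available and the lru is empty.
        if self.free_registrations == 0:
            return False
        self.free_registrations -= 1
        return True

    def start_rdma(self, request):
        if self.try_register():
            self.completed.append(request)
        else:
            # Registration failed: park the request and continue.
            self.rdma_pending.append(request)

    def on_incoming_message(self):
        # In this model the pending list is retried only from here.
        for _ in range(len(self.rdma_pending)):
            self.start_rdma(self.rdma_pending.popleft())
```

With ToyPml(0), a call to start_rdma("req0") leaves the request parked. If 
on_incoming_message is never called, nothing retries it, which is the hang 
described above.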

Unlike Jeff, I'm not in favor of adding bandages. If the cause is understood, 
then there _is_ a fix, and that fix should be the target of any efforts.

The fix that Nathan proposes is not a complete fix -- we can still run out of 
memory and hang.  You should read the open tickets and prior emails we have 
sent about this -- Nathan's fix merely delays when we will run out of 
registered memory.  It does not solve the underlying problem.

Correct.

In general I have found that the underlying cause of the hang is an imbalance 
of registrations between processes on a node, i.e., the hung process has an 
empty lru but other processes could deregister. I am working on a new mpool 
(grdma) to handle the imbalance. The new mpool will allow a process to request 
that one of its peers deregister from its lru if possible. I have a working proof of 
concept implementation that uses a posix shmem segment and a progress function 
to handle signaling and dereferencing. With it I no longer see hangs with IMB 
Alltoall/Alltoallv on uGNI (without putting an artificial limit on the number 
of registrations). I will test the mpool on infiniband later today.
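
As a rough illustration of the peer-deregistration idea (not the grdma 
implementation), here is a toy model in Python. ToyNode and ToyProc are 
hypothetical names, the node-wide limit is an assumption made for the example, 
and the peer request is a direct method call where the proof of concept uses a 
posix shmem segment and a progress function.

```python
from collections import OrderedDict


class ToyNode:
    """Hypothetical node with a shared limit on registrations."""

    def __init__(self, limit):
        self.limit = limit
        self.in_use = 0
        self.procs = []


class ToyProc:
    """Hypothetical process with an lru of idle, cached registrations."""

    def __init__(self, node):
        self.node = node
        self.lru = OrderedDict()  # oldest entry first
        node.procs.append(self)

    def evict_one(self):
        # Deregister the least recently used idle registration, if any.
        if not self.lru:
            return False
        self.lru.popitem(last=False)
        self.node.in_use -= 1
        return True

    def register(self, key):
        while self.node.in_use >= self.node.limit:
            if self.evict_one():
                continue
            # Own lru is empty: ask peers to deregister from theirs.
            peers = (p for p in self.node.procs if p is not self)
            if not any(p.evict_one() for p in peers):
                return False  # nothing left to deregister anywhere
        self.node.in_use += 1
        return True

    def release(self, key):
        # The registration is idle but stays cached on the lru.
        self.lru[key] = True
```

In this model, a process with an empty lru can still register as long as some 
peer has an idle entry. register returns False only when every lru on the node 
is empty, which is the remaining case the PML would have to handle.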

If a solution already exists I don't see why we have to have the message code. 
Based on its urgency, I'm confident your patch will make its way into the 1.5 
series quite easily.


Depending on the timing, this might go to 1.6 (1.5.5 has waited for too long, 
and this is not a regression).  Keep in mind that the problem has been around 
for *a long, long time*, which is why I approved the diag message (i.e., 
because a real solution is still nowhere in sight).  The real issue is that we 
can still run out of registered memory *and there is nothing left to 
deregister*.  The real solution there is that the PML should fall back to a 
different protocol, but I'm told that doesn't happen and will require a bunch 
of work to make work properly.

An mpool that is aware of local processes' lrus will solve the problem in most 
cases (all that I have seen) but yes, we need to rework the PML to handle the 
remaining cases. There are two things that need to be changed (from what I can 
tell):

 1) allow rget to fall back to send/put depending on the failure (I have 
fallback to put implemented in my branch, and in my btl).
 2) devise new criteria for when we should progress the rdma_pending 
list to avoid deadlock.

#1 is fairly simple and I haven't given much thought to #2.
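
For #1, the control flow could look something like the Python sketch below. It 
is only an illustration of "fall back depending on the failure": 
RegistrationError and transfer are hypothetical names, and the protocol 
callables are placeholders for the rget, put and send paths rather than the 
actual PML or btl code.

```python
class RegistrationError(Exception):
    """Hypothetical error: memory registration failed."""


def transfer(protocols):
    """Try each (name, attempt) pair in order and return the name used.

    Only a RegistrationError triggers a fallback to the next protocol;
    any other exception propagates to the caller unchanged.
    """
    for name, attempt in protocols:
        try:
            attempt()
            return name
        except RegistrationError:
            continue
    raise RuntimeError("no protocol could make progress")
```

A caller would pass the protocols in order of preference, e.g. 
[("rget", do_rget), ("put", do_put), ("send", do_send)], where the do_* 
functions are assumed to raise RegistrationError when registration fails.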

-Nathan
