On Fri, 9 Mar 2012, Jeffrey Squyres wrote:
On Mar 9, 2012, at 1:14 PM, George Bosilca wrote:
The hang occurs because there is nothing on the lru to deregister and
ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts the
request on its rdma pending list and continues. If any message comes in, the
rdma pending list is progressed. If not, it hangs indefinitely!
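To make the failure mode concrete, here is a toy model of the sequence described above. It is only a sketch: every name in it (toy_proc_t, toy_register, and so on) is hypothetical and does not correspond to actual Open MPI, verbs, or uGNI symbols, and the registration limit is reduced to a simple counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names are Open MPI symbols. */
typedef struct {
    size_t reg_limit;   /* registrations the (pretend) NIC allows      */
    size_t reg_in_use;  /* registrations currently held                */
    size_t lru_len;     /* idle cached registrations we could evict    */
    size_t pending;     /* requests parked on the rdma pending list    */
} toy_proc_t;

/* Stand-in for ibv_reg_mr / GNI_MemRegister: at the limit, try to evict
 * one entry from our own lru first; fail if the lru is empty. */
static bool toy_register(toy_proc_t *p)
{
    assert(p != NULL);
    if (p->reg_in_use >= p->reg_limit) {
        if (p->lru_len == 0) {
            return false;           /* nothing on the lru to deregister */
        }
        p->lru_len--;
        p->reg_in_use--;
    }
    p->reg_in_use++;
    return true;
}

/* Start an RDMA transfer; on registration failure, park the request. */
static void toy_start_rdma(toy_proc_t *p)
{
    if (!toy_register(p)) {
        p->pending++;
    }
}

/* The pending list is retried only when a message arrives. */
static void toy_on_incoming_message(toy_proc_t *p)
{
    while (p->pending > 0 && toy_register(p)) {
        p->pending--;
    }
}
```

With reg_limit == reg_in_use and an empty lru, toy_start_rdma() parks the request, and nothing in this process will ever retry it unless toy_on_incoming_message() runs and a registration has been freed in the meantime.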
Unlike Jeff, I'm not in favor of adding bandages. If the cause is understood,
then there _is_ a fix, and that fix should be the target of any efforts.
The fix that Nathan proposes is not a complete fix -- we can still run out of
memory and hang. You should read the open tickets and prior emails we have
sent about this -- Nathan's fix merely delays when we will run out of
registered memory. It does not solve the underlying problem.
Correct.
In general I have found that the underlying cause of the hang is an imbalance
of registrations between processes on a node, i.e., the hung process has an
empty lru but other processes could deregister. I am working on a new mpool
(grdma) to handle the imbalance. The new mpool will allow a process to request
that one of its peers deregister from its lru if possible. I have a working
proof-of-concept implementation that uses a POSIX shmem segment and a progress
function to handle signaling and dereferencing. With it I no longer see hangs
with IMB Alltoall/Alltoallv on uGNI (without putting an artificial limit on the
number of registrations). I will test the mpool on InfiniBand later today.
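Roughly, the signaling idea can be sketched as below. This is not the grdma code: the layout, the names (toy_segment_t, toy_request_eviction, toy_progress), and the "ask every peer" policy are assumptions made for illustration. In a real implementation the segment would live in POSIX shared memory (e.g. obtained with shm_open and mmap); here it is an ordinary struct so the sketch stays self-contained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_MAX_PEERS 4

/* One slot per local process. Hypothetical layout, not the grdma one. */
typedef struct {
    atomic_int evict_requested;  /* set by a peer that ran out of registrations */
    size_t     lru_len;          /* idle registrations the owner could drop     */
} toy_peer_slot_t;

typedef struct {
    toy_peer_slot_t slot[TOY_MAX_PEERS];
    atomic_size_t   freed;       /* registrations released on request           */
} toy_segment_t;

static void toy_segment_init(toy_segment_t *seg)
{
    for (int i = 0; i < TOY_MAX_PEERS; i++) {
        atomic_init(&seg->slot[i].evict_requested, 0);
        seg->slot[i].lru_len = 0;
    }
    atomic_init(&seg->freed, 0);
}

/* Called by a process whose own lru is empty: ask every other local
 * process to give up one idle registration. */
static void toy_request_eviction(toy_segment_t *seg, int self, int npeers)
{
    assert(self >= 0 && self < npeers && npeers <= TOY_MAX_PEERS);
    for (int i = 0; i < npeers; i++) {
        if (i != self) {
            atomic_store(&seg->slot[i].evict_requested, 1);
        }
    }
}

/* Progress function run by each process: honor a pending request by
 * dropping one entry from its own lru. Returns the number dropped. */
static int toy_progress(toy_segment_t *seg, int self)
{
    assert(self >= 0 && self < TOY_MAX_PEERS);
    if (!atomic_exchange(&seg->slot[self].evict_requested, 0)) {
        return 0;                /* nobody asked */
    }
    if (seg->slot[self].lru_len == 0) {
        return 0;                /* asked, but nothing idle to drop */
    }
    seg->slot[self].lru_len--;   /* real code would deregister the memory here */
    atomic_fetch_add(&seg->freed, 1);
    return 1;
}
```

The point of the design is that the starved process never touches a peer's lru directly; it only raises a flag, and each peer evicts from its own cache the next time its progress function runs.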
If a solution already exists I don't see why we have to have the message code.
Based on its urgency, I'm confident your patch will make its way into the 1.5
quite easily.
Depending on the timing, this might go to 1.6 (1.5.5 has waited for too long,
and this is not a regression). Keep in mind that the problem has been around
for *a long, long time*, which is why I approved the diag message (i.e.,
because a real solution is still nowhere in sight). The real issue is that we
can still run out of registered memory *and there is nothing left to
deregister*. The real solution there is that the PML should fall back to a
different protocol, but I'm told that doesn't happen and will require a bunch
of work to make work properly.
An mpool that is aware of local processes' lrus will solve the problem in most
cases (all that I have seen) but yes, we need to rework the PML to handle the
remaining cases. There are two things that need to be changed (from what I can
tell):
1) allow rget to fall back to send/put depending on the failure (I have
fallback on put implemented in my branch -- and in my btl).
2) need to devise new criteria on when we should progress the rdma_pending
list to avoid deadlock.
#1 is fairly simple and I haven't given much thought to #2.
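For #1, the shape of the fallback could look something like the sketch below. The status codes, the protocol enum, and the ordering (rget, then put, then plain send) are all assumptions for illustration; they are not the PML's actual types, and which failures should map to which fallback is exactly the part that still needs to be decided.

```c
#include <assert.h>

/* Hypothetical status and protocol names, not Open MPI constants. */
typedef enum {
    TOY_OK,
    TOY_ERR_OUT_OF_RESOURCE,   /* e.g. memory registration failed */
    TOY_ERR_OTHER
} toy_status_t;

typedef enum {
    TOY_PROTO_RGET,
    TOY_PROTO_PUT,
    TOY_PROTO_SEND,
    TOY_PROTO_FAIL
} toy_proto_t;

/* Pick the protocol to use given the outcome of the rget attempt and,
 * if that failed, of the put attempt. Assumption: only a resource
 * failure triggers a fallback; any other error is reported as-is. */
static toy_proto_t toy_select_protocol(toy_status_t rget_status,
                                       toy_status_t put_status)
{
    if (rget_status == TOY_OK) {
        return TOY_PROTO_RGET;
    }
    if (rget_status != TOY_ERR_OUT_OF_RESOURCE) {
        return TOY_PROTO_FAIL;      /* not a registration problem */
    }
    if (put_status == TOY_OK) {
        return TOY_PROTO_PUT;       /* receiver-side registration worked */
    }
    if (put_status == TOY_ERR_OUT_OF_RESOURCE) {
        return TOY_PROTO_SEND;      /* copy through already-registered buffers */
    }
    return TOY_PROTO_FAIL;
}
```

The last step relies on send needing no new registration, so it can make progress even when there is nothing left to deregister, instead of leaving the request on the pending list.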
-Nathan