On Jul 12, 2012, at 12:04 PM, Paul Kapinos wrote:

> a long time ago, I reported about an error in Open MPI:
> http://www.open-mpi.org/community/lists/users/2012/02/18565.php
> 
> Well, in the 1.6 the behaviour has changed: the test case doesn't hang forever 
> and block an InfiniBand interface, but seems to run through, and now this 
> error message is printed:
> --------------------------------------------------------------------------
> The OpenFabrics (openib) BTL failed to register memory in the driver.
> Please check /var/log/messages or dmesg for driver specific failure
> reason.

We updated our mechanism, but accidentally left this warning message in (it has 
since been removed).

Here's what's happening: Mellanox changed the default amount of registered 
memory that is available -- they dramatically reduced it.  We haven't gotten a 
good answer yet as to *why* this change was made.

You can change some kernel-level parameters to increase it again, and then OMPI 
should work fine.  Here's an IBM article about it:

http://www.ibm.com/developerworks/wikis/display/hpccentral/Using+RDMA+with+pagepool+larger+than+8GB

And here are some comments that Mellanox made on a ticket about this issue 
(including some corrections/clarifications to that IBM article):

    https://svn.open-mpi.org/trac/ompi/ticket/3134#comment:12
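
For a rough sense of the numbers involved: on mlx4-based Mellanox hardware, 
the registration ceiling is commonly described as a function of two mlx4_core 
module parameters, log_num_mtt and log_mtts_per_seg, and the page size.  Those 
parameter names and the formula are not stated in this mail, so treat them as 
assumptions and check them against the two links above.  A minimal Python 
sketch under those assumptions:

```python
def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Estimate the registered-memory ceiling in bytes.

    Assumed formula (not from this mail):
        (2 ** log_num_mtt) * (2 ** log_mtts_per_seg) * page_size
    """
    return (2 ** log_num_mtt) * (2 ** log_mtts_per_seg) * page_size


def covers_physical_memory(limit_bytes, physical_bytes, factor=2):
    # Hypothetical rule of thumb: allow registering `factor` times the RAM.
    return limit_bytes >= factor * physical_bytes


if __name__ == "__main__":
    limit = max_registerable_bytes(log_num_mtt=20, log_mtts_per_seg=3)
    print(limit // 2**30, "GiB registerable")
    print("enough for 24 GiB RAM:", covers_physical_memory(limit, 24 * 2**30))
```

With the example values above (chosen for illustration only, not anyone's 
defaults), the estimate works out to 32 GiB, which would not cover twice the 
RAM of a 24 GiB node.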

-----

Basically, what's happening is that OMPI is behaving badly when it runs out of 
registered memory.  We have tried two things to make this better (i.e., still 
perform *correctly*, albeit at a lower performance level), and we're not sure 
yet whether they work properly.

1. When OMPI tries to register more memory for an RDMA message transaction and 
fails, it falls back to send-receive (where we already have pre-registered 
memory available to use).  However, this can still end up hanging because of 
OMPI's "lazy connection" scheme -- where OMPI doesn't open IB connections 
between MPI processes until the first time each pair of processes communicates.  
So if OMPI runs out of registered memory and then tries to open a new IB 
connection to a new peer -- kaboom.

2. When OMPI starts, it estimates how much memory can be registered and 
divides it equally among all the OMPI processes *in that job* on the same 
node.  We have had mixed reports on whether this works.  I made a 1.6.x 
tarball with this fix in it.  Please try it with the default low 
registered-memory kernel parameters, so that you can still trigger the "out 
of registered memory" issue:

    http://www.open-mpi.org/~jsquyres/unofficial/
    Use the openmpi-1.6.1ticket3131r26612M.tar.bz2 tarball

#2 is the latest attempt to fix it, but we haven't had good testing of it.  
Could you give it a whirl and let us know what happens?
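
To make the two approaches above concrete, here is a toy Python model of the 
idea.  It is not OMPI code: the class, the function, and the byte counts are 
all hypothetical, and it ignores the lazy-connection problem described in #1.  
It only shows an equal per-process share of the node-wide estimate (#2) and a 
fallback to send-receive when a registration does not fit (#1).

```python
class RegistrationBudget:
    """Toy model of one process's share of registerable memory (hypothetical)."""

    def __init__(self, node_limit_bytes, local_procs):
        if local_procs < 1:
            raise ValueError("local_procs must be at least 1")
        # Approach #2: split the node-wide estimate equally among the
        # processes of this job that run on the same node.
        self.limit = node_limit_bytes // local_procs
        self.used = 0

    def try_register(self, nbytes):
        """Reserve nbytes if it fits in this process's share; report success."""
        if self.used + nbytes > self.limit:
            return False
        self.used += nbytes
        return True


def choose_protocol(budget, message_bytes):
    # Approach #1: if registering memory for an RDMA transfer fails, fall
    # back to send-receive, which uses buffers registered ahead of time.
    if budget.try_register(message_bytes):
        return "rdma"
    return "send-receive"


if __name__ == "__main__":
    budget = RegistrationBudget(node_limit_bytes=8 * 2**30, local_procs=4)
    for _ in range(3):
        print(choose_protocol(budget, 2**30))
```

In this model, each of 4 local processes gets a 2 GiB share of an 8 GiB 
estimate, so the first two 1 GiB transfers use RDMA and the third falls back 
to send-receive.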

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

