Hi Jeff,
Hi All,

On 08/07/12 18:51, Jeff Squyres wrote:
> So I'm not 100% clear on what you mean here: when you set the OFED params
> to allow registration of more memory than you have physically, does the
> problem go away?

We are talking about machines with 24GB RAM (S) and 96GB RAM (L).
The default Mellanox/OFED parameter values are 20/3 (log_num_mtt / log_mtts_per_seg), which gives 32 GB of registerable memory (RM) on both S and L. That is more than the memory of S (but less than 2x the memory of S), and less than the memory of L.

If the OFED parameters are raised so that RM is at least 64 GB (20/3 => 21/3, 22/3, 24/3), there are no errors; I have just tested this with 8 GB and 15.5 GB of data (usually starting 1 process per node).
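(For reference, the usual formula is RM = page_size * 2^log_num_mtt * 2^log_mtts_per_seg; with 4 KiB pages that gives 20/3 => 32 GiB, 21/3 => 64 GiB, 22/3 => 128 GiB, 24/3 => 512 GiB. The values are changed via the mlx4_core module options, e.g. a line like "options mlx4_core log_num_mtt=22 log_mtts_per_seg=3" in /etc/modprobe.d/ plus a driver reload; the exact file name may depend on the OFED installation.)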

If the OFED parameters are _not_ changed (= 32 GB RM), there is _no_ warning on S nodes; on L nodes the user gets this warning:
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
.......
  Registerable memory:     32768 MiB
  Total memory:            98293 MiB
--------------------------------------------------------------------------
.. hardly surprising: the warning appears if and only if RM < total memory.


If the OFED parameters are _not_ changed (= 32 GB RM) and I try to send at least 8 GB _in one chunk_, the 'queue pair' error appears (see S_log.txt and my last mail). More precisely, at least one process seems to die in MPI_Finalize (all output of the program is correct). The same error also appears on L nodes, surrounded by the above warning (L_log.txt).
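To make "in one chunk" clear: the test essentially does one huge MPI_Reduce over the whole double array. Below is a minimal C sketch of the pattern -- *not* the real test program, all names are made up; only the argument convention (a.out <#doubles total> <#doubles per chunk>) is the same:

/* chunked_reduce.c -- minimal sketch of the pattern, *not* the real
 * test program.  Usage mirrors:  a.out <#doubles total> <#doubles per chunk>
 * A chunk value >= the total size means "everything in one piece".      */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    long n     = (argc > 1) ? atol(argv[1]) : 1000000;   /* total doubles   */
    long chunk = (argc > 2) ? atol(argv[2]) : n;         /* doubles / piece */
    int  rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *in  = malloc((size_t)n * sizeof(double));
    double *out = (rank == 0) ? malloc((size_t)n * sizeof(double)) : NULL;
    for (long i = 0; i < n; i++) in[i] = 0.1;

    /* reduce the array piece by piece; one iteration if chunk >= n */
    for (long off = 0; off < n; off += chunk) {
        /* MPI counts are int, so each piece must stay below 2^31 elements */
        int len = (int)((n - off < chunk) ? (n - off) : chunk);
        MPI_Reduce(in + off, (rank == 0) ? out + off : NULL, len,
                   MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    }

    if (rank == 0) printf("reduced %ld doubles in pieces of %ld\n", n, chunk);
    MPI_Finalize();
    return 0;
}

Giving a chunk value >= the total size reproduces the "everything in one piece" case that fails here; smaller chunks work (and, see P.S.2 below, are also faster).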





> From your log messages, the warning messages were from machines with
> nearly 100GB RAM but only 32GB register-able.  But only one of those was
> the same as one that showed QP creation failures.
> So I'm not clear which was which.
>
> Regardless: can you pump the MTT params up to allow registering all
> of physical memory on those machines, and see if you get any failures?

As you can see, on a node with 24 GB memory and 32 GB RM there can be a failure without any warning from the Open MPI side :-(



> To be clear: we're trying to determine if we should spend more effort
> on making OMPI work properly in low-registered-memory-available
> scenarios, or whether we should just emphasize "go reset your MTT
> parameters to allow registering all of physical memory."

After experiencing failures even when about 1.3x of physical memory (32 GB RM on a 24 GB node) may be registered, I would follow Mellanox in "go reset your MTT to allow _twice_ the physical memory".

So:
- if the OFED parameters are raised, everything is OK;
- there is a [rare] combination which your great workaround does not catch;
- allowing 2x the memory to be registered could be a good idea.

Does this make sense?

Best,
Paul Kapinos

P.S. The example program used is of course a synthetic thing, but it closely mimics the behaviour of the Serpent software. (However, Serpent usually sends in chunks, whereas the actual error arises only when all 8 GB are sent in one piece.)

P.S.2 When everything works, increasing the chunk size to HUGE values seems to make performance worse: sending all 15.5 GB in one piece is more than twice as slow as sending it in ~200 MB pieces. See chunked_send.txt below
(the first parameter is the number of doubles of data, the second the number of doubles per chunk).
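(For scale: 2080000000 doubles * 8 bytes is roughly 15.5 GiB in total; the four runs in chunked_send.txt use pieces of the whole array, ~1.6 GB, ~166 MB and ~16.6 MB, taking about 282 s, 137 s, 125 s and 124 s respectively.)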

P.S.3 All experiments above were done with Open MPI 1.6.1rc2.

P.S.4 I am also performing some Linpack runs on 6 nodes, and my very first impression is that raising log_num_mtt to huge values is a bad idea (a performance loss of some 5%). But let me finish that first...

--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915
----- attachment: chunked_send.txt -----
$ $MPIEXEC  -n 6 -m 1  -H 
linuxbmc0191,linuxbmc0037,linuxbmc0105,linuxbmc0219,linuxbmc0221,linuxbmc0246  
a.out  2080000000 2080000001
 MPI_Reduce in einem!
Elapsed time: 281.862924
$ $MPIEXEC  -n 6 -m 1  -H 
linuxbmc0191,linuxbmc0037,linuxbmc0105,linuxbmc0219,linuxbmc0221,linuxbmc0246  
a.out  2080000000 208000000
Elapsed time: 137.281245
$ $MPIEXEC  -n 6 -m 1  -H 
linuxbmc0191,linuxbmc0037,linuxbmc0105,linuxbmc0219,linuxbmc0221,linuxbmc0246  
a.out  2080000000 20800000
Elapsed time: 124.584747
$ $MPIEXEC  -n 6 -m 1  -H 
linuxbmc0191,linuxbmc0037,linuxbmc0105,linuxbmc0219,linuxbmc0221,linuxbmc0246  
a.out  2080000000 2080000
Elapsed time: 124.167813
----- attachment: S_log.txt -----
process 1 starts test 
process 3 starts test 
process 5 starts test 
process 0 starts test 
Epsilon = 0.0000000010
process 2 starts test 
process 4 starts test 
#######  process 4 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
#######  process 2 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
#######  process 3 yields partial sum: 108000000.4
#######  process 0 yields partial sum: 108000000.4
#######  process 5 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
#######  process 1 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
--------------------------------------------------------------------------
A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.

For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host:             linuxbmc0373.rz.RWTH-Aachen.DE
Local device:           mlx4_0
Queue pair type:        Reliable connected (RC)
--------------------------------------------------------------------------
[linuxbmc0373.rz.RWTH-Aachen.DE:27808] *** An error occurred in MPI_Barrier
[linuxbmc0373.rz.RWTH-Aachen.DE:27808] *** on communicator MPI_COMM_WORLD
[linuxbmc0373.rz.RWTH-Aachen.DE:27808] *** MPI_ERR_OTHER: known error not in 
list
[linuxbmc0373.rz.RWTH-Aachen.DE:27808] *** MPI_ERRORS_ARE_FATAL: your MPI job 
will now abort
 
rank: 3  VmPeak:        25775796 kB
 difference occured: 2.597129 
[SOLL | IST] = ( 648000000.00 | 648000002.60)
Elapsed time: 88.500506
Master's Summe: 647999996.699953 

rank: 1  VmPeak:        25775784 kB

rank: 0  VmPeak:        17338172 kB

rank: 5  VmPeak:        17104212 kB

rank: 4  VmPeak:        25775692 kB
--------------------------------------------------------------------------
mpiexec has exited due to process rank 3 with PID 27807 on
node linuxbmc0373 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------

rank: 2  VmPeak:        25775696 kB
Failure executing command /opt/MPI/openmpi-1.6.1rc2/linux/intel/bin/mpiexec -x  
LD_LIBRARY_PATH -x  PATH -x  OMP_NUM_THREADS -x  MPI_NAME --hostfile 
/tmp/pk224850/cluster_2867/hostfile-29271 -np 6 memusage a.out 1080000000 
1080000001
----- attachment: L_log.txt -----
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              linuxbmc0219.rz.RWTH-Aachen.DE
  Registerable memory:     32768 MiB
  Total memory:            98293 MiB
--------------------------------------------------------------------------
process 1 starts test 
process 3 starts test 
process 2 starts test 
process 0 starts test 
Epsilon = 0.0000000010
process 5 starts test 
process 4 starts test 
[cluster.rz.RWTH-Aachen.DE:40669] 5 more processes have sent help message 
help-mpi-btl-openib.txt / reg mem limit low
[cluster.rz.RWTH-Aachen.DE:40669] Set MCA parameter "orte_base_help_aggregate" 
to 0 to see all help / error messages
#######  process 0 yields partial sum: 108000000.4
#######  process 5 yields partial sum: 108000000.4
#######  process 2 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
#######  process 1 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
#######  process 3 yields partial sum: 108000000.4
#######  process 4 yields partial sum: 108000000.4
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
size, block: 1080000000 1080000001 
 MPI_Reduce in einem!
 difference occured: 2.597129 
[SOLL | IST] = ( 648000000.00 | 648000002.60)
Elapsed time: 81.982594
--------------------------------------------------------------------------
A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.

For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host:             linuxbmc0219.rz.RWTH-Aachen.DE
Local device:           mlx4_0
Queue pair type:        Reliable connected (RC)
--------------------------------------------------------------------------
[linuxbmc0219.rz.RWTH-Aachen.DE:20388] *** An error occurred in MPI_Barrier
[linuxbmc0219.rz.RWTH-Aachen.DE:20388] *** on communicator MPI_COMM_WORLD
[linuxbmc0219.rz.RWTH-Aachen.DE:20388] *** MPI_ERR_OTHER: known error not in 
list
[linuxbmc0219.rz.RWTH-Aachen.DE:20388] *** MPI_ERRORS_ARE_FATAL: your MPI job 
will now abort
 
rank: 3  VmPeak:        25775824 kB
Master's Summe: 647999996.699953 

rank: 0  VmPeak:        17338196 kB

rank: 4  VmPeak:        25775712 kB

rank: 2  VmPeak:        25775668 kB

rank: 5  VmPeak:        17104192 kB
--------------------------------------------------------------------------
mpiexec has exited due to process rank 3 with PID 20385 on
node linuxbmc0219 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------

rank: 1  VmPeak:        25775812 kB
Failure executing command /opt/MPI/openmpi-1.6.1rc2/linux/intel/bin/mpiexec -x  
LD_LIBRARY_PATH -x  PATH -x  OMP_NUM_THREADS -x  MPI_NAME --hostfile 
/tmp/pk224850/cluster_2867/hostfile-40538 -np 6 memusage a.out 1080000000 
1080000001
