Very interesting! Appreciate the info. My numbers are slightly better. As I've indicated, there is an NxN message exchange currently in the system that needs to be removed. With that commented out, the system scales roughly linearly with the number of processes.

At 04:31 PM 7/28/2005, you wrote:

I have removed the ompi_ignores from the new bproc components I have been
working on and they are now the default for bproc. These new components
have several advantages over the old bproc component, mainly:
 - We now provide pty support for standard IO.
 - It should work better with threaded applications (although this has not
   been tested).
 - We also now support Scyld bproc and old versions of LANL bproc using a
   serial launch as opposed to the parallel launch used for newer bproc
   versions. (I do not have a box to test this on, so any reports on
   how it works would be appreciated.)
Their use is the same as before: set your NODES environment variable to a
comma-delimited list of the nodes to run on.
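
As a minimal sketch of the NODES setting described above, the Python snippet
below splits a comma-delimited value into node names. The function name
parse_nodes and the sample value "0,1,2" are illustrative assumptions, as is
the choice to ignore stray whitespace and empty entries; the actual parsing
in the bproc components may differ.

```python
import os


def parse_nodes(value):
    """Split a comma-delimited NODES value into a list of node names."""
    # Assumption: whitespace around entries and empty entries are ignored.
    return [name.strip() for name in value.split(",") if name.strip()]


# Hypothetical example value; real node names depend on the cluster.
os.environ["NODES"] = "0,1,2"
nodes = parse_nodes(os.environ["NODES"])
print(nodes)
```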

The new launcher seems to be pretty scalable. Below are two tables where I
ran 'hostname' and a trivial MPI program on varying numbers of nodes with
both 1 and 2 processes per node (all times are in seconds).

Running hostname:
Nodes   1 per node   2 per node
    1      .162         .172
    2      .202         .224
    4      .243         .251
    8      .260         .275
   16      .305         .321
   32      .360         .412
   64      .524         .708
  128     1.036        1.627

Running a trivial MPI program (Init/Finalize):
Nodes   1 per node   2 per node
    1      .33          .46
    2      .44          .63
    4      .56          .77
    8      .61          .89
   16      .71         1.1
   32      .88         1.5
   64     1.4          3.5
  128     3.1          9.2
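
One way to read these timings is to look at how much the run time grows each
time the node count doubles. The sketch below does that for the 1-per-node
columns, using the numbers copied from the tables above; the helper name
doubling_ratios is an illustrative choice. A ratio near 1 means the time
barely changes when the node count doubles, while a ratio near 2 means it
grows roughly in proportion to the node count.

```python
# Times in seconds, copied from the tables above (1 process per node).
nodes = [1, 2, 4, 8, 16, 32, 64, 128]
hostname_1ppn = [0.162, 0.202, 0.243, 0.260, 0.305, 0.360, 0.524, 1.036]
mpi_1ppn = [0.33, 0.44, 0.56, 0.61, 0.71, 0.88, 1.4, 3.1]


def doubling_ratios(times):
    """Return time(2n) / time(n) for each consecutive pair of measurements."""
    return [later / earlier for earlier, later in zip(times, times[1:])]


for label, times in (("hostname", hostname_1ppn), ("MPI init/finalize", mpi_1ppn)):
    for n, ratio in zip(nodes[1:], doubling_ratios(times)):
        print(f"{label}: {n // 2:>3} -> {n:>3} nodes: x{ratio:.2f}")
```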

The frontend and nodes are dual Opteron 242 with 2 GB RAM and GigE.
I have been told that there are some NxN exchanges going on in the MPI
processes which are probably inflating the running times.
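
To see why an NxN exchange could dominate at the larger sizes, the sketch
below counts messages under the assumption that every process sends one
message to every other process. That pattern is an assumption made for
illustration; this message does not describe the actual exchange.

```python
def all_to_all_messages(num_procs):
    """Messages sent if each process sends one message to every other process."""
    return num_procs * (num_procs - 1)


for nodes in (1, 2, 4, 8, 16, 32, 64, 128):
    for per_node in (1, 2):
        procs = nodes * per_node
        print(f"{nodes:>3} nodes x {per_node}: {procs:>3} procs, "
              f"{all_to_all_messages(procs):>6} messages")
```

Under that assumption, 128 nodes with 2 processes per node means 256
processes and 65,280 messages.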

The launcher is split into 2 separate components. The general idea is:
 1. pls_bproc is called by orterun. It figures out the process mapping and
    launches orteds on the nodes.
 2. pls_bproc_orted is called by orted. This module initializes either a pty
    or pipes, places symlinks to them in well-known points of the filesystem,
    and sets up the IO forwarding. It then sends an ack back to orterun.
 3. pls_bproc waits for an ack to come back from the orteds, then does
    parallel launches of the application processes. The number of launches is
    equal to the maximum number of processes on a node.
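
Step 3 can be read as one parallel launch per process slot. The sketch below
is one possible interpretation, not the actual pls_bproc code: launch_rounds
and the procs_per_node mapping are hypothetical names, and it assumes that
launch i targets every node that still needs an i-th process.

```python
def launch_rounds(procs_per_node):
    """Group nodes into parallel launch rounds.

    procs_per_node maps a node name to the number of processes it should run.
    Round i (0-based) contains every node that needs more than i processes,
    so the number of rounds equals the maximum process count on any node.
    """
    max_procs = max(procs_per_node.values(), default=0)
    return [
        sorted(node for node, count in procs_per_node.items() if count > i)
        for i in range(max_procs)
    ]


# Hypothetical mapping: nodes "0" and "2" run two processes, node "1" runs one.
rounds = launch_rounds({"0": 2, "1": 1, "2": 2})
for i, round_nodes in enumerate(rounds):
    print(f"launch {i + 1}: nodes {round_nodes}")
```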

Let me know if there are any problems,
