Re: [OMPI devel] Hostfiles - yet again

Rolf . Vandevaart Thu, 26 Jul 2007 09:33:37 -0400

Aurelien Bouteiller wrote:

Hi Ralph and everyone,
I just want to make sure the proposed use cases do not break one of the current Open MPI features I require. For FT purposes, I need to get some specific hosts (let's say with a better MTBF). Those hosts are not part of the MPI_COMM_WORLD but are used to deploy FT services (like event loggers, checkpoint servers, etc). To enable collaboration between computing nodes and those FT services, I use the usual MPI2 Dynamics with MPI_Accept/Connect. This means that those different instances of mpirun need to share the same orte registry, so that they can establish the MPI2 connect/accept through the registered MPI_ports.
This background in place, my first concern is how the deployment maps to the allocated resources. The nodes used to deploy FT services are "special". In a typical use case, I get machines with better MTBF, faster or larger disks by requesting special properties from the resource allocation manager. I don't want those to be mixed with regular nodes in the resulting hostfile: these scarce resources should hold only FT services, no computing processes. As I understand things, I don't see any way to keep mpirun from deploying application processes on my "special" nodes if they are part of the same launch/allocation in your "filtering" use case. Currently I run two different mpiruns with a single orte seed holding the registry. This way I get two different hostfiles, one for computing nodes, one for FT services. I just want to make sure everybody understands this requirement so that this feature does not disappear in the brainstorming :]

With the use of resource managers, --host, and --hostfile this should
all be possible.

The next requirement is the ability to add some nodes to the initial pool at runtime. Because nodes may fail (but it is basically the same with comm_spawn), I might need some (a lot of) spare nodes to replace failed ones. As I do not want to request twice as many nodes as I need (after all, things could just go fine, so why should I waste that many computing resources on idle spares?), I definitely want to be able to allocate some new nodes to the pool of already running machines. As far as I understand, this is impossible to achieve with use case 2 and quite difficult in use case 1. In my opinion, having the ability to spawn on nodes which are not part of the initial hostfile is a key feature (and not only for FT purposes).

I am looking for more detail on the above issue.  Which
resource manager are you using?

Ideally, we would prefer not to support this.  Any nodes
that you run on, or hope to run on, would be designated
at the start.   For example:

mpirun -np 1 --host a,b,c,d,e,f,g

This would cause the one process of the MPI job to start on host a.
Then, the MPI job has the other hosts available to it should it decide
later to start a job on them.  However no ORTE daemons would
be started on those nodes until calls to MPI_Comm_spawn
occur.   So, the MPI job would not be consuming any resources
until called upon to.
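
The behaviour described above can be modelled roughly as follows. This is only a toy sketch in Python, not ORTE code: the class name `DeclaredHosts` and its method `start_on` are hypothetical, and the "daemon" is just an entry in a list. It shows the idea that every host is declared at startup, but nothing is started on a host until a launch or spawn actually targets it.

```python
class DeclaredHosts:
    """Toy model: hosts are declared up front, daemons start on first use."""

    def __init__(self, hosts):
        self.hosts = list(hosts)   # everything named on the command line
        self.daemons = []          # hosts with a running daemon, in start order

    def start_on(self, host):
        """Start a daemon on `host` if needed; reject undeclared hosts."""
        if host not in self.hosts:
            raise ValueError(f"{host} was not declared at startup")
        if host not in self.daemons:
            self.daemons.append(host)
        return host


# Mirrors "mpirun -np 1 --host a,b,c,d,e,f,g": one process, on host a.
job = DeclaredHosts("a,b,c,d,e,f,g".split(","))
job.start_on("a")
# A later spawn onto host d is the first time anything runs there.
job.start_on("d")
```

In this model a spawn onto a host that was never declared fails, which is the restriction Aurelien is asking about.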

Rolf

I know there have been some extra discussions on this subject. Unfortunately it looks like I am not part of the list where they happened. I hope those concerns have not already been discussed.


Aurelien

Ralph H Castain wrote:

Yo all

As you know, I am working on revamping the hostfile functionality to make it
work better with managed environments (at the moment, the two are
exclusive). The issue that we need to review is how we want the interaction
to work, both for the initial launch and for comm_spawn.

In talking with Jeff, we boiled it down to two options that I have
flow-charted (see attached):

Option 1: in this mode, we read any allocated nodes provided by a resource
manager (e.g., SLURM). These nodes establish a base pool of nodes that can
be used by both the initial launch and any dynamic comm_spawn requests. The
hostfile and any -host info is then used to select nodes from within that
pool for use with the specific launch. The initial launch would use the
-hostfile or -host command line option to provide that info - comm_spawn
would use the MPI_Info fields to provide similar info.

This mode has the advantage of allowing a user to obtain a large allocation,
and then designate hosts within the pool for use by an initial application,
and separately designate (via another hostfile or -host spec) another set of
those hosts from the pool to support a comm_spawn'd child job.

If no resource managed nodes are found, then the hostfile and -host options
would provide the list of hosts to be used. Again, comm_spawn'd jobs would
be able to specify their own hostfile and -host nodes.

The negative to this option is complexity - in the absence of a managed
allocation, I either have to deal with hostfile/dash-host allocations in the
RAS and then again in RMAPS, or I have "allocation-like" functionality
happening in RMAPS.
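
A minimal sketch of how the Option 1 selection might behave, written in Python purely for illustration. The function names `base_pool` and `select_for_launch`, and the representation of nodes as lists of hostnames, are assumptions and not ORTE code. Treating an empty request as "use the whole pool", and a request for a node outside the pool as an error, are also assumptions of this sketch.

```python
def base_pool(rm_nodes, hostfile_nodes, dash_host_nodes):
    """Option 1 base pool: the resource manager's allocation if there is
    one, otherwise whatever the hostfile and -host options list."""
    if rm_nodes:
        return list(rm_nodes)
    # No managed allocation: hostfile and -host together define the pool.
    pool = []
    for node in list(hostfile_nodes) + list(dash_host_nodes):
        if node not in pool:
            pool.append(node)
    return pool


def select_for_launch(pool, requested):
    """Pick the nodes for one launch (initial or comm_spawn) from the pool."""
    if not requested:
        # Assumption: no hostfile/-host/MPI_Info request means the whole pool.
        return list(pool)
    missing = [n for n in requested if n not in pool]
    if missing:
        raise ValueError("not in pool: " + ", ".join(missing))
    return [n for n in pool if n in requested]


# One large allocation, with separate host sets for parent and child jobs.
pool = base_pool(["n1", "n2", "n3", "n4"], [], [])
initial = select_for_launch(pool, ["n1", "n2"])   # initial launch
child = select_for_launch(pool, ["n3", "n4"])     # comm_spawn'd child job
```

The point of the sketch is that each launch makes its own selection against the same unfiltered pool, so a child job can land on nodes the initial launch never named.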


Option 2: in this mode, we read any allocated nodes provided by a resource
manager, and then filter those using the command line hostfile and -host
options to establish our base pool. Any spawn commands (both the initial one
and comm_spawn'd child jobs) would utilize this filtered pool of nodes.
Thus, comm_spawn is restricted to using hosts from that initial pool.

We could possibly extend this option by only using the hostfile in our
initial filter. In other words, let the hostfile downselect the resource
manager's allocation for the initial launch. Any -host options on the
command line would only apply to the hosts used to launch the initial
application. Any comm_spawn would use the hostfile-filtered pool of hosts.

The advantage here is simplicity. The disadvantage lies in flexibility for
supporting dynamic operations.
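
For comparison, a rough Python sketch of Option 2 and its hostfile-only extension. The names `downselect` and `option2_pools` are hypothetical, and merging hostfile and -host entries into a single filter is an assumption; the description above does not say how the two combine.

```python
def downselect(nodes, keep):
    """Keep only the nodes named in `keep`; an empty `keep` keeps everything."""
    if not keep:
        return list(nodes)
    return [n for n in nodes if n in keep]


def option2_pools(rm_nodes, hostfile_nodes, dash_host_nodes, hostfile_only=False):
    """Return (initial_launch_nodes, spawn_pool) under Option 2.

    hostfile_only=False: hostfile and -host both filter the allocation,
    and every launch shares the filtered pool.
    hostfile_only=True: the extension, where only the hostfile filters the
    pool and -host narrows the initial launch alone.
    """
    if hostfile_only:
        pool = downselect(rm_nodes, hostfile_nodes)
        return downselect(pool, dash_host_nodes), pool
    # Assumption: hostfile and -host entries are merged into one filter.
    pool = downselect(rm_nodes, list(hostfile_nodes) + list(dash_host_nodes))
    return list(pool), pool


allocation = ["n1", "n2", "n3", "n4"]
initial, spawn_pool = option2_pools(allocation, ["n1", "n2", "n3"], ["n1"],
                                    hostfile_only=True)
```

Either way the spawn pool is fixed once at startup, so no later comm_spawn can reach `n4` in this example.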


The major difference between these options really only impacts the initial
pool of hosts to be used for launches, both the initial one and any
subsequent comm_spawns. Barring any commentary, I will implement option 1 as
this provides the maximum flexibility.

Any thoughts? Other options we should consider?

Thanks
Ralph

------------------------------------------------------------------------


_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel

