Ralph H Castain wrote:
After some investigation, I'm afraid that I have to report that this - as
far as I understand what you are doing - may no longer work in Open MPI in
the future (and I'm pretty sure isn't working in the trunk today except
[maybe] in the special case of hostfile - haven't verified that).
To ensure we are correctly communicating, let me reiterate what I understand
you are doing:
Correct. Also consider that for my testing I use a batch scheduler that
is not managed by ORTE right now, so I provide the hostfiles myself (this
batch scheduler is named OAR and is in use on the grid5000 research
facility in France).
This was caused by mpirun itself processing its local environment and then
"pushing" it into the global registry. Keeping everything separated causes a
bookkeeper's headache and many lines of code that we would like to
eliminate.
I see the point. I agree there is very little benefit in allowing users to
have different local environments on different mpirun instances, while
it would be a real pain to keep the code that manages this clean. For my
own usage, the app_context feature you described is a more elegant and
equivalent way of spawning my FT services. I will switch to it right
away.
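For reference, what I have in mind is simply to launch the FT service as an
additional app_context of the same mpirun, along the lines of (the
application names and host lists below are placeholders of mine):

    mpirun -np 16 -host a,b,c,d ./my_app : -np 1 -host e ./ft_service

so that both applications share a single environment and a single job.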
Still, it might be of some use to be able to start different mpiruns the
same way you plan comm_spawn to work: sharing the same environment, but
allowing a different hostfile. The use case that comes to mind is "grid",
where a different batch scheduler is in use on each cluster, so you
cannot gather a single hostfile. This is not a feature I would fight for,
but I can imagine some people might find it useful.
More important for me is the ability to refill the hostfile with fresh
hosts when some of the original ones die. Preemptively allocating a huge
number of spares is just not the right way to go. As an aside, I am not
sure that even the best comm_spawn you discussed would be of much help
here, as I do not want the new nodes to end up in a different COMM_WORLD.
Finding a way to update the registry and all the orteds accordingly is a
much larger issue than simple spawning, and I have not really thought it
through yet. Maybe we should discuss this issue separately.
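To make the COMM_WORLD concern concrete: as far as I can tell, the best one
can do with the standard dynamic-process API is the sketch below (the
executable name is a placeholder). The spawned processes arrive in an
intercommunicator, and merging it only yields a new, larger
intracommunicator, never the original MPI_COMM_WORLD of the survivors.

    #include <mpi.h>

    /* Parent side: spawn one replacement process.  The child gets its own
     * MPI_COMM_WORLD; merging the intercommunicator produces a new, larger
     * intracommunicator, not the COMM_WORLD the surviving processes share. */
    MPI_Comm spawn_replacement(void)
    {
        MPI_Comm intercomm, merged;

        MPI_Comm_spawn("./my_app", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

        /* high = 0: the parent group is ordered first in the merged comm */
        MPI_Intercomm_merge(intercomm, 0, &merged);
        return merged;
    }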
Aurelien
Please feel free to comment. If this is a big enough issue to a large enough
audience, then we can try to find a way to solve it (assuming Open MPI's
community decides to support it).
Ralph
The next requirement is the ability to add nodes to the initial pool at
runtime. Because nodes may fail (and the same basically applies with
comm_spawn), I might need some (possibly many) spare nodes to replace
failed ones. As I do not want to request twice as many nodes as I actually
need (after all, things could go just fine, so why should I waste that
many computing resources on idle spares?), I definitely want to be able to
add new nodes to the pool of already running machines. As far as I
understand, this is impossible to achieve with use case 2 and quite
difficult with use case 1. In my opinion, having the ability to spawn on
nodes which are not part of the initial hostfile is a key feature (and not
only for FT purposes).
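To be concrete about what I would like to be able to write, here is a
purely hypothetical sketch. I am assuming an "add-host" style MPI_Info key
that, as far as I know, is not supported today; the host and executable
names are placeholders.

    #include <mpi.h>

    /* Hypothetical: grow the node pool with a freshly allocated spare host
     * and start the replacement process there. */
    void spawn_on_spare(const char *spare_host, MPI_Comm *intercomm)
    {
        MPI_Info info;

        MPI_Info_create(&info);
        MPI_Info_set(info, "add-host", spare_host);   /* assumed info key */
        MPI_Comm_spawn("./my_app", MPI_ARGV_NULL, 1, info,
                       0, MPI_COMM_WORLD, intercomm, MPI_ERRCODES_IGNORE);
        MPI_Info_free(&info);
    }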
I am looking for more detail on the above issue. What
resource manager are you using?
Ideally, we would prefer not to support this. Any nodes
that you run on, or hope to run on, would be designated
at the start. For example:
mpirun -np 1 --host a,b,c,d,e,f,g
This would cause the one process of the MPI job to start on host a.
The MPI job then has the other hosts available to it should it decide
later to start processes on them. However, no ORTE daemons would
be started on those nodes until calls to MPI_Comm_spawn
occur, so the MPI job would not be consuming any resources
until called upon to do so.
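In code, the later spawn onto one of those pre-listed hosts might look
something like the sketch below (the "host" info key is how I would expect
the target to be named; the executable name is a placeholder):

    #include <mpi.h>

    /* Start one process on host "b", which was listed on the mpirun command
     * line above but has had no ORTE daemon started on it yet. */
    void spawn_on_listed_host(MPI_Comm *intercomm)
    {
        MPI_Info info;

        MPI_Info_create(&info);
        MPI_Info_set(info, "host", "b");   /* must be in the initial --host list */
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info,
                       0, MPI_COMM_WORLD, intercomm, MPI_ERRCODES_IGNORE);
        MPI_Info_free(&info);
    }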
This has actually been the subject of multiple threads on the user list and
is considered a critical capability by some users and vendors. I believe
there is little problem in allowing those systems that can support it to
dynamically add nodes to ORTE via some API into the resource manager. At the
moment, none of the RMs support it, but LSF will (and TM at least may)
shortly do so, and some of their customers are depending upon it.
The problem is that job startup could be delayed for a significant time if all
hosts must be preallocated. Admittedly, adding hosts dynamically raises all
kinds of issues about how long the job could be stalled waiting for the new
hosts. However, as the other somewhat exhaustive threads have discussed, there
are computing models that can live with this uncertainty, and RMs that will
provide async callbacks to allow the rest of the app to continue working while
waiting.
Just my $0.00002 - again, this goes back to the question: are there use cases
and customers to which Open MPI is simply going to say "we won't support that"?
Rolf
I know there have been some extra discussions on this subject;
unfortunately, it looks like I am not on the list where they happened.
I hope I am not raising concerns that have already been addressed.
Aurelien
Ralph H Castain wrote:
Yo all
As you know, I am working on revamping the hostfile functionality to make it
work better with managed environments (at the moment, the two are
exclusive). The issue that we need to review is how we want the interaction
to work, both for the initial launch and for comm_spawn.
In talking with Jeff, we boiled it down to two options that I have
flow-charted (see attached):
Option 1: in this mode, we read any allocated nodes provided by a resource
manager (e.g., SLURM). These nodes establish a base pool of nodes that can
be used by both the initial launch and any dynamic comm_spawn requests. The
hostfile and any -host info is then used to select nodes from within that
pool for use with the specific launch. The initial launch would use the
-hostfile or -host command line option to provide that info - comm_spawn
would use the MPI_Info fields to provide similar info.
This mode has the advantage of allowing a user to obtain a large allocation,
and then designate hosts within the pool for use by an initial application,
and separately designate (via another hostfile or -host spec) another set of
those hosts from the pool to support a comm_spawn'd child job.
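For example (hypothetical host names): given a SLURM allocation of nodes
n01-n08, the initial launch could use

    mpirun -np 4 -host n01,n02 ./app

to run on n01 and n02, while a later comm_spawn could pass "n05,n06" via its
MPI_Info host field and draw those hosts from the same allocation.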
If no resource managed nodes are found, then the hostfile and -host options
would provide the list of hosts to be used. Again, comm_spawn'd jobs would
be able to specify their own hostfile and -host nodes.
The negative to this option is complexity - in the absence of a managed
allocation, I either have to deal with hostfile/dash-host allocations in the
RAS and then again in RMAPS, or I have "allocation-like" functionality
happening in RMAPS.
Option 2: in this mode, we read any allocated nodes provided by a resource
manager, and then filter those using the command line hostfile and -host
options to establish our base pool. Any spawn commands (both the initial one
and comm_spawn'd child jobs) would utilize this filtered pool of nodes.
Thus, comm_spawn is restricted to using hosts from that initial pool.
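To illustrate with hypothetical host names: if SLURM allocates n01-n08 and
the hostfile lists n01-n04, the base pool becomes n01-n04; adding
"-host n01,n02" restricts it further to n01 and n02, and every subsequent
comm_spawn would also be confined to those two nodes.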
We could possibly extend this option by only using the hostfile in our
initial filter. In other words, let the hostfile downselect the resource
manager's allocation for the initial launch. Any -host options on the
command line would only apply to the hosts used to launch the initial
application. Any comm_spawn would use the hostfile-filtered pool of hosts.
The advantage here is simplicity. The disadvantage lies in flexibility for
supporting dynamic operations.
The major difference between these options really only impacts the initial
pool of hosts to be used for launches, both the initial one and any
subsequent comm_spawns. Barring any commentary, I will implement option 1 as
this provides the maximum flexibility.
Any thoughts? Other options we should consider?
Thanks
Ralph