On 7/26/07 2:24 PM, "Aurelien Bouteiller" <boute...@cs.utk.edu> wrote:
> Ralph H Castain wrote:
>> After some investigation, I'm afraid that I have to report that this - as
>> far as I understand what you are doing - may no longer work in Open MPI in
>> the future (and I'm pretty sure isn't working in the trunk today except
>> [maybe] in the special case of hostfile - haven't verified that).
>>
>> To ensure we are correctly communicating, let me reiterate what I understand
>> you are doing:
>>
> Correct. Also, consider that for my testing I use a batch scheduler that
> is not managed by ORTE right now, and I provide the hostfiles myself (this
> batch scheduler is named OAR and is in use on the grid5000 research
> facility in France).
>
>> This was caused by mpirun itself processing its local environment and then
>> "pushing" it into the global registry. Keeping everything separated creates a
>> bookkeeping headache and many lines of code that we would like to
>> eliminate.
>>
>>
> I see the point. I agree there is very little benefit in allowing users to
> have different local environments on different mpirun instances, while
> it would be a real pain to keep the code that manages this clean. For my own
> usage, the app_context feature you described is a more elegant and
> equivalent way of spawning my FT services. I will switch to this right
> away.
>
> Still, it might be of some use to be able to start different mpirun
> instances the same way you plan for comm_spawn to work: sharing the same
> environment, but allowing the use of a different hostfile. The use case
> that comes to mind is "grid", where different batch schedulers are in use
> on each cluster, so you can't gather a single hostfile. This is not a
> feature I would fight for, but I can imagine some people might find it useful.
One of the design changes we made was to explicitly not support
multi-cluster operations from inside of Open MPI. Instead, people (not us)
are looking at adding a layer on top of Open MPI to handle the cross-cluster
coordination. I expect you'll hear more about those efforts in the
not-too-distant future.
>
> More important for me is the ability to refill the hostfile with fresh
> hosts when some of the original ones have died. Preemptively allocating a
> huge number of spares is just not the correct way to go. As an aside, I am
> not sure that even the best comm_spawn you discussed could be of much
> help in this case, as I do not want the new nodes to end up in a different
> COMM_WORLD. Finding a way to update the registry and all the orteds to do
> so is a much larger issue than simple spawning, and I have not really
> thought it through yet. Maybe we should discuss this issue
> separately.
Ah, now -that- is a different topic indeed. I do plan to support a dynamic
add_hosts API as part of the revamped system. I'll try to flesh that out as
a separate RFC later.
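
Just to give a flavor of the direction (this is purely a hypothetical
sketch, not an existing interface; the actual key names and semantics are
exactly what the RFC will need to nail down), one possible shape is to let
a spawn request carry brand-new hosts into the pool, e.g. via an
"add-host" style MPI_Info key, with the host names coming from whatever
resource manager you use (OAR in your case):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* HYPOTHETICAL: ask ORTE to add fresh hosts to the existing pool
         * as part of the spawn, rather than selecting from the original
         * allocation.  "add-host", "spare1,spare2" and "./ft_service"
         * are placeholders. */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "add-host", "spare1,spare2");

        MPI_Comm child;
        MPI_Comm_spawn("./ft_service", MPI_ARGV_NULL, 2, info,
                       0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);

        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }

That still leaves your COMM_WORLD concern open, of course - the sketch only
shows how hosts could be added, not how ranks would join the existing job.
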
Thanks
Ralph
>
> Aurelien
>> Please feel free to comment. If this is a big enough issue for a large enough
>> audience, then we can try to find a way to solve it (assuming Open MPI's
>> community decides to support it).
>>
>> Ralph
>>
>>
>>
>>>>> The next requirement is the ability to add nodes to the initial pool
>>>>> at runtime. Because nodes may fail (but it is basically the same with
>>>>> comm_spawn), I might need some (possibly a lot of) spare nodes to
>>>>> replace failed ones. As I do not want to request twice as many nodes
>>>>> as I need (after all, things could just go fine, so why should I waste
>>>>> that many computing resources on idle spares?), I definitely want to
>>>>> be able to allocate some new nodes to the pool of already running
>>>>> machines. As far as I understand, this is impossible to achieve with
>>>>> use case 2 and quite difficult in use case 1. In my opinion, having
>>>>> the ability to spawn on nodes which are not part of the initial
>>>>> hostfile is a key feature (and not only for FT purposes).
>>>>>
>>>>>
>>>>>
>>>>>
>>>> I am looking for more detail on the above issue. What
>>>> resource manager are you using?
>>>>
>>>> Ideally, we would prefer not to support this. Any nodes
>>>> that you run on, or hope to run on, would be designated
>>>> at the start. For example:
>>>>
>>>> mpirun -np 1 --host a,b,c,d,e,f,g
>>>>
>>>> This would cause the one process of the MPI job to start on host a.
>>>> Then, the MPI job has the other hosts available to it, should it decide
>>>> later to start a job on them. However, no ORTE daemons would
>>>> be started on those nodes until calls to MPI_Comm_spawn
>>>> occur. So, the MPI job would not be consuming any resources
>>>> on those nodes until called upon to do so.
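>>>>
>>>> To make that concrete, a later spawn onto one of those pre-listed
>>>> hosts could look something like the sketch below (just a sketch: the
>>>> "./worker" binary is a placeholder, and it assumes the "host"
>>>> MPI_Info key is honored by MPI_Comm_spawn):
>>>>
>>>>   #include <mpi.h>
>>>>
>>>>   int main(int argc, char **argv)
>>>>   {
>>>>       MPI_Init(&argc, &argv);
>>>>
>>>>       /* Ask for the child process to be placed on host "b", one of
>>>>        * the hosts listed on the mpirun command line above. */
>>>>       MPI_Info info;
>>>>       MPI_Info_create(&info);
>>>>       MPI_Info_set(info, "host", "b");
>>>>
>>>>       MPI_Comm child;
>>>>       MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info,
>>>>                      0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
>>>>
>>>>       /* Only at this point would an ORTE daemon appear on host "b". */
>>>>       MPI_Info_free(&info);
>>>>       MPI_Finalize();
>>>>       return 0;
>>>>   }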
>>>>
>>> This has actually been the subject of multiple threads on the user list and
>>> is considered a critical capability by some users and vendors. I believe
>>> there is little problem in allowing those systems that can support it to
>>> dynamically add nodes to ORTE via some API into the resource manager. At the
>>> moment, none of the RMs support it, but LSF will (and TM at least may)
>>> shortly do so, and some of their customers are depending upon it.
>>>
>>> The problem is that job startup could be delayed for a significant time if all
>>> hosts must be preallocated. Admittedly, this raises all kinds of issues
>>> about how long the job could be stalled waiting for the new hosts. However,
>>> as the other somewhat exhaustive threads have discussed, there are computing
>>> models that can live with this uncertainty, and RMs that will provide async
>>> callbacks to allow the rest of the app to continue working while waiting.
>>>
>>> Just my $0.00002 - again, this goes back to: are there use cases and
>>> customers to whom Open MPI is simply going to say "we won't support that"?
>>>
>>>
>>>> Rolf
>>>>
>>>>
>>>>> I know there have been some extra discussions on this subject.
>>>>> Unfortunately, it looks like I am not on the list where they happened.
>>>>> I hope those concerns have not already been discussed.
>>>>>
>>>>> Aurelien
>>>>>
>>>>> Ralph H Castain wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Yo all
>>>>>>
>>>>>> As you know, I am working on revamping the hostfile functionality to
>>>>>> make it work better with managed environments (at the moment, the two
>>>>>> are exclusive). The issue that we need to review is how we want the
>>>>>> interaction to work, both for the initial launch and for comm_spawn.
>>>>>>
>>>>>> In talking with Jeff, we boiled it down to two options that I have
>>>>>> flow-charted (see attached):
>>>>>>
>>>>>> Option 1: in this mode, we read any allocated nodes provided by a
>>>>>> resource manager (e.g., SLURM). These nodes establish a base pool of
>>>>>> nodes that can be used by both the initial launch and any dynamic
>>>>>> comm_spawn requests. The hostfile and any -host info is then used to
>>>>>> select nodes from within that pool for use with the specific launch.
>>>>>> The initial launch would use the -hostfile or -host command line
>>>>>> option to provide that info - comm_spawn would use the MPI_Info
>>>>>> fields to provide similar info.
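>>>>>>
>>>>>> As a rough illustration (file and program names below are just
>>>>>> placeholders, and this assumes "hostfile" is among the MPI_Info keys
>>>>>> we expose), the initial launch might be
>>>>>> "mpirun -np 8 -hostfile parent_hosts ./app" inside the managed
>>>>>> allocation, while the child job picks a different subset of the pool:
>>>>>>
>>>>>>   #include <mpi.h>
>>>>>>
>>>>>>   int main(int argc, char **argv)
>>>>>>   {
>>>>>>       MPI_Init(&argc, &argv);
>>>>>>
>>>>>>       /* Select hosts for the child job from the base pool using a
>>>>>>        * separate hostfile, analogous to -hostfile on the command
>>>>>>        * line for the initial launch. */
>>>>>>       MPI_Info info;
>>>>>>       MPI_Info_create(&info);
>>>>>>       MPI_Info_set(info, "hostfile", "child_hosts");
>>>>>>
>>>>>>       MPI_Comm child;
>>>>>>       MPI_Comm_spawn("./child_app", MPI_ARGV_NULL, 4, info,
>>>>>>                      0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
>>>>>>
>>>>>>       MPI_Info_free(&info);
>>>>>>       MPI_Finalize();
>>>>>>       return 0;
>>>>>>   }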
>>>>>>
>>>>>> This mode has the advantage of allowing a user to obtain a large
>>>>>> allocation, and then designate hosts within the pool for use by an
>>>>>> initial application, and separately designate (via another hostfile
>>>>>> or -host spec) another set of those hosts from the pool to support a
>>>>>> comm_spawn'd child job.
>>>>>>
>>>>>> If no resource managed nodes are found, then the hostfile and -host
>>>>>> options would provide the list of hosts to be used. Again,
>>>>>> comm_spawn'd jobs would be able to specify their own hostfile and
>>>>>> -host nodes.
>>>>>>
>>>>>> The negative to this option is complexity - in the absence of a
>>>>>> managed allocation, I either have to deal with hostfile/dash-host
>>>>>> allocations in the RAS and then again in RMAPS, or I have
>>>>>> "allocation-like" functionality happening in RMAPS.
>>>>>>
>>>>>>
>>>>>> Option 2: in this mode, we read any allocated nodes provided by a
>>>>>> resource manager, and then filter those using the command line
>>>>>> hostfile and -host options to establish our base pool. Any spawn
>>>>>> commands (both the initial one and comm_spawn'd child jobs) would
>>>>>> utilize this filtered pool of nodes. Thus, comm_spawn is restricted
>>>>>> to using hosts from that initial pool.
>>>>>>
>>>>>> We could possibly extend this option by only using the hostfile in
>>>>>> our initial filter. In other words, let the hostfile downselect the
>>>>>> resource manager's allocation for the initial launch. Any -host
>>>>>> options on the command line would only apply to the hosts used to
>>>>>> launch the initial application. Any comm_spawn would use the
>>>>>> hostfile-filtered pool of hosts.
>>>>>>
>>>>>> The advantage here is simplicity. The disadvantage lies in
>>>>>> flexibility for supporting dynamic operations.
>>>>>>
>>>>>>
>>>>>> The major difference between these options really only impacts the
>>>>>> initial pool of hosts to be used for launches, both the initial one
>>>>>> and any subsequent comm_spawns. Barring any commentary, I will
>>>>>> implement option 1 as this provides the maximum flexibility.
>>>>>>
>>>>>> Any thoughts? Other options we should consider?
>>>>>>
>>>>>> Thanks
>>>>>> Ralph
>>>>>>