On Jul 15, 2007, at 11:18 PM, Matthew Moskewicz wrote:

i'll probably just continue experimenting on my own for the moment (tracking any updates to the main trunk LSF support) to see if i can figure it out. any advice on the best way to get such support back into trunk, if and when it exists / is working?

The *best* way would be for you to sign a third-party agreement - see the web site for details and a copy. Barring that, the only option would be to submit the code through either Jeff or me. We greatly prefer the agreement method as it is (a) less burdensome on us and (b) gives you greater flexibility.

i'll talk to 'the man' -- it should be okay ... eventually, at least ...

See http://www.open-mpi.org/community/contribute/ for details. As an open project, we always welcome new developers, but we do need to keep the IP tidy.

I can't speak to the motivation behind MPI-2 - the others in the group can do a much better job of that. What I can say is that we started out with a design to support such modes of operation as dynamic farms, but the group has been moving away from it due to a combination of performance impacts, reliability, and (frankly) lack of interest from our user community. Our intent now is to cut the RTE back to the basics required to support the MPI standard, including MPI-2 - which arguably says nothing about dynamic resource allocation.

that's true -- dynamic processes can be useful even under a static
allocation. in fact, in the short term for my particular application,
i'll probably do just that -- the user picks an initial allocation,
and then i just do the best i can. hopefully the allocations will be
'small enough' to get away without dynamic acquisition for a while (a
year?).

FWIW, our experience with the MPI layer has shown that the vast majority of applications only need a specific set of initial resources (hosts/cpus) and then just use those. We have seen only a small class of applications that truly benefit from dynamically adding / removing resources in the middle of the run. The canonical manager/worker model fits this criterion (i.e., benefits from dynamically adding/removing resources), but as you noted, it also works just fine with a static set of resources. FWIW, I've seen many MPI applications written with the manager/worker model to ease their startup with a variable number of nodes (e.g., under a resource manager) -- they'll just launch as many processes as they get in their job and then use manager/worker from there to "discover" how many processes they got, use them all, etc.
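
To make the static manager/worker pattern concrete, here is a minimal Python sketch. It is an illustration only, not Open MPI code: threads stand in for MPI processes, and the name run_manager_worker and the squaring "work" are hypothetical placeholders. The point it shows is that the manager learns the pool size once at startup and then simply feeds whatever workers it was given.

```python
import queue
import threading


def run_manager_worker(tasks, n_workers):
    """Run tasks on a fixed pool of n_workers; return results in task order.

    Threads stand in for MPI processes (an assumption for illustration).
    The pool size is whatever the job was given at startup and never
    changes during the run.
    """
    work = queue.Queue()
    results = [None] * len(tasks)

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work for this worker
                return
            index, value = item
            results[index] = value * value  # placeholder for the real work

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()

    # Manager: hand out every task, then one sentinel per worker.
    for index, value in enumerate(tasks):
        work.put((index, value))
    for _ in workers:
        work.put(None)
    for w in workers:
        w.join()
    return results
```

The same call works unchanged whether the job was given 2 workers or 200; only the run time differs, which is why this model copes well with a variable but static allocation.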

beyond that, i guess i'm just one of those guys that thinks
it's a shame that MPI supplanted pvm so long ago in the first place.
and yes, i already looked into modifying pvm instead, no thank you ...
;)

A religious argument. ;-) There were certainly good things about PVM, and MPI managed to take at least some of them.

Not to say we won't support it - just indicating that such support will have lower priority and that the system will be designed primarily for other priorities. So dynamic resource allocation will have to be considered as an "exception case", with all the attendant implications.

fair enough. i'm still hoping it won't be too exceptional, really. on
a related note, is it perhaps possible to 'join' running openMPI jobs
(using nameservers or whatnot)? if so, then application level
workarounds are also possible -- and can even be automated if the
application just launches a whole new copy of itself via whatever
top-level means was used to launch itself in the first place.

MPI-2 does support the MPI_COMM_JOIN and MPI_COMM_ACCEPT/MPI_COMM_CONNECT models. We do support this in Open MPI, but given the restrictions (in terms of ORTE), it may not be sufficient for you.

Some other random notes in no particular order:

- As you noted, the LSF support is *very* new; it was just added last week.

- It also likely doesn't work yet; we started the integration work and ran into a technical issue that required further discussion with Platform. They're currently looking into it; we stopped the LSF work in ORTE until they get back to us.

- FWIW, one of the main reasons OMPI/ORTE didn't add extensive/flexible support for dynamic addition of resources was the potential for queue time. Many systems run "full" all the time, so if you try to acquire more resources, you could just sit in a queue for minutes/hours/days/weeks before getting nodes. While it is certainly possible to program with this model, we didn't really want to get into the rat's nest of corner cases that this would entail, especially since very few users are asking for it.

- That being said, MPI_THREAD_MULTIPLE and MPI_COMM_SPAWN *might* offer a way out here. But I think a) THREAD_MULTIPLE isn't working yet (other OMPI members are working on this), and b) even when THREAD_MULTIPLE works, there will be ORTE issues to deal with (canceling pending resource allocations, etc.). Ralph mentioned that someone else is working on such things on the TM/PBS/Torque side; I haven't followed that effort closely.

well, certainly part of the issue is the need (or at least strong
preference) to support 6.2 -- but read on.

hmm, i'll need to review the APIs in more detail, but here is my
current understanding:
there appear to be some overlaps between the ls_* and lsb_* functions,
but they seem basically compatible as far as i can tell. almost all
the functions have a command line version as well, for example:
lsb_submit()/bsub

lsb_getalloc()/none and lsb_launch()/blaunch are new with LSF 7.0, but
appear to just be a different (simpler) interface to existing
functionality in the LSB_* env vars and the ls_rexec()/lsgrun commands
-- although, as you say, perhaps platform will hook or enhance them
later. but, the key issue is that lsb_launch() just starts tasks -- it
does not perform or interact with the queue or job control (much?).
so, you can't use these functions to get an allocation in the first
place, and you have to be careful not to use them as a way around the
queuing system.

[ as a side note, the function ls_rexecv()/lsgrun is the one i have
heard admins do not like because it can break queuing/accounting, and
might try to disable somehow. i don't really buy that, because it's
not like you can disable it and have the system still work, since (as
above) parallel job launching depends on it. i guess if you really don't
care about parallel launching maybe you could. but, if used properly after a
proper allocation i don't think there should (or even can) be a
problem. ]

so, lsb_submit()/bsub is a combination allocate/launch -- you specify
the allocation size you want, and when it's all ready, it runs the
'job' (really the job launcher) only on one (randomly chosen) 'head'
node from the allocation, with the env vars set so the launcher can
use ls_rexec/lsgrun functions to start the rest of the job. there are
of course various script wrappers you can use (mpijob, pvmjob, etc)
instead of your 'real job'. then, i think lsf *should* try to track
what processes get started via the wrapper / head process so it knows
they are part of the same job. i dunno if it really does that -- but,
my guess is that at the least it assumes the allocation is in use
until the original process ends. in any case, the wrapper / head
process examines the environment vars and uses ls_rexec()/lsgrun or
the like to actually run N copies of the 'real job' executable. in
7.0, it can conveniently use lsb_getalloc() and lsb_launch(), but that
doesn't really change any semantics as far as i know. one could
imagine that calling lsb_launch() instead of ls_rexec() might be
preferable from a process tracking point of view, but i don't see why
Platform couldn't hook ls_rexec() just as well as lsb_launch().
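
as a rough illustration of the "head process examines the environment vars" step, here is a small Python sketch. it assumes the allocation is exposed through LSB_MCPU_HOSTS (alternating host name and slot count, e.g. "hostA 2 hostB 4") or LSB_HOSTS (one host name per slot); those two variable names are my reading of "the LSB_* env vars" and should be checked against the LSF documentation for the version in use. the helper names parse_mcpu_hosts and launch_plan are hypothetical, and the actual launching via ls_rexec()/lsgrun or lsb_launch() is deliberately left out.

```python
import os


def parse_mcpu_hosts(value):
    """Parse a 'hostA 2 hostB 4' style string into [(host, slots), ...]."""
    fields = value.split()
    if len(fields) % 2 != 0:
        raise ValueError("expected alternating host/slot-count fields")
    return [(fields[i], int(fields[i + 1])) for i in range(0, len(fields), 2)]


def launch_plan(environ=None):
    """Return one host name per task the head process should start.

    Assumes (not confirmed by the thread) that the allocation is visible
    in LSB_MCPU_HOSTS or, failing that, LSB_HOSTS.
    """
    environ = os.environ if environ is None else environ
    if "LSB_MCPU_HOSTS" in environ:
        pairs = parse_mcpu_hosts(environ["LSB_MCPU_HOSTS"])
        return [host for host, slots in pairs for _ in range(slots)]
    # LSB_HOSTS lists one host name per allocated slot.
    return environ.get("LSB_HOSTS", "").split()
```

the head process would then start one copy of the 'real job' per entry in the returned list, and nothing outside that list, which is what keeps it inside the allocation.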

i really need to get a little more confidence on that issue, since
it's what determines what actions will (or perhaps already do in
practice) 'break' the queuing/reporting system.

there are some 'allocate only' functions as well, such as
ls_placereq()/lsplace -- these can just return a host list / set the
env vars without running anything at first. apparently, you need to
run something 'soon' on the resultant hosts or the load balancer might
get confused and reuse them. also, since this doesn't seem to go
through the queues, it's probably not a viable set of functions to
really use. a red herring, as far as i'm concerned.

there is also an lsb_runjob() that is similar to lsb_launch(), but for
an already submitted job. so, if one were to lsb_submit() with an
option set to never launch it automatically, and then one were to run
lsb_runjob(), you can avoid the queue and/or force the use of certain
hosts? i guess this is also not a good function to use, but at least
the queuing system would be aware of any bad behavior (queue skipping
via ls_placereq() to get extra hosts, for instance) in this case ...

there does *not* appear to be an option to lsb_submit() that allows a
non-blocking programmatic callback when allocation is complete. if
there was, it would need to deal with process tracking issues, or
maybe just merge the old and new jobs somehow in that case.

so to speak to the original point, it would indeed be nice to be able
to do additional allocations (and then an lsb_launch) with a simple
programmatic interface for completeness, but i don't see one. however,
lsb_submit() is pretty close -- it makes a 'new' job, but i think
that's okay. the initial daemon that gets run on the 'head' (i.e.
randomly chosen) node of the new job will run an lsb_launch() or
similar to start up the remaining N-1 daemons as children -- thus
hopefully keeping the queuing system and process tracking happy. or
you could use some LSF option / wrapper script to tell it to run the
same daemon on all N hosts for you, if some suitable option/wrapper
exists anyway. so, in summary lsb_submit() does allocation + one
(non-optional) launch on allocation completion. lsb_launch() (or
similar) does only launching, should probably only be run from the
single process started from an lsb_submit(), and should only launch
things on the allocation given by lsb_getalloc() (or env vars).

I am certainly not an expert on LSF (nor its API) -- I only started using it last week! Do you have any contacts to ask at Platform? They would likely be the best ones to discuss this with.

--
Jeff Squyres
Cisco Systems
