Re: [OMPI devel] Fwd: lsf support / farm use models

2007-07-18 Thread Matthew Moskewicz

hi,

first of all, thanks for the info bill! i think i'm really starting to
piece things together now. you are right in that i'm working with a
6.x (6.2 with 6.1 devel libs ;) install here at cadence, without the
HPC extensions AFAIK. also, i think that our customers are mostly in
the same position -- i assume that the HPC extensions cost extra? or
perhaps admins just don't bother to install them.

so, there are at least three cases to consider:
LSF 7.0 or greater
LSF 6.x w/ HPC
LSF 6.x 'base'

i'll try to gather more data, but my feeling is that the market
penetration of both HPC and LSF 7.0 is low in our market (EDA vendors
and customers). i'd love to just stall until 7.0 is widely available,
but perhaps in the mean time it would be nice to have some backward
support for LSF 6.0 'base'. it seems like supporting LSF 6.x w/ HPC
might not be too useful, since:
a) it's not clear that the 'built in' "bsub -n N -a openmpi foo"
support will work with an MPI-2 dynamic-spawning application like mine
(or does it?),
b) i've heard that manually interfacing with the parallel application
manager directly is tricky?
c) most importantly, it's not clear that any of our customers have the
HPC support, and certainly not all of them, so i need to support LSF
6.0 'base' anyway -- it only needs to work until 7.0 is widely
available (< 1 year? i really have no idea ... will Platform end
support for 6.x at some particular time? or otherwise push customers
to upgrade? perhaps cadence can help there too ...).

under LSF 7.0 it looks like things are okay and that open-mpi will
support it in a released version 'soon' (< 6 months?). sooner than
our customers will have LSF 7.0 anyway, so that's fine.

as for LSF 6.0 'base', there are two workarounds that i see, and a
couple key questions that remain:

1) use bsub -n N, followed by N-1 ls_rtaske() calls (or similar).
while ls_rtaske() may not 'force' me to follow the queuing rules, if i
only launch on the proper machines, i should be okay, right? i don't
think IO and process marshaling (i'm not sure exactly what you mean by
that) are a problem since openmpi/orted handles those issues, i think?

2) use only bsub's of single processes, using some initial wrapper
script that bsub's all the jobs (master + N-1 slaves) needed to reach
the desired static allocation for openmpi. this seems to be what my
internal guy is suggesting is 'required'. integration with openmpi
might not be too hard, using suitable trickery. for example, the
wrapper script launches some wrapper processes that are basically
rexec daemons. the master waits for them to come up in the ras/lsf
component (tcp notify, perhaps via the launcher machine to avoid
needing to know the master hostname a priori), and then the pls/lsf
component uses the thin rexec daemons to launch orted. seems like a
bit of a silly workaround, but it does seem to both keep the queuing
system happy as well as not need ls_rtaske() or similar.

[ Note: (1) will fail if admins disable the ls_rexec() type of
functionality, but on a LSF 6.0 'base' system, this would seem to
disable all || job launching -- i.e. the shipped mpijob/pvmjob all use
lsgrun and such, so they would be disabled -- is there any other way i
could start the sub-processes within my allocation in that case? can i
just have bsub start N copies of something (maybe orted?)? that seems
like it might be hard to integrate with openmpi, though -- in that
case, i'd probably just implement option (2)]

Matt.

On 7/17/07, Bill McMillan  wrote:





> there appear to be some overlaps between the ls_* and lsb_* functions,
> but they seem basically compatible as far as i can tell. almost all
> the functions have a command line version as well, for example:
> lsb_submit()/bsub

  Like openmpi and orte, there are two layers in LSF.  The ls_* API's
  talk to what is/was historically called "LSF Base" and the lsb_* API's
  talk to what is/was historically called "LSF Batch".

[SNIP]

  Regards,
  Bill


-
Bill McMillan
Principal Technical Product Manager
Platform Computing



Re: [OMPI devel] devel Digest, Vol 801, Issue 1

2007-07-16 Thread Matthew Moskewicz

hi again,


>>> i'll probably just continue experimenting on my own for the
>>> moment (tracking
>>> any updates to the main trunk LSF support) to see if i can figure
>>> it out. any
>>> advice on the best way to get such back support into trunk, if and
>>> when it exists / is working?
>>
>> The *best* way would be for you to sign a third-party agreement -
>> see the
>>  web site for details and a copy. Barring that, the only option
>> would be to
>> submit the code through either Jeff or me. We greatly prefer the
>> agreement
>>  method as it is (a) less burdensome on us and (b) gives you greater
>>  flexibility.
>
> i'll talk to 'the man' -- it should be okay ... eventually, at
> least ...

See http://www.open-mpi.org/community/contribute/ for details.  As an
open project, we always welcome new developers, but we do need to
keep the IP tidy.



will do.


MPI-2 does support the MPI_COMM_JOIN and MPI_COMM_ACCEPT/
MPI_COMM_CONNECT models.  We do support this in Open MPI, but the
restrictions (in terms of ORTE) may not be sufficient for you.


perhaps i'll experiment -- any clues as to what the orte restrictions might be?



Some other random notes in no particular order:

- As you noted, the LSF support is *very* new; it was just added last
week.

- It also likely doesn't work yet; we started the integration work
and ran into a technical issue that required further discussion with
Platform.  They're currently looking into it; we stopped the LSF work
in ORTE until they get back to us.



i see -- i might be trying to work on the 6.x support today. can you
give me any hints on what the problem was in case i run into the same
issue?


- FWIW, one of the main reasons OMPI/ORTE didn't add extensive/
flexible support for dynamic addition of resources was the potential
for queue time.  Many systems run "full" all the time, so if you try
to acquire more resources, you could just sit in a queue for minutes/
hours/days/weeks before getting nodes.  While it is certainly
possible to program with this model, we didn't really want to get
into the rat's nest of corner cases that this would entail, especially
since very few users are asking for it.



yeah, it does seem like the queuing issue is critical. i think as long
as the requests for more resources are non-blocking, and the
application itself can deal with that, it shouldn't create too many
corner cases. in fact, if the application wants to block (potentially
for a long time) that might be okay too (i.e. on the initial big
allocation, just after some startup routine determines the needed
initial resources).


- That being said, MPI_THREAD_MULTIPLE and MPI_COMM_SPAWN *might*
offer a way out here.  But I think a) THREAD_MULTIPLE isn't working
yet (other OMPI members are working on this), and b) even when
THREAD_MULTIPLE works, there will be ORTE issues to deal with
(canceling pending resource allocations, etc.).  Ralph mentioned that
someone else is working on such things on the TM/PBS/Torque side; I
haven't followed that effort closely.



it seems that MPI_THREAD_MULTIPLE is to be avoided for now, but there
are perhaps other workarounds (using threads in other ways, etc.).
also, i'd love to hear about the existing efforts -- i'm hoping
someone working on them might be reading this ... ;)


> well, certainly part of the issue is the need (or at least strong
> preference) to support 6.2 -- but read on.
>

[SNIP LSF API info/guesswork]


I am certainly not an expert on LSF (nor its API) -- I only started
using it last week!  Do you have any contacts to ask at Platform?
They would likely be the best ones to discuss this with.


i'm in the same boat. i'll try to talk to the people here at cadence
that might have said contacts at Platform.



--
Jeff Squyres
Cisco Systems




Matt.


[OMPI devel] Fwd: lsf support / farm use models

2007-07-16 Thread Matthew Moskewicz

hi again,

[i'm going to snip out the sections that seem resolved]
[also, sorry about mutating the subject last time -- oops.]



This sounds fine - you'll find that the bproc pls does the exact same thing.
 In that case, we use #ifdefs since the APIs are actually different between
 the versions - we just create a wrapper inside the bproc pls code for the
 older version so that we can always call the same API. I'm not sure what the
 case will be in LSF - I believe the function calls are indeed different, so
 you might be able to use the same approach.


okay



> i'll probably just continue experimenting on my own for the moment (tracking
 > any updates to the main trunk LSF support) to see if i can figure it out. any
 > advice on the best way to get such back support into trunk, if and when it
 > exists / is working?


The *best* way would be for you to sign a third-party agreement - see the
 web site for details and a copy. Barring that, the only option would be to
 submit the code through either Jeff or me. We greatly prefer the agreement
 method as it is (a) less burdensome on us and (b) gives you greater
 flexibility.



i'll talk to 'the man' -- it should be okay ... eventually, at least ...



I can't speak to the motivation behind MPI-2 - the others in the group can
 do a much better job of that. What I can say is that we started out with a
 design to support such modes of operation as dynamic farms, but the group
 has been moving away from it due to a combination of performance impacts,
 reliability, and (frankly) lack of interest from our user community. Our
 intent now is to cut the RTE back to the basics required to support the MPI
 standard, including MPI-2 - which arguably says nothing about dynamic
 resource allocation.


that's true -- dynamic processes can be useful even under a static
allocation. in fact, in the short term for my particular application,
i'll probably do just that -- the user picks an initial allocation,
and then i just do the best i can. hopefully the allocations will be
'small enough' to get away without dynamic acquisition for a while (a
year?). beyond that, i guess i'm just one of those guys that thinks
it's a shame that MPI supplanted pvm so long ago in the first place.
and yes, i already looked into modifying pvm instead, no thank you ...
;)



Not to say we won't support it - just indicating that such support will have
 lower priority and that the system will be designed primarily for other
 priorities. So dynamic resource allocation will have to be considered as an
 "exception case", with all the attendant implications.



fair enough. i'm still hoping it won't be too exceptional, really. on
a related note, is it perhaps possible to 'join' running openMPI jobs
(using nameservers or whatnot)? if so, then application level
workarounds are also possible -- and can even be automated if the
application just launches a whole new copy of itself via whatever
top-level means was used to launch itself in the first place.


I think someone is feeding you a very extreme view of LSF. I have interacted
 for years with people working with LSF-based systems, and can count on the
 fingers of one hand the people who are operating the way you describe.


perhaps -- i'm trying to convince the guy it's worth taking a look at
enhancing open-mpi/open-rte as opposed to continuing with his internal
effort. maybe i'll get him to chime in directly on this issue --
however ...



*Can* you use LSF that way? Sure. Is that how most people use it? Not from
 what I have seen. Still, if that's a mode you want to support...have at it!
 ;-)



... that said, his library already has the needed workarounds for this
usage model. still, the communication is much simpler -- TCP point to
point only (which is 'enough' for me now, but i'm not sure about the
future), and i'm a little worried about the maturity and (software
engineering and performance wise) scalability of his effort.



Keep in mind, though, that Open MPI is driven by performance for large-scale
 multiprocessor computations. As I indicated earlier, the type of operation
 you are describing will have to be treated as an "exception case".
 Literally, this means you are welcome to try and make it work, but the
 fundamental operations of the system won't be designed to optimize that mode
 at the sacrifice of the primary objective.



again, fair enough. ;)


 > duly noted. i don't pretend to be able to follow the current control flow at
 > the moment. i think just running the debug version with all the printouts
 > should help me a lot there. also, perhaps if i just make a rmgr_dyn_lsf, and
 > don't use sds, then there might not be as many subsystems involved to
 > complain. actually, i suspect the LSF specific part would be (very) small, so
 > perhaps it could be rmgr_dynurm + a new component type like dynraspls to
 > encapsulate the DRM specific part.


You have to use sds as this is the framework where the application process
 learns its name. 

[OMPI devel] Fwd: lsf support / farm use models

2007-07-15 Thread Matthew Moskewicz

Welcome! Yes, Jeff and I have been working on the LSF support based on 7.0
features in collab with the folks at Platform.



sounds good. i'm happy to be involved with such a nice active project!



> 1) it appears that you (jeff, i guess ;) are using new LSF 7.0 API features.
> i'm working to support customers in the EDA space, and it's not clear if/when
> they will migrate to 7.0 -- not to mention that our company (cadence) doesn't
> appear to have LSF 7.0 yet. i'm still looking into the details, but it
> appears that (from the Platform docs) lsb_getalloc is probably just a thin
> wrapper around the LSB_MCPU_HOSTS (spelling?) environment variable. so that
> could be worked around fairly easily. i dunno about lsb_launch -- it seems
> equivalent to a set of ls_rtask() calls (one per process). however, i have
> heard that there can be significant subtleties with the semantics of these
> functions, in terms of compatibility across differently configured
> LSF-controlled farms, specifically with regard to administrators' ability to
> track and control job execution. personally, i don't see how it's really
> possible for LSF to prevent 'bad' users from spamming out jobs or
> short-cutting queues, but perhaps some of the methods they attempt to use can
> complicate things for a library like open-rte.

After lengthy discussions with Platform, it was deemed the best path forward
is to use the lsb_getalloc interface. While it currently reads the environment
variable, they indicated a potential change to read a file instead for
scalability. Rather than chasing any changes, we all agreed that using
lsb_getalloc would remain the "stable" interface - so that is what we used.



understood.

Similar reasons for using lsb_launch. I would really advise against making
any changes away from that support. Instead, we could take a lesson from our
bproc support and simply (a) detect if we are on a pre-7.0 release, and then
(b) build our own internal wrapper that provides back-support. See the bproc
pls component for examples.



that sounds fine -- should just be a matter of a little configure magic,
right? i already had to change the current configure stuff to be able to
build at all under 6.2 (since the current configure check requires 7.0 to
pass), so i guess it shouldn't be too much harder to mimic the bproc method
of detecting multiple versions, assuming it's really the same sort of thing.
basically, i'd keep the main LSF configure check downgraded as i have
currently done in my working copy, but add a new 7.0 check that is really
the current trunk check.
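
something like this is what i have in mind for the version probe --
hypothetical, untested m4; the library names (-lbat for batch, -llsf for
base) and the macro usage are from memory, so treat them as guesses:

```
# hypothetical configure.m4 sketch: accept any LSF install, then probe
# for the 7.0-only entry points so the code can #ifdef a back-support
# wrapper on older releases
AC_CHECK_HEADERS([lsf/lsbatch.h])
AC_CHECK_LIB([lsf], [ls_info])
AC_CHECK_LIB([bat], [lsb_init], [], [], [-llsf])
# present only in LSF 7.0+: defines HAVE_LSB_LAUNCH / HAVE_LSB_GETALLOC
AC_CHECK_FUNCS([lsb_launch lsb_getalloc])
```

the AC_CHECK_FUNCS results would then gate which implementation gets built.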

then, i'll make signature-compatible replacements (with the same names? or
add internal functions to abstract things? or just add #ifdef's inline where
they are used?) for each missing LSF 7.0 function (implemented using the 6.1 or
6.2 API), and have configure only build them if the system LSF doesn't have
them. uhm, once i figure out how to do that, anyway ... i guess i'll ask for
more help if the bproc code doesn't enlighten me. if successful, i should be
able to track trunk easily with respect to the LSF version issue at least.

i'll probably just continue experimenting on my own for the moment (tracking
any updates to the main trunk LSF support) to see if i can figure it out.
any advice on the best way to get such back support into trunk, if and when
it exists / is working?




> 2) this brings us to point 2 -- upon talking to the author(s) of cadence's
> internal open-rte-like library, several key issues were raised. mainly,
> customers want their applications to be 'farm-friendly' in several key ways.
> firstly, they do not want any persistent daemons running outside of a given
> job -- this requirement seems met by the current open-mpi default behavior,
> at least as far as i can tell. secondly, they prefer (strongly) that
> applications acquire resources incrementally, and perform work with whatever
> nodes are currently available, rather than forcing a large up-front node
> allocation. fault tolerance is nice too, although it's unclear to me if it's
> really practically needed. in any case, many of our applications can
> structure their computation to use resources in just such a way, generally
> by dividing the work into independent, restartable pieces (i.e. they are
> embarrassingly ||). also, MPI communication + MPI-2 process creation seems
> to be a reasonable interface for handling communication and dynamic process
> creation on the application side. however, it's not clear that open-rte
> supports the needed dynamic resource acquisition model in any of the ras/pls
> components i looked at. in fact, other than just folding everything into the
> pls component, it's not clear that the entire flow via the rmgr really
> supports it very well. specifically for LSF, the use model is that the
> initial job either is created with bsub/lsb_submit(), (or automatically
> submits itself as step zero perhaps) to run initially on N machines. N
> should be 'small' (1-16) --

[OMPI devel] lsf support / farm use models

2007-07-14 Thread Matthew Moskewicz

hi everyone,

firstly, i'm new around here, and somewhat clueless when it comes to the
details of working with a big autoconfiscated project like
open-rte/open-mpi at the svn checkout level ...

i've read some of the archives that turned up in searches for terms like
'LSF', and it would seem there was some discussion about adding some form of
LSF support to open-rte, but that the discussion ended a while back. so,
after playing around with the 1.2.3 release tarball for a while, and
reading various pieces of the code until i had a (vague) idea of the
top-level control flow and such, i decided i was ready to try to add ras and
pls components to support LSF. once i had the build system up, i tried to
create an ras/lsf directory, and slightly to my surprise, it already
existed. i was kinda hoping for that, but it appears to be *very* fresh code
at the moment. nonetheless, i played around a bit more, and ran into two
issues:

1) it appears that you (jeff, i guess ;) are using new LSF 7.0 API features.
i'm working to support customers in the EDA space, and it's not clear
if/when they will migrate to 7.0 -- not to mention that our company
(cadence) doesn't appear to have LSF 7.0 yet. i'm still looking into the
details, but it appears that (from the Platform docs) lsb_getalloc is
probably just a thin wrapper around the LSB_MCPU_HOSTS (spelling?)
environment variable. so that could be worked around fairly easily. i dunno
about lsb_launch -- it seems equivalent to a set of ls_rtask() calls (one
per process). however, i have heard that there can be significant subtleties
with the semantics of these functions, in terms of compatibility across
differently configured LSF-controlled farms, specifically with regard to
administrators' ability to track and control job execution. personally, i
don't see how it's really possible for LSF to prevent 'bad' users from
spamming out jobs or short-cutting queues, but perhaps some of the methods
they attempt to use can complicate things for a library like open-rte.

2) this brings us to point 2 -- upon talking to the author(s) of cadence's
internal open-rte-like library, several key issues were raised. mainly,
customers want their applications to be 'farm-friendly' in several key ways.
firstly, they do not want any persistent daemons running outside of a given
job -- this requirement seems met by the current open-mpi default behavior,
at least as far as i can tell. secondly, they prefer (strongly) that
applications acquire resources incrementally, and perform work with whatever
nodes are currently available, rather than forcing a large up-front node
allocation. fault tolerance is nice too, although it's unclear to me if it's
really practically needed. in any case, many of our applications can
structure their computation to use resources in just such a way, generally
by dividing the work into independent, restartable pieces (i.e. they are
embarrassingly ||). also, MPI communication + MPI-2 process creation seems
to be a reasonable interface for handling communication and dynamic process
creation on the application side. however, it's not clear that open-rte
supports the needed dynamic resource acquisition model in any of the ras/pls
components i looked at. in fact, other than just folding everything into the
pls component, it's not clear that the entire flow via the rmgr really
supports it very well. specifically for LSF, the use model is that the
initial job either is created with bsub/lsb_submit(),  (or automatically
submits itself as step zero perhaps) to run initially on N machines. N
should be 'small' (1-16) -- perhaps only 1 for simplicity. then, as the
application runs, it will continue to consume more resources as limited by
the farm status, the user selection, and the max # of processes that the job
can usefully support (generally 'large' -- 100-1000 cpus).

so, i figure it's up to me to implement this stuff ;) ... clearly, i want to
keep the 'normal' style ras/pls for LSF working, but somehow add the dynamic
behavior as an option. my initial thought was to (in the dynamic case)
basically ignore/fudge the ras/rmaps(/pls?) stages and simply use
bsub/lsb_submit() in pls to launch new daemons as needed/requested. again,
though, it's not clear that the current control flow supports this well.
given that there may be a large (10sec - 15min) delay between lsb_submit()
and job launch, it may be necessary to both acquire minimum size blocks of
new daemons at a time, and to have some non-blocking way to perform
spawning. for example, in the current code, the MPI-2 spawn is blocking
because it needs to return a communicator to the spawned process. however,
this is not really necessary for the application to continue -- it can
continue with other work until the new worker is up and running. perhaps
some form of multi-threading could help with this, but it's not totally
clear. i think i would prefer some lower-level open-rte calls that perform
daemon pre-allocation (i.e. dynamic ras/daemon