Hi Matt,
On 7/15/07 1:49 PM, "Matthew Moskewicz" <moske...@alumni.princeton.edu> wrote:

>> Welcome! Yes, Jeff and I have been working on the LSF support based on 7.0 features in collaboration with the folks at Platform.
>
> sounds good. i'm happy to be involved with such a nice active project!
>
>>> 1) it appears that you (jeff, i guess ;) are using new LSF 7.0 API features. i'm working to support customers in the EDA space, and it's not clear if/when they will migrate to 7.0 -- not to mention that our company (cadence) doesn't appear to have LSF 7.0 yet. i'm still looking into the details, but it appears (from the Platform docs) that lsb_getalloc is probably just a thin wrapper around the LSB_MCPU_HOSTS (spelling?) environment variable, so that could be worked around fairly easily. i dunno about lsb_launch -- it seems equivalent to a set of ls_rtask() calls (one per process). however, i have heard that there can be significant subtleties with the semantics of these functions, in terms of compatibility across differently configured LSF-controlled farms, specifically with regard to administrators' ability to track and control job execution. personally, i don't see how it's really possible for LSF to prevent 'bad' users from spamming out jobs or short-cutting queues, but perhaps some of the methods they attempt to use can complicate things for a library like open-rte.
>>
>> After lengthy discussions with Platform, it was deemed that the best path forward is to use the lsb_getalloc interface. While it currently reads the environment variable, they indicated a potential change to read a file instead for scalability. Rather than chasing any changes, we all agreed that lsb_getalloc would remain the "stable" interface - so that is what we used.
>
> understood.
>
>> Similar reasons for using lsb_launch. I would really advise against making any changes away from that support. Instead, we could take a lesson from our bproc support and simply (a) detect if we are on a pre-7.0 release, and then (b) build our own internal wrapper that provides back-support. See the bproc pls component for examples.
>
> that sounds fine -- should just be a matter of a little configure magic, right? i already had to change the current configure stuff to be able to build at all under 6.2 (since the current configure check requires 7.0 to pass), so i guess it shouldn't be too much harder to mimic the bproc method of detecting multiple versions, assuming it's really the same sort of thing. basically, i'd keep the main LSF configure check downgraded as i have currently done in my working copy, but add a new 7.0 check that is really the current trunk check.
>
> then, i'll make signature-compatible replacements (with the same names? or add internal functions to abstract things? or just add #ifdefs inline where they are used?) for each missing LSF 7.0 function (implemented using the 6.1 or 6.2 API), and have configure only build them if the system LSF doesn't have them. uhm, once i figure out how to do that, anyway ... i guess i'll ask for more help if the bproc code doesn't enlighten me. if successful, i should be able to track trunk easily with respect to the LSF version issue, at least.

This sounds fine - you'll find that the bproc pls does the exact same thing.
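
To make the back-port idea concrete, here is a minimal sketch of a drop-in lsb_getalloc() for pre-7.0 LSF. It assumes the 7.0 prototype is roughly int lsb_getalloc(char ***hostlist) (returning the number of allocated slots), that LSB_MCPU_HOSTS holds alternating "host ncpus" pairs, and that configure defines a guard macro - OMPI_HAVE_LSB_GETALLOC is a made-up placeholder for whatever the real check would set:

    /* Back-compat sketch only: provide lsb_getalloc() on pre-7.0 LSF by
     * parsing LSB_MCPU_HOSTS (assumed format: "hostA 2 hostB 4 ...").
     * OMPI_HAVE_LSB_GETALLOC is a hypothetical configure-set guard. */
    #if !defined(OMPI_HAVE_LSB_GETALLOC)

    #include <stdlib.h>
    #include <string.h>

    /* Assumed to match the 7.0 prototype: fill *hostlist with one entry per
     * allocated slot and return the slot count, or -1 on error. */
    int lsb_getalloc(char ***hostlist)
    {
        char *env = getenv("LSB_MCPU_HOSTS");
        char *copy, *tok, *save = NULL;
        char **hosts = NULL;
        int nslots = 0;

        if (NULL == env || NULL == hostlist) {
            return -1;
        }
        if (NULL == (copy = strdup(env))) {
            return -1;
        }

        /* tokens alternate: hostname, then the number of cpus on that host */
        for (tok = strtok_r(copy, " \t", &save); NULL != tok;
             tok = strtok_r(NULL, " \t", &save)) {
            char *count_str = strtok_r(NULL, " \t", &save);
            int count = (NULL != count_str) ? atoi(count_str) : 1;

            for (int i = 0; i < count; ++i) {
                char **tmp = realloc(hosts, (nslots + 1) * sizeof(char *));
                if (NULL == tmp) {
                    for (int j = 0; j < nslots; ++j) free(hosts[j]);
                    free(hosts);
                    free(copy);
                    return -1;
                }
                hosts = tmp;
                hosts[nslots++] = strdup(tok);
            }
        }
        free(copy);

        *hostlist = hosts;
        return nslots;
    }

    #endif /* !OMPI_HAVE_LSB_GETALLOC */

A signature-compatible lsb_launch() replacement could then, as Matt suggests, loop over ls_rtask() (one call per process) across the hosts returned here; whether that reproduces the accounting and tracking behavior administrators rely on is exactly the subtlety he raises, so it would need to be checked with Platform.
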
In the bproc case, we use #ifdefs since the APIs are actually different between the versions - we just create a wrapper inside the bproc pls code for the older version so that we can always call the same API. I'm not sure what the case will be in LSF - I believe the function calls are indeed different, so you might be able to use the same approach.

> i'll probably just continue experimenting on my own for the moment (tracking any updates to the main trunk LSF support) to see if i can figure it out. any advice on the best way to get such back-support into trunk, if and when it exists / is working?

The *best* way would be for you to sign a third-party agreement - see the web site for details and a copy. Barring that, the only option would be to submit the code through either Jeff or me. We greatly prefer the agreement method as it is (a) less burdensome on us and (b) gives you greater flexibility.

>>> 2) this brings us to point 2 -- upon talking to the author(s) of cadence's internal open-rte-like library, several key issues were raised. mainly, customers want their applications to be 'farm-friendly' in several key ways. firstly, they do not want any persistent daemons running outside of a given job -- this requirement seems to be met by the current open-mpi default behavior, at least as far as i can tell. secondly, they prefer (strongly) that applications acquire resources incrementally, and perform work with whatever nodes are currently available, rather than forcing a large up-front node allocation. fault tolerance is nice too, although it's unclear to me if it's really practically needed. in any case, many of our applications can structure their computation to use resources in just such a way, generally by dividing the work into independent, restartable pieces (i.e. they are embarrassingly parallel). also, MPI communication + MPI-2 process creation seems to be a reasonable interface for handling communication and dynamic process creation on the application side. however, it's not clear that open-rte supports the needed dynamic resource acquisition model in any of the ras/pls components i looked at. in fact, other than just folding everything into the pls component, it's not clear that the entire flow via the rmgr really supports it very well. specifically for LSF, the use model is that the initial job is either created with bsub/lsb_submit() (or perhaps automatically submits itself as step zero) to run initially on N machines. N should be 'small' (1-16) -- perhaps only 1 for simplicity. then, as the application runs, it will continue to consume more resources as limited by the farm status, the user's selection, and the max # of processes that the job can usefully support (generally 'large' -- 100-1000 cpus).
>>
>> OpenRTE will be undergoing some changes shortly, so I would strongly recommend you avoid making major changes without first discussing how they fit into the new design with us. While cadence's system is a nice one, there are tradeoffs in every design approach - and it isn't clear that theirs is necessarily any better than another.
>>
>> We could argue for quite some time about their beliefs regarding customers' desires - I have heard these statements in multiple directions, with people citing claims of customer "demands" pointing every which way.
>> Bottom line, from what I can tell, is that customers want something that works and is transparent to them - how that is done is largely irrelevant.
>
> yeah, i agree with that completely.
>
>> We have other people working on dynamic resource allocation for other systems (e.g., TM), and are making some modifications to better support that kind of requirement. We can discuss those with you if you'd like to see how they meet your needs. Not much was done in the past in that regard because people weren't interested in it. Frankly, we are somewhat moving in the other direction now, so supporting it in the manner you describe may possibly become harder rather than easier. You may have to accept some less-than-ideal result, I fear.
>
> well, i guess it basically boils down to having some level of support for dynamic resource allocation, so that if an application supports or needs to structure its computation that way, it can do so. my impression from reading the MPI-2 spec (or somewhere else?) was that a big part of the motivation behind MPI-2 dynamic process creation was to support just such models (a la pvm) -- and it seems that the rte layer needs matching support, or it can't really work well. if there is some support at all, or if it's not too hard to add, i guess i'll be happy.

I can't speak to the motivation behind MPI-2 - the others in the group can do a much better job of that. What I can say is that we started out with a design to support such modes of operation as dynamic farms, but the group has been moving away from it due to a combination of performance impacts, reliability concerns, and (frankly) lack of interest from our user community. Our intent now is to cut the RTE back to the basics required to support the MPI standard, including MPI-2 - which arguably says nothing about dynamic resource allocation. That's not to say we won't support it - just that such support will have lower priority and that the system will be designed primarily for other priorities. So dynamic resource allocation will have to be considered an "exception case", with all the attendant implications.

> that said, i'd like to reiterate (and skip this paragraph if you get bored) that, at a basic level, i think the ideas behind pvm and dynamic resource allocation are pretty well founded and useful. the idea is to work *with* the existing DRM, rather than only having a private allocation layer over a static allocation from the host DRM. for applications that are capable of being dynamically flexible about the number of CPUs they need, static initial allocation just doesn't work too well -- a small initial allocation may limit the parallelism too much, whereas a large allocation may be wasteful, and may (vastly) increase the queue time to job startup. in fact, when the queue time is long, it's extra-wasteful, because the DRM has to hold a bunch of hosts idle waiting for the whole allocation to be satisfied. in all i've heard, this seems to be the most 'real' customer issue -- that is, i believe the other cadence distributed-processing guys when they say that they are having or have had problems with various applications -- both MPI-based (LAM/MPI i think -- which had other problems concerning the daemon issue) as well as custom frameworks that simply made large (>100) bsub requests.
> the most pathological thing i've heard internally is that for maximum portability across different LSF farms, not only do you need to acquire resources incrementally, but you need to acquire each CPU individually using a single bsub -- that is, you shouldn't even use the -n option to bsub *at all*. this actually simplifies things in some ways, but i don't really know if i believe it. anyway, that's what i've heard from the cadence open-rte-like people that are really running applications on customer farms. somehow, there are problems with accounting or something on certain farms if you bsub multi-cpu tasks. on second thought, i can actually believe this, because the EDA community really doesn't run many true scientific-computing-style multiprocessor jobs at the moment -- mainly, they are running multiple separate jobs that only loosely communicate via the file system, or not at all -- there's just some script that launches all the pieces of a job, and the pieces are in charge of coordinating among themselves if needed. since applications have evolved from this 'primitive' form of using multiple CPUs, it's not too surprising that farms might not be well configured to support the more traditional scientific-computing use models. i'm continuing to investigate the issue, and i'll have more data as i start enabling farm support in my own app on some real customer farms -- assuming i can get something working with open-mpi, of course! ;)

I think someone is feeding you a very extreme view of LSF. I have interacted for years with people working with LSF-based systems, and I can count on the fingers of one hand the people who are operating the way you describe. *Can* you use LSF that way? Sure. Is that how most people use it? Not from what I have seen. Still, if that's a mode you want to support... have at it! ;-) Keep in mind, though, that Open MPI is driven by performance for large-scale multiprocessor computations. As I indicated earlier, the type of operation you are describing will have to be treated as an "exception case". Literally, this means you are welcome to try to make it work, but the fundamental operations of the system won't be designed to optimize that mode at the sacrifice of the primary objective.

>>> so, i figure it's up to me to implement this stuff ;) ... clearly, i want to keep the 'normal' style ras/pls for LSF working, but somehow add the dynamic behavior as an option. my initial thought was to (in the dynamic case) basically ignore/fudge the ras/rmaps(/pls?) stages and simply use bsub/lsb_submit() in the pls to launch new daemons as needed/requested.
>>
>> Just an FYI: this could cause unexpected behavior in the current implementation, as a number of subsystems depend upon the info coming from those stages. It may not be as big a problem in the revised implementation currently underway.
>
> duly noted. i don't pretend to be able to follow the current control flow at the moment. i think just running the debug version with all the printouts should help me a lot there. also, perhaps if i just make a rmgr_dyn_lsf, and don't use sds, then there might not be as many subsystems involved to complain. actually, i suspect the LSF-specific part would be (very) small, so perhaps it could be rmgr_dynurm + a new component type like dynraspls to encapsulate the DRM-specific part.

You have to use sds as this is the framework where the application process learns its name.
That framework will be receiving more responsibilities in the revised implementation, so you'll unfortunately have to use it. Your best bet (IMHO) would be to create an lsf_farm component in the new PLM once we get the system revised.

>>> again, though, it's not clear that the current control flow supports this well. given that there may be a large (10 sec - 15 min) delay between lsb_submit() and job launch, it may be necessary both to acquire minimum-size blocks of new daemons at a time, and to have some non-blocking way to perform spawning. for example, in the current code, the MPI-2 spawn is blocking because it needs to return a communicator to the spawned process.
>>
>> Actually, that is not the real reason. It is blocking because the parent wants to send a message to the new children telling them where/how to rendezvous with it. The problem is that the parent doesn't know the name of the child until after the spawn is completed - because we need the child's OOB contact info so we can send the message. The easiest way to ensure that all the handshakes occurred correctly was to simply make comm_spawn blocking.
>>
>> Given that comm_spawn in our current environments is relatively fast, that was deemed to be an acceptable solution. Obviously, your stated time frames are much, much longer, so that might not work in those cases.
>>
>> It would be easier to change it under the revised implementation, which will better support that kind of difference between environments. In the current one, it could be quite problematic.
>>> however, this is not really necessary for the application to continue -- it can continue with other work until the new worker is up and running. perhaps some form of multi-threading could help with this, but it's not totally clear. i think i would prefer some lower-level open-rte calls that perform daemon pre-allocation (i.e. dynamic ras/daemon startup), such that i know that if there are idle daemons, it is safe to spawn without risk of blocking.
>>
>> I'll have to leave that up to the MPI folks on the team - we have historically resisted the idea of having one environment behave differently from another, so as to limit "user astonishment". However, if they can live with that change, I personally have no problem with it.
>>
>> We just made a significant change to daemon launch procedures, and the flow between the stages is going to be completely revamped over the next few months. How that affects your thinking is unclear to me at the moment, but it might be worth further discussion.
>>
>> Just as an FYI: we already check to see if there are available daemons, and we do spawn upon them if so. The issue here sounds like it is more one of obtaining a larger-than-immediately-needed allocation, and spawning daemons on all of it just in case they are needed. There is nothing in the system that precludes doing so - we made a design decision early on not to do it, but that's not a requirement. Again, the revised implementation would let you do that much more easily than the current one.
>
> hmm, i'm thinking that if there was a way to directly tell open-rte to acquire more daemons non-blockingly, that would be enough. in the LSF case, i think one would bsub the daemons themselves (with arguments sufficient to phone home, so no sds needed?), so (node acquisition == daemon startup).

You could - though this sounds pretty non-scalable to me.
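
On the MPI side, the closest standard hook for "check before you spawn" (the universe-size idea Matt brings up below) is the MPI_UNIVERSE_SIZE attribute: an application can compare it against the current world size to estimate how many additional processes the environment claims it can host before calling MPI_Comm_spawn. A minimal sketch - the "./worker" command is purely illustrative:

    /* Sketch: consult MPI_UNIVERSE_SIZE before deciding whether to spawn.
     * "./worker" is an illustrative placeholder, not something from the thread. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int world_size, *universe_size, flag;
        MPI_Comm children = MPI_COMM_NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* Advisory attribute: if set, it suggests how many processes the
         * run-time environment expects to be able to host in total. */
        MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE,
                          &universe_size, &flag);

        if (flag && *universe_size > world_size) {
            int nspawn = *universe_size - world_size;
            /* Still a blocking call in the current implementation, as noted
             * above - this only avoids asking for slots that clearly are
             * not there. */
            MPI_Comm_spawn("./worker", MPI_ARGV_NULL, nspawn, MPI_INFO_NULL,
                           0, MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);
        } else {
            printf("no extra slots advertised; continuing without spawning\n");
        }

        if (MPI_COMM_NULL != children) {
            MPI_Comm_disconnect(&children);
        }
        MPI_Finalize();
        return 0;
    }

Note that, as far as the standard is concerned, MPI_UNIVERSE_SIZE is set at initialization time, so the "value could change as resources are acquired" behavior discussed below would be an OpenRTE-specific extension rather than something portable code could count on.
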
> this kind of function could be called heuristically by MPI-2 spawn-type functions, or even manually by the application (in the short term). it should not affect the semantics of the MPI-2 calls themselves.

Your best bet would be to have your own component so that you could do whatever you wanted with the spawn API. You could play with an RMGR component for now, but your best bet is clearly going to be the new PLM.

> the goal is that one could determine (at least with some confidence) whether there were any free (and ready to spawn quickly without blocking) resources before issuing a spawn call. this might just mean examining the value of the MPI universe size (and that this value could change), or it might need some new interface, i dunno.

You know, the real issue here (I think) is being driven by your use of bsub - which I believe is a batch launch request. Why would you want to do that instead of just directly calling lsb_launch()? I suspect we can get the Platform folks to give us an API to request additional node allocations from inside our program, so why not just use the API to launch? Or are you going the batch route because we don't currently have an API and you want to support older LSF versions? Might be more pain than it's worth...

>>> oh, and at first glance there appears to be a bunch of duplicated code across the various flavors of ras (and similarly for pls and sds). is it reasonable to attempt to factor things out? i seem to recall reading that some major rework was in progress, so perhaps this would not be a good time?
>>
>> Definitely not a good time - I would wait awhile and see how much of it remains. Some of it is there because of historical uncertainty over what would be common and what wouldn't be - some might be there for a reason known to the original author. I would advise asking before assuming...
>
> okay.
>
>>> uhm ... well, any advice on anything here?
>>>
>>> thanks,
>>>
>>> Matt.
>
> thanks again,
>
> Matt.
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel