Sounds good then. I only got into this thread because of (a) the reference to slurm, and (b) with Rainer's departure, I wasn't sure if someone else was going to pick up the alps support. Since you are re-assuming those latter duties (yes?), and since this actually has nothing to do with slurm itself, I'll let you decide when/if to deal with the patch.
I would only suggest that you remove the "slurm" comment from it, as it is definitely confusing.

Thanks
Ralph

On Jul 9, 2010, at 10:24 AM, Matney Sr, Kenneth D. wrote:

> Ralph,
>
> His patch only modifies the ALPS RAS mca. And it causes the environment
> variable BASIL_RESERVATION_ID to be a synonym for OMPI_ALPS_RESID.
> It makes it convenient for the version of SLURM that they are proposing. But
> it does not introduce any side effects.
> --
> Ken Matney, Sr.
> Oak Ridge National Laboratory
>
>
> On Jul 9, 2010, at 12:15 PM, Ralph Castain wrote:
>
> Actually, this patch doesn't have anything to do with slurm, according to the
> documentation in the links. It has to do with Cray's batch allocator system,
> which slurm is just interfacing to. So what you are really saying is that you
> want the alps ras to run if we either detect the presence of alps acting as a
> resource manager, or we detect that the Cray batch allocator has assigned an
> id.
>
> However that latter id was assigned is irrelevant to the patch.
>
> True?
>
> You Cray guys out there: is this going to cause a conflict with other Cray
> installations?
>
>
> On Jul 9, 2010, at 9:44 AM, Jerome Soumagne wrote:
>
> Another link which may be worth mentioning:
> https://computing.llnl.gov/linux/slurm/cray.html
>
> It says at the top of the page: "NOTE: As of January 2009, the SLURM interface
> to Cray systems is incomplete."
> But what we have now on our system is reasonably stable, and a good part of
> the SLURM interface to Cray is now complete.
> What we have at CSCS is a list of patches which improve and complete the
> interface. As I said, these modifications will be submitted for the November
> release of SLURM. Again, there is nothing non-standard in it.
>
> I hope that helps,
>
> Jerome
>
> On 07/09/2010 05:02 PM, Jerome Soumagne wrote:
>
> It's not invented, it's a SLURM standard name. Sorry for not having said
> that; my first e-mail was really too short.
> http://manpages.ubuntu.com/manpages/lucid/man1/sbatch.1.html
> http://slurm-llnl.sourcearchive.com/documentation/2.1.1/basil__interface_8c-source.html
> ...
>
> Google could have been your friend in this case... ;) (but I agree, it's
> really a strange name)
>
> Jerome
>
> On 07/09/2010 04:27 PM, Ralph Castain wrote:
>
> To clarify: what I'm trying to understand is what the heck a
> "BASIL_RESERVATION_ID" is - it isn't a standard slurm thing, nor can I find
> it defined in alps, so it appears to just be a local name you invented. True?
>
> If so, I would rather see some standard name instead of something local to
> one organization.
>
> On Jul 9, 2010, at 8:08 AM, Jerome Soumagne wrote:
>
> OK, I may not have explained this very clearly. In our case we only use SLURM
> as the resource manager.
> The difference here is that the SLURM version we use has support for ALPS.
> Therefore, when we run our job using the mpirun command, since we have the
> alps environment loaded, it's the ALPS RAS which is selected, and the ALPS
> PLM as well. I think I could even skip compiling the Open MPI slurm support.
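To make the behavior Ken describes above concrete: it amounts to an environment-variable fallback when the ALPS RAS looks up the reservation id, i.e. use OMPI_ALPS_RESID if it is set, otherwise accept BASIL_RESERVATION_ID. The standalone C sketch below is illustrative only; apart from the two variable names discussed in this thread, the function names and error handling are invented for the example and are not taken from the patch or the Open MPI sources.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative only: look up the ALPS reservation id, treating
 * BASIL_RESERVATION_ID (exported by salloc on SLURM/ALPS systems) as a
 * synonym for OMPI_ALPS_RESID (normally set by a site script). */
static const char *lookup_alps_resid(void)
{
    const char *resid = getenv("OMPI_ALPS_RESID");
    if (NULL == resid || '\0' == resid[0]) {
        /* Fall back to the variable set by salloc under SLURM */
        resid = getenv("BASIL_RESERVATION_ID");
    }
    return resid;   /* NULL if neither variable is set */
}

int main(void)
{
    const char *resid = lookup_alps_resid();
    if (NULL == resid) {
        fprintf(stderr, "no ALPS reservation id found in the environment\n");
        return 1;
    }
    printf("using ALPS reservation id %s\n", resid);
    return 0;
}

Run inside an salloc on such a system, this should print the same id that the patched RAS would pick up.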
> Here is what we do, for example. Here is my batch script (with the patched
> version):
>
> #!/bin/bash
> #SBATCH --job-name=HelloOMPI
> #SBATCH --nodes=2
> #SBATCH --time=00:30:00
>
> set -ex
> cd /users/soumagne/gele/hello
> mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 -np 2 --bynode `pwd`/hello
>
> And here is the output that I get:
>
> soumagne@gele1:~/gele/hello> more slurm-165.out
> + cd /users/soumagne/gele/hello
> ++ pwd
> + mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 --bynode -np 2 /users/soumagne/gele/hello/hello
> [gele2:15844] mca: base: components_open: Looking for plm components
> [gele2:15844] mca: base: components_open: opening plm components
> [gele2:15844] mca: base: components_open: found loaded component alps
> [gele2:15844] mca: base: components_open: component alps has no register function
> [gele2:15844] mca: base: components_open: component alps open function successful
> [gele2:15844] mca: base: components_open: found loaded component slurm
> [gele2:15844] mca: base: components_open: component slurm has no register function
> [gele2:15844] mca: base: components_open: component slurm open function successful
> [gele2:15844] mca:base:select: Auto-selecting plm components
> [gele2:15844] mca:base:select:( plm) Querying component [alps]
> [gele2:15844] mca:base:select:( plm) Query of component [alps] set priority to 75
> [gele2:15844] mca:base:select:( plm) Querying component [slurm]
> [gele2:15844] mca:base:select:( plm) Query of component [slurm] set priority to 75
> [gele2:15844] mca:base:select:( plm) Selected component [alps]
> [gele2:15844] mca: base: close: component slurm closed
> [gele2:15844] mca: base: close: unloading component slurm
> [gele2:15844] mca: base: components_open: Looking for ras components
> [gele2:15844] mca: base: components_open: opening ras components
> [gele2:15844] mca: base: components_open: found loaded component cm
> [gele2:15844] mca: base: components_open: component cm has no register function
> [gele2:15844] mca: base: components_open: component cm open function successful
> [gele2:15844] mca: base: components_open: found loaded component alps
> [gele2:15844] mca: base: components_open: component alps has no register function
> [gele2:15844] mca: base: components_open: component alps open function successful
> [gele2:15844] mca: base: components_open: found loaded component slurm
> [gele2:15844] mca: base: components_open: component slurm has no register function
> [gele2:15844] mca: base: components_open: component slurm open function successful
> [gele2:15844] mca:base:select: Auto-selecting ras components
> [gele2:15844] mca:base:select:( ras) Querying component [cm]
> [gele2:15844] mca:base:select:( ras) Skipping component [cm]. Query failed to return a module
> [gele2:15844] mca:base:select:( ras) Querying component [alps]
> [gele2:15844] ras:alps: available for selection
> [gele2:15844] mca:base:select:( ras) Query of component [alps] set priority to 75
> [gele2:15844] mca:base:select:( ras) Querying component [slurm]
> [gele2:15844] mca:base:select:( ras) Query of component [slurm] set priority to 75
> [gele2:15844] mca:base:select:( ras) Selected component [alps]
> [gele2:15844] mca: base: close: unloading component cm
> [gele2:15844] mca: base: close: unloading component slurm
> [gele2:15844] ras:alps:allocate: Using ALPS configuration file: "/etc/sysconfig/alps"
> [gele2:15844] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
> [gele2:15844] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: added NID 16 to list
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: added NID 20 to list
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:allocate: success
> I am nid00020 process 2/2
> I am nid00016 process 1/2
> [gele2:15844] mca: base: close: unloading component alps
> [gele2:15844] mca: base: close: component alps closed
> [gele2:15844] mca: base: close: unloading component alps
>
> I think that in this case you would not break anything, since it's really a
> basic patch which enables you to directly do an mpirun, without having to
> manually select any reservation id (assuming that the user has the SLURM
> version with ALPS support, which will be available soon).
>
> Jerome
>
> On 07/09/2010 03:06 PM, Ralph Castain wrote:
>
> Afraid I'm now even more confused. You use SLURM to do the allocation, and
> then use ALPS to launch the job?
>
> I'm just trying to understand, because I'm the person who generally maintains
> this code area. We have two frameworks involved here:
>
> 1. RAS - determines what nodes were allocated to us. There are both slurm and
> alps modules here.
>
> 2. PLM - actually launches the job. There are both slurm and alps modules
> here.
>
> Up until now, we have always seen people running with either alps or slurm,
> but never both together, so the module selection of these two frameworks is
> identical - if you select slurm for the RAS module, you will definitely get
> slurm for the launcher. Ditto for alps.
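The verbose output above, together with Ralph's description of the RAS and PLM frameworks, reflects a simple rule: every opened component is queried, each usable one reports a priority, and the highest-priority component wins, with the component queried first kept on a tie (which is why alps is selected over slurm even though both report priority 75 here). The sketch below illustrates only that selection rule; the types, names, and priorities are invented for the example and are not the real Open MPI MCA interfaces.

#include <stdio.h>
#include <stddef.h>

/* Invented types for illustration - NOT the real Open MPI MCA API. */
typedef struct {
    const char *name;
    /* Returns 1 and sets *priority if the component can run here,
     * 0 if it should be skipped (e.g. its environment is absent). */
    int (*query)(int *priority);
} component_t;

static int query_alps(int *priority)  { *priority = 75; return 1; }
static int query_slurm(int *priority) { *priority = 75; return 1; }
static int query_cm(int *priority)    { (void)priority; return 0; }

/* Pick the usable component with the highest priority; on a tie the
 * component queried first is kept. */
static const component_t *auto_select(const component_t *comps, size_t n)
{
    const component_t *best = NULL;
    int best_prio = -1;
    for (size_t i = 0; i < n; i++) {
        int prio;
        if (!comps[i].query(&prio)) {
            printf("skipping component [%s]\n", comps[i].name);
            continue;
        }
        printf("component [%s] set priority to %d\n", comps[i].name, prio);
        if (prio > best_prio) {
            best = &comps[i];
            best_prio = prio;
        }
    }
    return best;
}

int main(void)
{
    const component_t ras[] = {
        { "cm", query_cm }, { "alps", query_alps }, { "slurm", query_slurm },
    };
    const component_t *sel = auto_select(ras, sizeof ras / sizeof ras[0]);
    printf("selected component [%s]\n", sel ? sel->name : "none");
    return 0;
}

With the stub components above it reports that [cm] is skipped and [alps] is selected, mirroring the ras lines in the log.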
> Are you sure that mpirun is actually using the modules you think? Have you
> run this with -mca ras_base_verbose 10 -mca plm_base_verbose 10 and seen what
> modules are being used?
>
> In any event, this seems like a very strange combination, but I assume you
> have some reason for doing this?
>
> I'm always leery of fiddling with the SLURM modules as (a) there aren't very
> many slurm users out there, (b) the primary users are the DOE national labs
> themselves, using software provided by LLNL (who controls slurm), and (c)
> there are major disconnects between the various slurm releases, so we wind up
> breaking things for someone rather easily.
>
> So the more I can understand what you are doing, the easier it is to
> determine how to use your patch without breaking slurm support for others.
>
> Thanks!
> Ralph
>
>
> On Jul 9, 2010, at 6:46 AM, Jerome Soumagne wrote:
>
> Well, we actually use a patched version of SLURM, 2.2.0-pre8. We plan to
> submit the modifications made internally at CSCS for the next SLURM release
> in November. We implement ALPS support based on the basic architecture of
> SLURM.
> SLURM is only used to do the ALPS resource allocation. We then use mpirun,
> built on the portals and alps libraries.
> We don't use mca parameters to direct selection, and the alps RAS is
> automatically selected as expected.
>
> On 07/09/2010 01:59 PM, Ralph Castain wrote:
>
> Forgive my confusion, but could you please clarify something? You are using
> ALPS as the resource manager doing the allocation, and then using SLURM as
> the launcher (instead of ALPS)?
>
> That's a combination we've never seen or heard about. I suspect our module
> selection logic would be confused by such a combination - are you using mca
> params to direct selection?
>
>
> On Jul 9, 2010, at 4:19 AM, Jerome Soumagne wrote:
>
> Hi,
>
> We've recently installed Open MPI on one of our Cray XT5 machines here at
> CSCS. This machine uses SLURM for launching jobs.
> Doing an salloc defines this environment variable:
>
>     BASIL_RESERVATION_ID
>         The reservation ID on Cray systems running ALPS/BASIL only.
>
> Since the alps ras module tries to find a variable called OMPI_ALPS_RESID,
> which is set using a script, we thought that for SLURM systems it would be a
> good idea to directly integrate this BASIL_RESERVATION_ID variable in the
> code, rather than using a script. The small patch is attached.
>
> Regards,
>
> Jerome
>
> --
> Jérôme Soumagne
> Scientific Computing Research Group
> CSCS, Swiss National Supercomputing Centre
> Galleria 2, Via Cantonale | Tel: +41 (0)91 610 8258
> CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282
>
> <patch_slurm_alps.txt>

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel