My bad - I see that you actually do patch the alps ras. Is BASIL_RESERVATION_ID something included in alps, or is this just a name you invented?
On Jul 9, 2010, at 8:08 AM, Jerome Soumagne wrote:

> OK, I may not have explained this very clearly. In our case we only use
> SLURM as the resource manager. The difference here is that the SLURM
> version we use has support for ALPS. Therefore, when we run our job using
> the mpirun command, since we have the ALPS environment loaded, it's the
> ALPS RAS which is selected, and the ALPS PLM as well. I think I could even
> leave out the Open MPI SLURM support at compile time.
>
> Here is what we do, for example. This is my batch script (with the patched
> version):
>
> #!/bin/bash
> #SBATCH --job-name=HelloOMPI
> #SBATCH --nodes=2
> #SBATCH --time=00:30:00
>
> set -ex
> cd /users/soumagne/gele/hello
> mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 -np 2 --bynode `pwd`/hello
>
> And here is the output that I get:
>
> soumagne@gele1:~/gele/hello> more slurm-165.out
> + cd /users/soumagne/gele/hello
> ++ pwd
> + mpirun --mca ras_base_verbose 10 --mca plm_base_verbose 10 --bynode -np 2 /users/soumagne/gele/hello/hello
> [gele2:15844] mca: base: components_open: Looking for plm components
> [gele2:15844] mca: base: components_open: opening plm components
> [gele2:15844] mca: base: components_open: found loaded component alps
> [gele2:15844] mca: base: components_open: component alps has no register function
> [gele2:15844] mca: base: components_open: component alps open function successful
> [gele2:15844] mca: base: components_open: found loaded component slurm
> [gele2:15844] mca: base: components_open: component slurm has no register function
> [gele2:15844] mca: base: components_open: component slurm open function successful
> [gele2:15844] mca:base:select: Auto-selecting plm components
> [gele2:15844] mca:base:select:( plm) Querying component [alps]
> [gele2:15844] mca:base:select:( plm) Query of component [alps] set priority to 75
> [gele2:15844] mca:base:select:( plm) Querying component [slurm]
> [gele2:15844] mca:base:select:( plm) Query of component [slurm] set priority to 75
> [gele2:15844] mca:base:select:( plm) Selected component [alps]
> [gele2:15844] mca: base: close: component slurm closed
> [gele2:15844] mca: base: close: unloading component slurm
> [gele2:15844] mca: base: components_open: Looking for ras components
> [gele2:15844] mca: base: components_open: opening ras components
> [gele2:15844] mca: base: components_open: found loaded component cm
> [gele2:15844] mca: base: components_open: component cm has no register function
> [gele2:15844] mca: base: components_open: component cm open function successful
> [gele2:15844] mca: base: components_open: found loaded component alps
> [gele2:15844] mca: base: components_open: component alps has no register function
> [gele2:15844] mca: base: components_open: component alps open function successful
> [gele2:15844] mca: base: components_open: found loaded component slurm
> [gele2:15844] mca: base: components_open: component slurm has no register function
> [gele2:15844] mca: base: components_open: component slurm open function successful
> [gele2:15844] mca:base:select: Auto-selecting ras components
> [gele2:15844] mca:base:select:( ras) Querying component [cm]
> [gele2:15844] mca:base:select:( ras) Skipping component [cm]. Query failed to return a module
> [gele2:15844] mca:base:select:( ras) Querying component [alps]
> [gele2:15844] ras:alps: available for selection
> [gele2:15844] mca:base:select:( ras) Query of component [alps] set priority to 75
> [gele2:15844] mca:base:select:( ras) Querying component [slurm]
> [gele2:15844] mca:base:select:( ras) Query of component [slurm] set priority to 75
> [gele2:15844] mca:base:select:( ras) Selected component [alps]
> [gele2:15844] mca: base: close: unloading component cm
> [gele2:15844] mca: base: close: unloading component slurm
> [gele2:15844] ras:alps:allocate: Using ALPS configuration file: "/etc/sysconfig/alps"
> [gele2:15844] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
> [gele2:15844] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: added NID 16 to list
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 16
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: added NID 20 to list
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:read_appinfo: got NID 20
> [gele2:15844] ras:alps:allocate: success
> I am nid00020 process 2/2
> I am nid00016 process 1/2
> [gele2:15844] mca: base: close: unloading component alps
> [gele2:15844] mca: base: close: component alps closed
> [gele2:15844] mca: base: close: unloading component alps
>
> I think that in this case you would not break anything, since it's really a
> basic patch which simply lets you run mpirun directly, without having to
> manually select any reservation ID (assuming that the user has the SLURM
> version with ALPS support, which will be available soon).
>
> Jerome
>
> On 07/09/2010 03:06 PM, Ralph Castain wrote:
>>
>> Afraid I'm now even more confused. You use SLURM to do the allocation, and
>> then use ALPS to launch the job?
>>
>> I'm just trying to understand, because I'm the person who generally
>> maintains this code area. We have two frameworks involved here:
>>
>> 1. RAS - determines what nodes were allocated to us. There are both slurm
>> and alps modules here.
>>
>> 2. PLM - actually launches the job. There are both slurm and alps modules
>> here.
>>
>> Up until now, we have always seen people running with either alps or slurm,
>> but never both together, so the module selection of these two frameworks is
>> identical - if you select slurm for the RAS module, you will definitely get
>> slurm for the launcher. Ditto for alps.
>> Are you sure that mpirun is actually using the modules you think? Have you
>> run this with -mca ras_base_verbose 10 -mca plm_base_verbose 10 and seen
>> what modules are being used?
>>
>> In any event, this seems like a very strange combination, but I assume you
>> have some reason for doing this?
>>
>> I'm always leery of fiddling with the SLURM modules, as (a) there aren't
>> very many slurm users out there, (b) the primary users are the DOE national
>> labs themselves, using software provided by LLNL (who controls slurm), and
>> (c) there are major disconnects between the various slurm releases, so we
>> wind up breaking things for someone rather easily.
>>
>> So the more I can understand what you are doing, the easier it is to
>> determine how to use your patch without breaking slurm support for others.
>>
>> Thanks!
>> Ralph
>>
>>
>> On Jul 9, 2010, at 6:46 AM, Jerome Soumagne wrote:
>>
>>> Well, we actually use a patched version of SLURM, 2.2.0-pre8. We plan to
>>> submit the modifications made internally at CSCS for the next SLURM
>>> release in November. We implement ALPS support based on the basic
>>> architecture of SLURM. SLURM is only used to do the ALPS resource
>>> allocation. We then use mpirun based on the Portals and ALPS libraries.
>>> We don't use MCA parameters to direct selection, and the ALPS RAS is
>>> correctly selected automatically.
>>>
>>> On 07/09/2010 01:59 PM, Ralph Castain wrote:
>>>>
>>>> Forgive my confusion, but could you please clarify something? You are
>>>> using ALPS as the resource manager doing the allocation, and then using
>>>> SLURM as the launcher (instead of ALPS)?
>>>>
>>>> That's a combination we've never seen or heard about. I suspect our
>>>> module selection logic would be confused by such a combination - are you
>>>> using mca params to direct selection?
>>>>
>>>>
>>>> On Jul 9, 2010, at 4:19 AM, Jerome Soumagne wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We've recently installed Open MPI on one of our Cray XT5 machines here
>>>>> at CSCS. This machine uses SLURM for launching jobs. Doing an salloc
>>>>> defines this environment variable:
>>>>>
>>>>> BASIL_RESERVATION_ID
>>>>> The reservation ID on Cray systems running ALPS/BASIL only.
>>>>>
>>>>> Since the ALPS RAS module tries to find a variable called
>>>>> OMPI_ALPS_RESID, which is set using a script, we thought that for SLURM
>>>>> systems it would be a good idea to integrate this BASIL_RESERVATION_ID
>>>>> variable directly in the code, rather than using a script. The small
>>>>> patch is attached.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Jerome
>>>>> --
>>>>> Jérôme Soumagne
>>>>> Scientific Computing Research Group
>>>>> CSCS, Swiss National Supercomputing Centre
>>>>> Galleria 2, Via Cantonale | Tel: +41 (0)91 610 8258
>>>>> CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282
>>>>>
>>>>> <patch_slurm_alps.txt>
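For readers without the attachment handy, here is a minimal sketch of the kind of fallback Jerome describes: keep honoring OMPI_ALPS_RESID, and fall back to the BASIL_RESERVATION_ID variable that salloc exports on ALPS/BASIL systems. This is illustrative only - it is not the attached patch, and the function name and return convention below are assumptions, not the actual ras_alps sources.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative sketch, not the actual Open MPI ras_alps code: resolve
     * the ALPS reservation ID from the environment, preferring the existing
     * OMPI_ALPS_RESID variable and falling back to BASIL_RESERVATION_ID,
     * which SLURM's salloc sets on ALPS/BASIL systems. */
    static char *lookup_alps_resid(void)
    {
        const char *resid = getenv("OMPI_ALPS_RESID");

        if (NULL == resid || '\0' == *resid) {
            /* No wrapper script set OMPI_ALPS_RESID, so try the variable
             * that salloc defines when SLURM made the ALPS reservation. */
            resid = getenv("BASIL_RESERVATION_ID");
        }
        /* Caller owns (and must free) the returned copy. */
        return (NULL == resid || '\0' == *resid) ? NULL : strdup(resid);
    }

    int main(void)
    {
        char *resid = lookup_alps_resid();

        printf("ALPS reservation ID: %s\n", resid ? resid : "(none found)");
        free(resid);
        return 0;
    }

Either way, the resolved ID would presumably be consumed exactly as OMPI_ALPS_RESID is today, which is why no wrapper script or manual reservation-ID selection is needed under the ALPS-aware SLURM.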
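Since the selection output above may look surprising - the alps and slurm components both report priority 75, yet alps is chosen - here is a generic sketch of priority-driven component selection of the kind Ralph describes for the RAS and PLM frameworks. It is not the actual MCA base code; every name in it is invented for illustration.

    #include <stddef.h>
    #include <stdio.h>

    typedef struct {
        const char *name;
        int (*query)(int *priority); /* returns 0, sets *priority if usable */
    } component_t;

    static int query_cm(int *priority)    { (void)priority; return -1; }
    static int query_alps(int *priority)  { *priority = 75; return 0; }
    static int query_slurm(int *priority) { *priority = 75; return 0; }

    static const component_t *select_component(const component_t *comps,
                                               size_t n)
    {
        const component_t *best = NULL;
        int best_pri = -1;
        size_t i;

        for (i = 0; i < n; i++) {
            int pri;
            /* "Skipping component [cm]. Query failed to return a module" */
            if (0 != comps[i].query(&pri)) {
                continue;
            }
            /* Strict '>' means ties keep the component queried first: alps
             * and slurm both report 75 above, and alps, queried first,
             * wins. */
            if (pri > best_pri) {
                best = &comps[i];
                best_pri = pri;
            }
        }
        return best;
    }

    int main(void)
    {
        const component_t comps[] = {
            { "cm", query_cm }, { "alps", query_alps }, { "slurm", query_slurm }
        };
        const component_t *sel =
            select_component(comps, sizeof(comps) / sizeof(comps[0]));

        printf("Selected component [%s]\n", sel ? sel->name : "none");
        return 0;
    }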