I would still prefer the first patch, though, since it gets rid of the script and of an extra environment variable, but I'll let you choose.

Jerome

On 07/09/2010 06:27 PM, Jerome Soumagne wrote:
Hi Ken,

That's interesting. Setting OMPI_ALPS_RESID in the modules environment by executing ras-alps-command.sh is a good idea. In that case, another option would be to add an extra check to this script for BASIL_RESERVATION_ID, as you did for BATCH_PARTITION_ID.
So here is another possible patch:

Index: ras-alps-command.sh
===================================================================
--- ras-alps-command.sh    (revision 23365)
+++ ras-alps-command.sh    (working copy)
@@ -22,6 +22,13 @@
     exit 0
   fi

+  # If the SLURM BASIL_RESERVATION_ID is set, use it.
+  if [ "${BASIL_RESERVATION_ID}" != "" ]
+  then
+      ${ECHO} ${BASIL_RESERVATION_ID}
+      exit 0
+  fi
+
 # Extract the batch job ID directly from the environment, if available.
   jid=${BATCH_JOBID:--1}
   if [ $jid -eq -1 ]
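
For readers skimming the diff, the lookup order it produces can be sketched as a standalone function. This is only an illustration under assumptions: resolve_resid is a hypothetical name that does not exist in ras-alps-command.sh, and the real script goes on to query ALPS by batch job ID when neither variable is set.

```shell
#!/bin/sh
# Hypothetical helper (not part of ras-alps-command.sh): print the first
# reservation ID found in the environment, mirroring the order in the patch.
resolve_resid() {
    # Older Cray XT systems exported the reservation ID directly.
    if [ -n "${BATCH_PARTITION_ID:-}" ]; then
        echo "${BATCH_PARTITION_ID}"
        return 0
    fi
    # SLURM's salloc sets this variable on ALPS/BASIL systems.
    if [ -n "${BASIL_RESERVATION_ID:-}" ]; then
        echo "${BASIL_RESERVATION_ID}"
        return 0
    fi
    # Neither variable is set; the caller has to fall back to another method.
    return 1
}
```

A caller could use it as `resid=$(resolve_resid) || echo "no reservation ID in environment"`.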


Thanks for your help in the clarification.

Jerome

On 07/09/2010 05:41 PM, Matney Sr, Kenneth D. wrote:
Hi Jerome,

I am in part responsible for the current incarnation of the ALPS support in
OMPI.  We use the modules environment to set OMPI_ALPS_RESID to the ALPS
reservation ID, the pertinent parts of which are:

   set           ridpath                         ${basedir}/share/openmpi
   set           ridname                         ras-alps-command.sh
   set           rid                             ${ridpath}/${ridname}

# Set local cluster parameters for XT5.
   set           resId                           [exec /bin/bash ${rid}]
   setenv        OMPI_ALPS_RESID                 $resId

Originally, the Cray XT systems automatically set an environment variable,
BATCH_PARTITION_ID, to the ALPS reservation ID for the job.  However, newer
versions do not expose the ALPS reservation ID to the user.  So, we need a
way to get the ALPS reservation ID of the Torque job.  Unfortunately, Cray
has not made the internal structure of ALPS that does this available.  So,
we are forced to use apstat to get this information.  But apstat is not as
robust as we might like.  Ergo, the script loops on apstat until it does not
fail.  In the end, we obtain the ALPS reservation ID for the current Torque
job and assign it to OMPI_ALPS_RESID.  I chose this name so as to avoid
namespace conflicts.
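
The "loop until it does not fail" idea can be sketched roughly as follows. This is a generic illustration, not the code in ras-alps-command.sh: retry_query and its arguments are hypothetical names, and the apstat invocation is deliberately left to the caller because the options it needs are not given in this thread.

```shell
#!/bin/sh
# Hypothetical helper: run a query command until it succeeds or the
# attempts run out.
# Usage: retry_query MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_query() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Capture stdout and print it only if the command exited successfully.
        if output=$("$@"); then
            echo "$output"
            return 0
        fi
        # Wait before the next try, but not after the final failure.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

Unlike the behaviour described above, this sketch gives up after MAX_ATTEMPTS so that a persistent failure cannot hang the module load. It would be called with whatever command line extracts the reservation ID, for example `retry_query 10 1 my_apstat_query`, where my_apstat_query is a placeholder.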

So, the ALPS RAS mca is being selected, because your patch tells the ALPS
RAS mca that BASIL_RESERVATION_ID is equivalent to OMPI_ALPS_RESID.  In
turn, while you invoke OMPI with mpirun, the OMPI version of mpirun will
select the ALPS PLM mca.  This will launch your job with an aprun (under the
covers).  So, your job does show a successful run.  However, you may not be
taking the path through mpirun that you intended.

I do hope that I have cleared up some confusion.
--
Ken Matney, Sr.
Oak Ridge National Laboratory


On Jul 9, 2010, at 6:19 AM, Jerome Soumagne wrote:

Hi,

We've recently installed OpenMPI on one of our Cray XT5 machines, here at
CSCS.  This machine uses SLURM for launching jobs.
Doing an salloc defines this environment variable:
               BASIL_RESERVATION_ID
               The reservation ID on Cray systems running ALPS/BASIL only.

Since the alps ras module tries to find a variable called OMPI_ALPS_RESID,
which is set using a script, we thought that for SLURM systems it would be a
good idea to directly integrate this BASIL_RESERVATION_ID variable in the
code, rather than using a script. The small patch is attached.

Regards,

Jerome

--
Jérôme Soumagne
Scientific Computing Research Group
CSCS, Swiss National Supercomputing Centre
Galleria 2, Via Cantonale  | Tel: +41 (0)91 610 8258
CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282




_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

