On 7/28/22 09:28, Christian Meesters wrote:


On 7/28/22 14:56, Rob Sargent wrote:
On Jul 28, 2022, at 1:10 AM, Christian Meesters <meest...@uni-mainz.de> wrote:
Hi,

not quite. Under SLURM the job step starter (SLURM lingo) is "srun". You do not ssh from job host
to job host; rather, you use "parallel" as a semaphore, avoiding oversubscription of the job steps
started with "srun". I summarized this approach here:

https://mogonwiki.zdv.uni-mainz.de/dokuwiki/start:working_on_mogon:workflow_organization:node_local_scheduling#running_on_several_hosts
(uh-oh, I need to clean up that site; many sections there are outdated, but this
one should still be OK)
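
In sketch form, the pattern looks roughly like this (my_app, the input glob, and the resource counts are placeholders, and on recent SLURM releases srun may want --exact rather than --exclusive for step-level resource fencing):

   #!/bin/bash
   #SBATCH --nodes=2
   #SBATCH --ntasks=8
   #SBATCH --cpus-per-task=4

   # "parallel" acts as the semaphore: at most $SLURM_NTASKS job steps
   # run at once, so steps are never oversubscribed. "srun" starts each
   # step on whichever allocated node currently has free resources.
   parallel --jobs "$SLURM_NTASKS" \
        srun --exclusive --nodes=1 --ntasks=1 \
             --cpus-per-task="$SLURM_CPUS_PER_TASK" \
        ./my_app {} ::: input_*.dat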

One advantage: you can safely utilize the resources of both (or more) hosts,
the master host and all secondaries. How many resources you require depends on
your application and the work it does. Be sure to consider I/O (e.g. stage files
in to avoid random I/O from too many concurrent applications; see the sketch
below), if this is an issue for your application.
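
A stage-in step could be as simple as this sketch (everything here is hypothetical; sbcast copies a file onto every allocated node, and $TMPDIR is assumed to point at node-local scratch):

   # copy the input once to node-local scratch, so the many concurrent
   # tasks read a local copy instead of doing random I/O against the
   # shared filesystem
   sbcast big_input.dat "$TMPDIR/big_input.dat"
   parallel --jobs "$SLURM_NTASKS" \
        srun --exclusive --nodes=1 --ntasks=1 \
        ./my_app "$TMPDIR/big_input.dat" {} ::: {1..750}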

Cheers

Christian
Christian,
My use of GNU parallel does not include ssh. Rather, I simply fill the slurm
node with --jobs=ncores.

That would require an interactive job plus ncores_per_node/threads_per_application ssh connections, and you would have to trigger the script manually. My solution uses parallel in a SLURM job context and avoids that human synchronization step, while also offering a potential multi-node job with SMP applications. It's your choice, of course.


If I follow correctly, that is what I am doing. Here's my slurm job:

   #!/bin/bash
   LOGDIR=/scratch/general/pe-nfs1/u0138544/logs
   chmod a+x "$LOGDIR"/*
   days=$1; shift
   tid=$1; shift

   # no task id given: create a fresh job directory and derive the id from it
   if [[ -z "$tid" ]]
   then
        JOBDIR=$(mktemp --directory --tmpdir="$LOGDIR" XXXXXX)
        tid=$(basename "$JOBDIR")
   else
        JOBDIR=$LOGDIR/$tid
        mkdir -p "$JOBDIR"
   fi
   . /uufs/chpc.utah.edu/sys/installdir/sgspub/bin/sgsCP.sh

   chmod -R a+rwx "$JOBDIR"
   # compute the wall-clock end time: now plus the requested number of days
   rnow=$(date +%s)
   rsec=$(( days * 24 * 3600 ))
   endtime=$(( rnow + rsec ))

   # count hardware threads, then halve to get physical cores
   cores=$(grep -c processor /proc/cpuinfo)
   cores=$(( cores / 2 ))

   # make the logs readable even if slurm stops the job before the final chmod
   trap "chmod -R a+rw $JOBDIR" SIGCONT SIGTERM

   parallel \
        --joblog "$JOBDIR/${tid}.ll" \
        --verbose \
        --jobs "$cores" \
        --delay 1 \
        /uufs/chpc.utah.edu/sys/installdir/sgspub/bin/chaser-10Mt \
             83a9a2ad-fe16-4872-b629-b9ba70ed5bbb $endtime $JOBDIR ::: {1..750}
   chmod a+rw "$JOBDIR/${tid}.ll"

If the complete job finishes nicely, I can read/write the job log; the trap is there in case the slurm job exceeds its time limit. But while things are running, I cannot look at the '.ll' file.
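
One possible workaround (an untested sketch; it assumes GNU parallel appends to or truncates an existing joblog rather than unlinking and recreating it, so permissions set up front would survive):

   # pre-create the joblog with open permissions before parallel starts,
   # so it can be inspected while the job is still running
   touch "$JOBDIR/${tid}.ll"
   chmod a+rw "$JOBDIR/${tid}.ll"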
rjs
