On 7/29/22 23:52, Ole Tange wrote:
On Thu, Jul 28, 2022 at 5:45 PM Rob Sargent<robjsarg...@gmail.com>  wrote:
On 7/28/22 09:28, Christian Meesters wrote:


On 7/28/22 14:56, Rob Sargent wrote:

On Jul 28, 2022, at 1:10 AM, Christian Meesters<meest...@uni-mainz.de>  wrote:
Hi,

not quite. Under SLURM the job step starter (SLURM lingo) is "srun". You do not ssh from job host 
to job host, but rather use "parallel" as a semaphore to avoid oversubscription of job steps with 
"srun". I summarized this approach here:

https://mogonwiki.zdv.uni-mainz.de/dokuwiki/start:working_on_mogon:workflow_organization:node_local_scheduling#running_on_several_hosts
  (uh-oh - I need to clean up that site, many outdated sections there, but this 
one should still be ok)

One advantage: you can safely utilize the resources of both (or more) hosts - 
the master host and all secondaries. How many resources you require depends on 
your application and the work it does. Be sure to consider I/O (e.g. stage in 
files to avoid random I/O with too many concurrent applications, etc.), if this 
is an issue for your application.

Cheers

Christian

Christian,
My use of GNU parallel does not include ssh. Rather I simply fill the slurm
node with --jobs=ncores

That would require an interactive job with 
ncores_per_node/threads_per_application ssh connections, and you would have to 
trigger the script manually. My solution is to use parallel in a SLURM job 
context and avoid the synchronization step by a human, whilst offering a 
potential multi-node job with SMP applications. It's your choice, of course.


if I follow correctly that is what I am doing.  Here's my slurm job
Would this work:

#!/bin/bash
LOGDIR=/scratch/general/pe-nfs1/u0138544/logs
chmod a+x $LOGDIR/*
logfile="$LOGDIR"/mylog.$$
touch "$logfile"
chmod -R a+rw "$LOGDIR"

. /uufs/chpc.utah.edu/sys/installdir/sgspub/bin/sgsCP.sh

parallel \
      --joblog +"$logfile" \
      --verbose \
      --jobs 50% \
      --delay 1 \
      /uufs/chpc.utah.edu/sys/installdir/sgspub/bin/chaser-10Mt \
      83a9a2ad-fe16-4872-b629-b9ba70ed5bbb $endtime $JOBDIR ::: {1..750}

The idea is the same as my original: Make the file and set the
permissions before starting GNU Parallel.
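
As a minimal sketch of that idea (the directory and file names below are
placeholders, not the paths from this thread), the log file can be created
and opened up before GNU Parallel is started, and then inspected:

```shell
#!/bin/bash
# Placeholder location for illustration; substitute the real log directory.
logdir="$(mktemp -d)"
joblog="$logdir/example.joblog"

# Create the (empty) log file first, then make it readable and writable
# for all users, so a later writer running as another account can use it.
touch "$joblog"
chmod a+rw "$joblog"

# Show the resulting owner and mode before handing the path to parallel.
ls -l "$joblog"
```

Only after this step would the path be passed to --joblog.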

/Ole


Still not correct.  I'm now getting

   "parallel: Error: Cannot write to --joblog
   +/scratch/general/pe-nfs1/u0138544/logs/o2iY02/o2iY02.ll."
   //I believe the final period on the line above comes from the error
   message generator.

After trying the plus-sign

   touch      $JOBDIR/${tid}.ll
   chmod a+rw $JOBDIR/${tid}.ll
   . /uufs/chpc.utah.edu/sys/installdir/sgspub/bin/sgsCP.sh
   parallel \
        --joblog +$JOBDIR/${tid}.ll \
        --verbose \
        --jobs $cores \
        --delay 1 \
        /uufs/chpc.utah.edu/sys/installdir/sgspub/bin/chaser-10Mt \
        83a9a2ad-fe16-4872-b629-b9ba70ed5bbb $endtime $JOBDIR ::: {1..750}

And here's (emacs's view of) the directory

   /scratch/general/pe-nfs1/u0138544/logs/o2iY02:
      total used in directory 12 available 243131850752
      drwxrwxrwx   2 hcipepipeline hci     31 Jul 31 03:19 .
      drwxrwxr-x+ 19 u0138544      camp 12288 Jul 31 03:19 ..
      -rw-rw-rw-   1 hcipepipeline hci      0 Jul 31 03:19 o2iY02.ll
   //  The "r-x+" is not borne out by bash.  I can cd to 02iY02.ll and
   touch foo (as myself, not the "hci" account)
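
   A rough way to check this from inside the job itself is sketched below.
   The function name check_joblog_path and the demo paths are made up for
   illustration. It tests the path without the leading plus sign, since the
   error message above prints the plus sign as part of the name; whether
   that is the actual cause is not established here.

```shell
#!/bin/bash
# Hypothetical helper: report whether the current user can write a log
# file at the given path (the file itself, or its directory if the file
# does not exist yet).
check_joblog_path() {
    local path="$1" dir
    dir="$(dirname -- "$path")"
    # The directory must exist and be searchable.
    [ -d "$dir" ] && [ -x "$dir" ] || { echo "cannot enter $dir"; return 1; }
    if [ -e "$path" ]; then
        # Existing file: it must be writable by the current user.
        [ -w "$path" ] || { echo "exists but not writable: $path"; return 1; }
    else
        # Missing file: the directory must allow creating it.
        [ -w "$dir" ] || { echo "cannot create files in $dir"; return 1; }
    fi
    echo "ok: $path (checked as user $(id -un))"
}

# Demo on a throwaway file; in the job this would be the real log path.
demo_dir="$(mktemp -d)"
demo_log="$demo_dir/demo.joblog"
touch "$demo_log"
check_joblog_path "$demo_log"
```

   Running the same check as the account that executes the SLURM job, rather
   than interactively, shows the permissions that account actually has.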

For now I've reverted to not using the plus-sign until I know how to.

Perhaps I need to supply the header line?

Thanks,
rjs


