This is not a SLURM job file, as it contains no '#SBATCH' directives. (Yes,
they could be given on the command line.)
It is also a bit peculiar, as you apparently consider it necessary to adjust
permissions. This is usually done in so-called prolog scripts, which run
prior to the job start. If your cluster deviates from that, you should discuss
it with your admins, as it makes your work cumbersome and error-prone.
Also, it is not necessary to infer the number of CPUs on a node: the
number of CPUs available to your particular job should be exposed as
environment variables (see the wiki link I gave). Please contact
your administrators about these things.
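For illustration only (a sketch, not your site's settings; which variable is
set depends on how the job was requested), something along these lines inside
the job script would use SLURM's own information instead of /proc/cpuinfo:

# sketch: prefer the CPU count SLURM exports for this job,
# falling back to /proc/cpuinfo only if the variable is unset
cores=${SLURM_CPUS_ON_NODE:-$(grep -c processor /proc/cpuinfo)}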
As for the job log: SLURM gathers stdout/stderr as specified by the
sbatch -o and -e directives. They should be directed to shared file
systems. Anything which is local to the job might not be accessible
after the job has finished. Whether /scratch is a global filesystem or
a local one cannot be deduced from the context.
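For example (the path here is a placeholder; use whatever shared filesystem
your site provides), such directives could look like:

# sketch with a placeholder shared path; %x is the job name, %j the job id
#SBATCH -o /path/to/shared/fs/%x-%j.out
#SBATCH -e /path/to/shared/fs/%x-%j.err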
All in all, you should contact your local helpdesk; there are a number
of things here which might be due to the application or the cluster
settings, not to parallel.
On 7/28/22 17:44, Rob Sargent wrote:
On 7/28/22 09:28, Christian Meesters wrote:
On 7/28/22 14:56, Rob Sargent wrote:
On Jul 28, 2022, at 1:10 AM, Christian Meesters<meest...@uni-mainz.de> wrote:
Hi,
not quite. Under SLURM the job step starter (SLURM lingo) is "srun". You do not ssh from job
host to job host, but rather use "parallel" as a semaphore to avoid oversubscription of job
steps with "srun". I summarized this approach here:
https://mogonwiki.zdv.uni-mainz.de/dokuwiki/start:working_on_mogon:workflow_organization:node_local_scheduling#running_on_several_hosts
(uh-oh - I need to clean up that site, many outdated sections there, but this
one should still be ok)
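Roughly, the pattern looks like this (my_app and its inputs are placeholders,
it assumes the job was submitted with --ntasks set, and the exact srun flags
may differ between SLURM versions):

# sketch: parallel acts as the semaphore, srun starts each job step on the
# nodes of the allocation; -N1 -n1 gives each step one task on one node
parallel --jobs "$SLURM_NTASKS" \
    srun --exclusive -N1 -n1 ./my_app {} ::: input_*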
One advantage: you can safely utilize the resources of both (or more) hosts -
the master host and all secondaries. How many resources you require depends on
your application and the work it does. Be sure to consider I/O (e.g. stage in
files to avoid random I/O from too many concurrent applications, etc.), if this
is an issue for your application.
Cheers
Christian
Christian,
My use of GNU parallel does not include ssh. Rather, I simply fill the SLURM
node with --jobs=ncores
That would require having an interactive job and
ncores_per_node/threads_per_application ssh connections, and you
would have to trigger the script manually. My solution is to use
parallel in a SLURM job context and avoid the human synchronization
step, whilst offering a potential multi-node job with SMP
applications. It's your choice, of course.
If I follow correctly, that is what I am doing. Here's my SLURM job:
#!/bin/bash
LOGDIR=/scratch/general/pe-nfs1/u0138544/logs
chmod a+x $LOGDIR/*
# arguments: run length in days, plus an optional existing task id
days=$1; shift
tid=$1; shift
if [[ "$tid"x == "x" ]]
then
JOBDIR=`mktemp --directory --tmpdir=$LOGDIR XXXXXX`
tid=$(basename $JOBDIR)
else
JOBDIR=$LOGDIR/$tid
mkdir -p $JOBDIR
fi
. /uufs/chpc.utah.edu/sys/installdir/sgspub/bin/sgsCP.sh
chmod -R a+rwx $JOBDIR
# compute the wall-clock end time handed to each worker
rnow=$(date +%s)
rsec=$(( $days * 24 * 3600 ))
endtime=$(( $rnow + $rsec ))
# use half of the logical CPUs reported by /proc/cpuinfo
cores=`grep -c processor /proc/cpuinfo`
cores=$(( $cores / 2 ))
# if the SLURM job is signalled (e.g. at the time limit), still open up the job dir
trap "chmod -R a+rw $JOBDIR" SIGCONT SIGTERM
# run up to $cores chaser-10Mt instances concurrently, logging progress to the joblog
parallel \
--joblog $JOBDIR/${tid}.ll \
--verbose \
--jobs $cores \
--delay 1 \
/uufs/chpc.utah.edu/sys/installdir/sgspub/bin/chaser-10Mt \
83a9a2ad-fe16-4872-b629-b9ba70ed5bbb $endtime $JOBDIR ::: {1..750}
chmod a+rw $JOBDIR/${tid}.ll
If the complete job finishes nicely, then I can read/write the job
log. The trap is there in case the SLURM job exceeds time limits.
But while things are running, I cannot look at the '.ll' file.
rjs