This is not a SLURM job file as such, since it contains no '#SBATCH' directives. (Yes, they could also be given on the command line.)
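For illustration only (the values below are placeholders, not your actual job): such directives can sit in the script header,

    #!/bin/bash
    #SBATCH --job-name=myjob
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=16
    #SBATCH --time=24:00:00

or equivalently be passed on the command line:

    sbatch --job-name=myjob --nodes=1 --ntasks=1 --cpus-per-task=16 --time=24:00:00 jobscript.sh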

It is also a bit peculiar that you apparently need to adjust permissions yourself. This is usually done in so-called prolog scripts, which run before the job starts. If your cluster deviates from that, you should discuss it with your admins, as it makes your work cumbersome and error-prone. Likewise, it is not necessary to infer the number of CPUs on a node: the number of CPUs available to your particular job is exposed through environment variables (see the wiki link I gave). Please contact your administrators about these things.
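For example (just a sketch; exactly which variables are set depends on how the job was requested):

    # set by SLURM at job start:
    # SLURM_CPUS_PER_TASK if --cpus-per-task was requested,
    # SLURM_CPUS_ON_NODE for the CPUs allocated on the current node
    cores=${SLURM_CPUS_PER_TASK:-$SLURM_CPUS_ON_NODE}
    echo "using $cores cores"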

As for the job log: SLURM gathers stdout/stderr as specified by the sbatch -o and -e directives. These should be directed to a shared file system; anything local to the job might not be accessible after the job has finished. Whether /scratch is a global file system or a node-local one cannot be told from the context.
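For instance, directing both streams to a shared path might look like this (the path is a placeholder; adapt it to whatever file system your site exports to all nodes):

    #SBATCH -o /path/to/shared/fs/%x-%j.out
    #SBATCH -e /path/to/shared/fs/%x-%j.err
    # %x expands to the job name, %j to the job id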

All in all, you should contact your local helpdesk: a number of these things might be due to the application or the cluster settings, not to GNU parallel.



On 7/28/22 17:44, Rob Sargent wrote:
On 7/28/22 09:28, Christian Meesters wrote:


On 7/28/22 14:56, Rob Sargent wrote:
On Jul 28, 2022, at 1:10 AM, Christian Meesters <meest...@uni-mainz.de> wrote:
Hi,

not quite. Under SLURM the job step starter (SLURM lingo) is "srun". You do not ssh from job host
to job host, but rather use "parallel" as a semaphore to avoid oversubscription of job steps
launched with "srun". I summarized this approach here:

https://mogonwiki.zdv.uni-mainz.de/dokuwiki/start:working_on_mogon:workflow_organization:node_local_scheduling#running_on_several_hosts
  (uh-oh - I need to clean up that site, many outdated sections there, but this 
one should still be ok)

One advantage: you can safely utilize the resources of both (or more) hosts,
the master host and all secondaries. How many resources you require depends on
your application and the work it does. Be sure to consider I/O (e.g. stage in
files to avoid random I/O from too many concurrent applications, etc.), if this
is an issue for your application.

Cheers

Christian
Christian,
My use of GNU parallel does not include ssh. Rather, I simply fill the SLURM
node with --jobs=ncores.

That would require an interactive job and ncores_per_node/threads_per_application ssh connections, and you have to trigger the script manually. My solution is to use parallel in a SLURM job context and avoid that manual synchronization step, while also offering a potential multi-node job with SMP applications. It's your choice, of course.
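A minimal sketch of that pattern (not copied from the wiki page; node/task counts and the application name are placeholders):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks=32           # total number of concurrent job steps
    #SBATCH --cpus-per-task=1

    # "parallel" throttles to the allocated task count, "srun" places each
    # job step on whichever node of the allocation has a free slot;
    # --exclusive (--exact on newer SLURM versions) keeps steps from sharing CPUs
    parallel --jobs "$SLURM_NTASKS" \
        srun --nodes=1 --ntasks=1 --exclusive ./my_app {} ::: {1..750}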


If I follow correctly, that is what I am doing. Here's my SLURM job:

    #!/bin/bash
    # shared log area on /scratch; make existing entries traversable for everyone
    LOGDIR=/scratch/general/pe-nfs1/u0138544/logs
    chmod a+x $LOGDIR/*
    # arguments: run time in days, optionally an existing job/task id
    days=$1; shift
    tid=$1; shift

    if [[ "$tid"x == "x" ]]
    then
        JOBDIR=`mktemp --directory --tmpdir=$LOGDIR XXXXXX`
        tid=$(basename $JOBDIR)
    else
        JOBDIR=$LOGDIR/$tid
        mkdir -p $JOBDIR
    fi
    . /uufs/chpc.utah.edu/sys/installdir/sgspub/bin/sgsCP.sh

    chmod -R a+rwx $JOBDIR
    # wall-clock end time: now plus the requested number of days, in epoch seconds
    rnow=$(date +%s)
    rsec=$(( $days * 24 * 3600 ))
    endtime=$(( $rnow + $rsec ))

    # use half of the logical CPUs reported by the node
    cores=$(grep -c processor /proc/cpuinfo)
    cores=$(( $cores / 2 ))

    trap "chmod -R a+rw $JOBDIR" SIGCONT SIGTERM

    # fill the node: run up to $cores instances of the application concurrently
    parallel \
        --joblog $JOBDIR/${tid}.ll \
        --verbose \
        --jobs $cores \
        --delay 1 \
        /uufs/chpc.utah.edu/sys/installdir/sgspub/bin/chaser-10Mt \
        83a9a2ad-fe16-4872-b629-b9ba70ed5bbb $endtime $JOBDIR ::: {1..750}
    chmod a+rw $JOBDIR/${tid}.ll

If the complete job finishes nicely, then I can read/write the job log. The trap is there in case the SLURM job exceeds its time limit. But while things are running, I cannot look at the '.ll' file.
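One way to make that trap fire more reliably would be to ask SLURM for a signal some time before the limit and to wait on a backgrounded parallel (a sketch only; please check with your site whether this is supported/configured):

    #SBATCH --signal=B:USR1@120   # signal the batch shell 120 s before the time limit

    trap "chmod -R a+rw $JOBDIR" USR1 SIGCONT SIGTERM

    parallel ... &                # the same parallel call as above, backgrounded
    wait                          # interrupted by the signal, so the trap runs promptly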
rjs
