This is not a SLURM job file, as it contains no '#SBATCH' directives. (Yes,
they could be given on the command line.)
It is also a bit peculiar, as you apparently consider it necessary to adjust
permissions. This is usually done in so-called prolog scripts, which run
prior to the job start. If your cluster deviates from that, you should discuss
it with your admins, as it makes your work cumbersome and error-prone.
Also, it is not necessary to infer the number of CPUs on a node: the
number of CPUs available to your particular job should be exposed as
environment variables (see the wiki link I gave). Please contact
your administrators about these things.
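For illustration only (a sketch, not your site's settings; which variable is
set depends on how the job was requested), something along these lines inside
the job script would use SLURM's own information instead of /proc/cpuinfo:

# sketch: prefer the CPU count SLURM exports for this job,
# falling back to /proc/cpuinfo only if the variable is unset
cores=${SLURM_CPUS_ON_NODE:-$(grep -c processor /proc/cpuinfo)}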
As for the job log: SLURM gathers stdout/stderr as specified by the
sbatch -o and -e directives. They should be directed to shared file
systems. Anything which is local to the job might not be accessible
after the job has finished. Whether /scratch is a global filesystem or
a local one cannot be deduced from the context.
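For example (the path here is a placeholder; use whatever shared filesystem
your site provides), such directives could look like:

# sketch with a placeholder shared path; %x is the job name, %j the job id
#SBATCH -o /path/to/shared/fs/%x-%j.out
#SBATCH -e /path/to/shared/fs/%x-%j.err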
All in all, you should contact your local helpdesk; there are a number
of things here which might be due to the application or the cluster
settings, not to parallel.
On 7/28/22 17:44, Rob Sargent wrote:
On 7/28/22 09:28, Christian Meesters wrote:
On 7/28/22 14:56, Rob Sargent wrote:
On Jul 28, 2022, at 1:10 AM, Christian Meesters<meest...@uni-mainz.de> wrote:
Hi,
not quite. Under SLURM the job step starter (SLURM lingo) is "srun". You do not ssh from job
host to job host, but rather use "parallel" as a semaphore to avoid oversubscription of job
steps with "srun". I summarized this approach here:
https://mogonwiki.zdv.uni-mainz.de/dokuwiki/start:working_on_mogon:workflow_organization:node_local_scheduling#running_on_several_hosts
(uh-oh - I need to clean up that site, many outdated sections there, but this
one should still be ok)
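Roughly, the pattern looks like this (my_app and its inputs are placeholders,
it assumes the job was submitted with --ntasks set, and the exact srun flags
may differ between SLURM versions):

# sketch: parallel acts as the semaphore, srun starts each job step on the
# nodes of the allocation; -N1 -n1 gives each step one task on one node
parallel --jobs "$SLURM_NTASKS" \
    srun --exclusive -N1 -n1 ./my_app {} ::: input_*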
One advantage: you can safely utilize the resources of both (or more) hosts -
the master host and all secondaries. How many resources you require depends on
your application and the work it does. Be sure to consider I/O (e.g. stage in
files to avoid random I/O from too many concurrent applications, etc.), if this
is an issue for your application.
Cheers
Christian
Christian,
My use of GNU parallel does not include ssh. Rather, I simply fill the SLURM
node with --jobs=ncores
That would require having an interactive job and
ncores_per_node/threads_per_application ssh connections, and you
would have to trigger the script manually. My solution is to use
parallel in a SLURM job context and avoid the human synchronization
step, whilst offering a potential multi-node job with SMP
applications. It's your choice, of course.
If I follow correctly, that is what I am doing. Here's my SLURM job:
#!/bin/bash
LOGDIR=/scratch/general/pe-nfs1/u0138544/logs
chmod a+x $LOGDIR/*
# arguments: run length in days, plus an optional existing task id
days=$1; shift
tid=$1; shift
if [[ "$tid"x == "x" ]]
then
JOBDIR=`mktemp --directory --tmpdir=$LOGDIR XXXXXX`
tid=$(basename $JOBDIR)
else
JOBDIR=$LOGDIR/$tid
mkdir -p $JOBDIR
fi
. /uufs/chpc.utah.edu/sys/installdir/sgspub/bin/sgsCP.sh
chmod -R a+rwx $JOBDIR
# compute the wall-clock end time handed to each worker
rnow=$(date +%s)
rsec=$(( $days * 24 * 3600 ))
endtime=$(( $rnow + $rsec ))
# use half of the logical CPUs reported by /proc/cpuinfo
cores=`grep -c processor /proc/cpuinfo`
cores=$(( $cores / 2 ))
# if the SLURM job is signalled (e.g. at the time limit), still open up the job dir
trap "chmod -R a+rw $JOBDIR" SIGCONT SIGTERM
# run up to $cores chaser-10Mt instances concurrently, logging progress to the joblog
parallel \
--joblog $JOBDIR/${tid}.ll \
--verbose \
--jobs $cores \
--delay 1 \
/uufs/chpc.utah.edu/sys/installdir/sgspub/bin/chaser-10Mt \
83a9a2ad-fe16-4872-b629-b9ba70ed5bbb $endtime $JOBDIR ::: {1..750}
chmod a+rw $JOBDIR/${tid}.ll
If the complete job finishes nicely, then I can read/write the job
log. The trap is there in case the SLURM job exceeds time limits.
But while things are running, I cannot look at the '.ll' file.
rjs