On 11/10/22 12:49, Ken Mankoff wrote:
Hello,

I'm trying to run parallel on multiple nodes. Each node may have a different 
number of CPUs. It appears the best syntax for this is the one shown in the 
--slf section of the man page:

8/my-8-cpu-server.example.com
2/my_other_username@my-dualcore.example.net

My problem is that I'm running in the SLURM environment. I can get the 
hostnames with

scontrol show hostnames "$SLURM_JOB_NODELIST" > nodelist.0

But I cannot easily get the CPUS-per-node. From the SLURM docs,

SLURM_JOB_CPUS_PER_NODE: Count of CPUs available to the job on the nodes in the 
allocation, using the format CPU_count[(xnumber_of_nodes)][,CPU_count[(xnumber_of_nodes)]...]. 
For example: SLURM_JOB_CPUS_PER_NODE='72(x2),36' indicates that on the first 
and second nodes (as listed by SLURM_JOB_NODELIST) the allocation has 72 CPUs, 
while the third node has 36 CPUs.

So, parsing '72(x2),36' seems complicated.
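
For concreteness, the parsing could perhaps look something like this in bash (a 
rough, untested sketch; "sshloginfile" is just a made-up file name):

   # Expand SLURM_JOB_CPUS_PER_NODE, e.g. '72(x2),36' -> 72 72 36,
   # then pair each count with the matching hostname to get
   # "CPUs/host" lines for --slf.
   scontrol show hostnames "$SLURM_JOB_NODELIST" > nodelist.0

   counts=()
   IFS=',' read -ra fields <<< "$SLURM_JOB_CPUS_PER_NODE"
   for f in "${fields[@]}"; do
       if [[ $f =~ ^([0-9]+)\(x([0-9]+)\)$ ]]; then
           for ((i = 0; i < BASH_REMATCH[2]; i++)); do
               counts+=("${BASH_REMATCH[1]}")
           done
       else
           counts+=("$f")   # plain count, no (xN) repeat
       fi
   done

   # One "count/host" line per node, ready for parallel --slf
   paste -d/ <(printf '%s\n' "${counts[@]}") nodelist.0 > sshloginfile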

If I requested a total of 1000 tasks but have no control over how many nodes, can I just 
call parallel with -j1000 and pass it a hostfile without the "CPUs/" prefix on 
each hostname? Would parallel then start however many jobs it can per node, so that if for 
some reason I was allocated 1000 CPUs on 1 node, that would work fine, as would 1 CPU on 
each of 1000 different nodes?
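
To make that alternative concrete, I mean something like this (untested; "mycmd" 
and the task arguments stand in for the real command, and as I understand it 
parallel must be installed on each node so it can detect the CPU count when no 
"N/" prefix is given):

   scontrol show hostnames "$SLURM_JOB_NODELIST" > nodelist.0
   # No "N/" prefix: parallel detects each host's CPU count itself
   parallel -j1000 --slf nodelist.0 mycmd ::: task1 task2 task3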

Thanks,

   -k.

I do this in my SLURM batch script to get the number of jobs I want to run (it turns out it's better for me not to load the full hyper-threaded count):

   cores=$(grep -c processor /proc/cpuinfo)   # logical CPUs, incl. hyper-threads
   cores=$(( cores / 2 ))                     # physical cores only

   parallel --jobs "$cores" etc :::: <file with list of jobs>

or sometimes I run the same job many times with

   parallel --jobs "$cores" etc ::: {1..300}
