Dear parallel developers,
On our HPC cluster, when jobs on different nodes use parallel
simultaneously, they abort with the following error message:
parallel: This should not happen. You have found a bug.
Please contact <[email protected]> and include:
* The version number: 20170522
* The bugid: loadavg_invalid_content: /home/thomasd/.parallel/tmp/sshlogin/:/loadavg
This is the command being run:
parallel --tmpdir /dev/shm/pbs.2448832.hpc-pbs --no-notice --load 100% \
  --delay 1 /home/thomasd/create_lut.sh -m 0 -l 4 -p 0 -s 2 -v {1} \
  -r {2} /home/thomasd/grid_hpc.cfg ::: {0..12} ::: {0..8}
When only one parallel process is running at a time, it works fine.
I think the parallel jobs on different nodes, which share the same
home directory, are accessing the same “loadavg” file.
As a workaround I can pass each parallel job a different
$PARALLEL_HOME environment variable, and this seems to avoid the
problem. When I look in those directories, they each contain a
directory named after the hostname of the node used for the job (i.e.
tmp/sshlogin/hpc-nXYZ), and a “:” directory (tmp/sshlogin/:).
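For reference, here is a minimal sketch of that workaround as it could appear at the top of a job script. The directory name "parallel-home.<job id>" and the variable job_id are only illustrative choices, and I am assuming that the scheduler sets $PBS_JOBID (with the hostname and shell PID as a fallback) and that $TMPDIR, if set, points to node-local storage:

```shell
# Pick an identifier that is unique per job. PBS_JOBID is assumed to be
# provided by the scheduler; otherwise fall back to hostname plus shell PID.
job_id="${PBS_JOBID:-$(hostname).$$}"

# Give this job its own parallel home directory (hypothetical location),
# so that jobs on different nodes no longer share one loadavg file.
export PARALLEL_HOME="${TMPDIR:-/tmp}/parallel-home.${job_id}"

# Create the directory up front in case parallel does not create it itself.
mkdir -p "$PARALLEL_HOME"
```

The parallel command shown above is then run unchanged after these lines.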
Sincerely,
Thomas Danckaert