Hi all,

Reposting here from Stack Overflow upon request. Since the functionality
doesn't currently exist, I guess this could be described as an enhancement
suggestion. Let me describe the use case, and then the functionality I am
proposing.

My GNU parallel use case is mostly to manage batch processing within SLURM
on an HPC cluster. I know a few others in my community who do the same,
mostly at NERSC, whose documentation suggests it (
https://docs.nersc.gov/jobs/workflow/gnuparallel/). However, many of the
larger academic computing groups have group-owned machines on the cluster
(which are outside of SLURM control) or have access to multiple different
queues. I think it would be nice to be able to create the parent GNU
parallel process on a machine that you own (so it is always running) and,
when a SLURM allocation is granted on one queue or another, have the
machines in that allocation add their addresses to the nodelist of the GNU
parallel job. This would allow the job to keep running and make maximal use
of fluctuating resources.
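
As a rough illustration of the setup described above, here is a minimal
Python sketch of how a freshly granted allocation could add itself to a
shared nodelist file. It assumes the "nodelist" is the file passed to GNU
parallel with --sshloginfile (one sshlogin per line, optionally prefixed
with a slot count as in 8/hostname, with "#" lines ignored). The file name
and the helpers register_host and unregister_host are hypothetical, and
there is no locking, so two nodes registering at the same instant could
race.

```python
import os
import tempfile
from pathlib import Path


def read_hosts(nodefile):
    """Return the non-empty, non-comment lines of the nodelist file."""
    path = Path(nodefile)
    if not path.exists():
        return []
    lines = [line.strip() for line in path.read_text().splitlines()]
    return [line for line in lines if line and not line.startswith("#")]


def write_hosts(nodefile, hosts):
    """Replace the nodelist atomically so a reader never sees a partial file."""
    path = Path(nodefile)
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    with os.fdopen(fd, "w") as handle:
        handle.write("".join(host + "\n" for host in hosts))
    os.replace(tmp, path)


def register_host(nodefile, host, slots=None):
    """Add a host (hypothetical helper); slots becomes the 'N/' prefix."""
    entry = f"{slots}/{host}" if slots else host
    hosts = read_hosts(nodefile)
    if entry not in hosts:  # no locking: concurrent writers may race
        write_hosts(nodefile, hosts + [entry])
    return entry


def unregister_host(nodefile, entry):
    """Remove an entry, e.g. shortly before the allocation expires."""
    write_hosts(nodefile, [h for h in read_hosts(nodefile) if h != entry])
```

A SLURM batch script on each granted node might call something like
register_host("nodes.txt", socket.gethostname(), slots=8) at startup, while
the parent runs with --sshloginfile nodes.txt on the group-owned machine.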

I think the only "feature" really needed to make this possible is a flag
that sets how frequently the nodelist is checked. As far as I can tell, the
parent process only checks the nodelist again after a task returns. My
tasks often run 8h+, and I wouldn't want a new allocation to sit idle for
up to 8h waiting for that.
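
To make the requested behavior concrete, here is a small Python sketch of
checking a nodelist file on a fixed interval instead of on task return. It
only illustrates the idea and says nothing about how GNU parallel is
implemented; NodelistWatcher, watch, interval_seconds and max_polls are
hypothetical names.

```python
import time
from pathlib import Path


class NodelistWatcher:
    """Hypothetical helper: report nodelist changes whenever poll() is called."""

    def __init__(self, nodefile):
        self.path = Path(nodefile)
        self.known = set()

    def _read(self):
        # Ignore blank lines and '#' comments; a missing file means no hosts.
        if not self.path.exists():
            return set()
        lines = (line.strip() for line in self.path.read_text().splitlines())
        return {line for line in lines if line and not line.startswith("#")}

    def poll(self):
        """Return (added, removed) hosts since the previous poll."""
        current = self._read()
        added = current - self.known
        removed = self.known - current
        self.known = current
        return added, removed


def watch(nodefile, interval_seconds, on_change, max_polls=None):
    """Poll on a timer, independent of when any task finishes."""
    watcher = NodelistWatcher(nodefile)
    polls = 0
    while max_polls is None or polls < max_polls:
        added, removed = watcher.poll()
        if added or removed:
            on_change(added, removed)
        polls += 1
        time.sleep(interval_seconds)
```

With an interval of, say, 60 seconds, a newly registered node would be
noticed within about a minute rather than after the next 8h task returns.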

I would be interested to hear whether other people have similar use cases
or would benefit, and how hard it would be to add this functionality.

Thanks,
Andrew Saydjari
