OK, here is how I (nearly) killed my cluster:
-- Story ---------------------------------------------------
Trying to see GNU parallel in action, I decided to repack collectl's
logfiles. On my system they grow to about 700-900 MB (raw) per day,
which comes down to about 150 MB (gzipped).
First I put them into a scratch dir and unpacked them. I know it
would have been possible to unpack and repack them in a single step;
I just wanted the machines to have some big data files to transfer
as well. :-)
Then I started GNU parallel to compress these raw files on five
32-core machines. As the files were in a local scratch directory, I
"had" to transfer them to the compute nodes. And there was my
mistake: I used a relative path to the files.
(I should add that the five compute nodes all have local scratch
dirs but share their home directories via NFS.)
And there we are: the uncompressed logfiles were transferred to the
compute nodes and placed in the NFS home dirs. In other words: the
files were in fact sent back to the head node.
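Just to illustrate the shape of the problem (this is only a sketch,
not the actual command line; host names and paths are made up):

    cd /local/scratch/collectl        # on the head node
    parallel -S node1,node2,node3,node4,node5 \
             --transfer --return {}.gz --cleanup \
             gzip -9 {} ::: *.raw

If I read the --transfer documentation right, a relative argument
like "somefile.raw" is placed relative to the remote working
directory, which defaults to the login directory, i.e. the
NFS-mounted home. Pointing --workdir (or the paths themselves) at
the nodes' local scratch would have kept the data off NFS.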
All six machines (head node and compute nodes) became unusable
quite soon. I guess the nodes cached the data for a while, so all
five of them had huge write buffers to flush out over NFS. :-)
To cut a long story short: killing parallel and rsync did not help;
the head node's nfsd processes were still very busy. I waited
several minutes, but the head node's load kept increasing and the
compute nodes remained unusable.
I had to hard-reset the nodes to get the head node back.
-- Question ------------------------------------------------
As I asked before in "issues with --load": Shouldn't we take
more care that (not-so-experienced) users do not overload
their machines by accident?
In this case: Shouldn't GNU parallel detect a situation like
this ("transfer to NFS homes") and exit with an error?
Thomas