OK, here is how I (nearly) killed my cluster:
-- Story ---------------------------------------------------
Trying to see GNU parallel in action, I decided to repack collectl's
logfiles. On my system they grow to about 700-900 MB (raw) per day,
which comes down to about 150 MB (gzipped).
First I put them into a scratch dir and unpacked them. I know it
would have been possible to unpack and repack them in a single step;
I just wanted the machines to have some big data files to transfer
as well. :-)
Then I started GNU parallel to compress these raw files on five
32-core machines. As the files were in a local scratch directory, I
"had" to transfer them to the compute nodes. And there was my
mistake: I used a relative path to the files.
(I should add that the five compute nodes all have local scratch
dirs but share their home directories via NFS.)
And there we are: the uncompressed logfiles were transferred to the
compute nodes and placed in the NFS home dirs. In other words: the
files were in fact sent back to the head node.
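Just to illustrate the shape of the problem (this is only a sketch,
not the actual command line; host names and paths are made up):

    cd /local/scratch/collectl        # on the head node
    parallel -S node1,node2,node3,node4,node5 \
             --transfer --return {}.gz --cleanup \
             gzip -9 {} ::: *.raw

If I read the --transfer documentation right, a relative argument
like "somefile.raw" is placed relative to the remote working
directory, which defaults to the login directory, i.e. the
NFS-mounted home. Pointing --workdir (or the paths themselves) at
the nodes' local scratch would have kept the data off NFS.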
All six machines (head node and compute nodes) became unusable
quite soon. I guess the nodes cached the data for a while, so all
five of them had huge write buffers to flush out over NFS. :-)
To cut a long story short: killing parallel and rsync did not help;
the head node's nfsd processes were still very busy. I waited
several minutes, but the head node's load kept increasing and the
compute nodes remained unusable.
I had to hard-reset the nodes to get the head node back.
-- Question ------------------------------------------------
As I asked before in "issues with --load": Shouldn't we take
more care that (not-so-experienced) users do not overload
their machines by accident?
In this case: Shouldn't GNU parallel detect a situation like
this ("transfer to NFS homes") and exit with an error?
Thomas