It looks like you are proposing a setup which also uses your compute servers 
as storage servers?

  *   I'm thinking about the following setup:
~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected

There is nothing wrong with this concept, for instance see

I have an NVMe filesystem which uses 60 drives, but there are 10 servers.
You should look at "failure zones" also.
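
GPFS calls these failure groups, and they are set per NSD. As a rough sketch
only, an NSD stanza file for mmcrnsd that puts each server's drives into their
own failure group might look like the fragment below. The node names, device
paths and NSD names are made up for illustration, and the choice of one
failure group per server is an assumption about what you want, not a
recommendation from the original setup.

```
# Hypothetical stanza file: one failure group per server, so that replicas
# of a block are placed on drives in different servers.
%nsd:
  device=/dev/nvme0n1
  nsd=node01_nvme0
  servers=node01
  usage=dataAndMetadata
  failureGroup=1
  pool=system

%nsd:
  device=/dev/nvme0n1
  nsd=node02_nvme0
  servers=node02
  usage=dataAndMetadata
  failureGroup=2
  pool=system
```

With replication enabled, GPFS places the copies of a block in different
failure groups, so losing every drive in one server should not remove all
copies of any block.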

Hi Lukas,

Check out FPO mode. That mimics Hadoop's data placement features. You can have 
up to 3 replicas of both data and metadata, but the downside, as you say, is 
that the wrong combination of node failures will still take your cluster down.
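
To put a rough number on "the wrong node failures", here is a back-of-envelope
sketch in Python. It assumes each block has its replicas on distinct nodes
chosen uniformly at random, which is a simplification and not how FPO write
affinity actually places data, and it treats blocks as independent. The
function names are just for illustration.

```python
from math import comb


def p_block_lost(nodes: int, replicas: int, failed: int) -> float:
    """Probability that every replica of one block sits on a failed node.

    Assumption: replicas live on distinct nodes picked uniformly at random.
    """
    if failed < replicas:
        # Fewer failed nodes than replicas: at least one copy survives.
        return 0.0
    return comb(failed, replicas) / comb(nodes, replicas)


def p_any_block_lost(nodes: int, replicas: int, failed: int, blocks: int) -> float:
    """Probability that at least one of `blocks` blocks loses all replicas.

    Assumption: blocks are placed independently of each other.
    """
    p = p_block_lost(nodes, replicas, failed)
    return 1.0 - (1.0 - p) ** blocks


if __name__ == "__main__":
    # 60 nodes as in the proposed setup, 3 replicas, 3 simultaneous failures.
    print(p_block_lost(60, 3, 3))
    print(p_any_block_lost(60, 3, 3, 1_000_000))
```

Under these assumptions a single block is very unlikely to be lost, but with
millions of blocks spread over the cluster, almost any three simultaneous
node failures will hit all copies of some block.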

You might want to check out something like Excelero's NVMesh (note: not an 
endorsement since I can't give such things) which can create logical volumes 
across all your NVMe drives. The product has erasure coding on its roadmap. 
I'm not sure if they've released that feature yet but in theory it will give 
better fault tolerance *and* you'll get more efficient usage of your SSDs.

I'm sure there are other ways to skin this cat too.


I would like to set up a shared scratch area using GPFS and those NVMe SSDs,
with each SSD as one NSD.

I don't think 5 or more data/metadata replicas are practical here. On the
other hand, multiple node failures are really to be expected.

Is there a way to instruct GPFS that the local NSD is strongly preferred for
storing data, i.e. so that a node failure most probably does not make data
unavailable to the other nodes?

Or is there any other recommendation/solution for building shared scratch with
GPFS in such a setup? ("Do not do it" included.)

Lukáš Hejtmánek
