Hi Lukas,

I would like to discourage you from building a large distributed clustered filesystem made of many unreliable components. You will need to overprovision your interconnect and will also spend a lot of time in a "healing" or "degraded" state.
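As a rough illustration of the "degraded" point, here is a back-of-envelope sketch of how often at least one storage-bearing node is down. It assumes independent node failures and a hypothetical per-node availability of 99%; neither figure comes from this thread, and real failures are often correlated, so treat the numbers as indicative only.

```python
from math import comb


def prob_at_least_k_down(nodes: int, availability: float, k: int = 1) -> float:
    """Probability that at least k of `nodes` are down at the same moment.

    Assumes independent failures and the same availability for every node,
    which is a simplification (real failures are often correlated).
    """
    down = 1.0 - availability
    # Sum the binomial probabilities of having fewer than k nodes down,
    # then take the complement.
    fewer_than_k = sum(
        comb(nodes, i) * down**i * availability ** (nodes - i) for i in range(k)
    )
    return 1.0 - fewer_than_k


if __name__ == "__main__":
    # 0.99 is a hypothetical per-node availability, not a measured value.
    for n in (60, 8):
        p = prob_at_least_k_down(n, 0.99)
        print(f"{n} nodes: P(at least one down) = {p:.3f}")
```

Under these assumptions, a filesystem spread over all 60 nodes has at least one storage node down roughly 45% of the time, against roughly 8% for 8 dedicated nodes. Passing k=3 gives a feel for how often three nodes are down at once, which is the case that matters with 3 replicas.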
It is typically cheaper to centralize the storage into a subset of nodes and configure those to be more highly available. E.g. of your 60 nodes, take 8, put all the storage into those, and make that a dedicated GPFS cluster with no compute jobs on those nodes. Again, you'll still need a really beefy and reliable interconnect to make this work.

Stepping back: what is the actual problem you're trying to solve? I have certainly been in that situation before, where the problem is more like: "I have a fixed hardware configuration that I can't change, and I want to try to shoehorn a parallel filesystem onto that."

I would recommend looking closer at your actual workloads. If this is a "scratch" filesystem and file access is mostly from one node at a time, it's not very useful to make two additional copies of that data on other nodes, and it will only slow you down.

Regards,
Alex

On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek <[email protected]> wrote:

> On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote:
> > Lukas,
> > It looks like you are proposing a setup which uses your compute servers
> > as storage servers also?
>
> yes, exactly. I would like to utilise the NVMe SSDs that are in every
> compute server. Using them as a shared scratch area with GPFS is one of
> the options.
>
> > * I'm thinking about the following setup:
> > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected
> >
> > There is nothing wrong with this concept, for instance see
> > https://www.beegfs.io/wiki/BeeOND
> >
> > I have an NVMe filesystem which uses 60 drives, but there are 10 servers.
> > You should look at "failure zones" also.
>
> you still need the storage servers, and the local SSDs are used only for
> caching, do I understand correctly?
>
> > From: [email protected]
> > [mailto:gpfsug-discuss-[email protected]] On Behalf Of Knister, Aaron S.
> > (GSFC-606.2)[COMPUTER SCIENCE CORP]
> > Sent: Monday, March 12, 2018 4:14 PM
> > To: gpfsug main discussion list <[email protected]>
> > Subject: Re: [gpfsug-discuss] Preferred NSD
> >
> > Hi Lukas,
> >
> > Check out FPO mode. That mimics Hadoop's data placement features. You
> > can have up to 3 replicas of both data and metadata, but the downside,
> > as you say, is that the wrong node failures will still take your
> > cluster down.
> >
> > You might want to check out something like Excelero's NVMesh (note: not
> > an endorsement since I can't give such things) which can create logical
> > volumes across all your NVMe drives. The product has erasure coding on
> > their roadmap. I'm not sure if they've released that feature yet but in
> > theory it will give better fault tolerance *and* you'll get more
> > efficient usage of your SSDs.
> >
> > I'm sure there are other ways to skin this cat too.
> >
> > -Aaron
> >
> > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek <[email protected]
> > <mailto:[email protected]>> wrote:
> > Hello,
> >
> > I'm thinking about the following setup:
> > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB interconnected
> >
> > I would like to set up a shared scratch area using GPFS and those NVMe
> > SSDs. Each SSD as one NSD.
> >
> > I don't think 5 or more data/metadata replicas are practical here. On
> > the other hand, multiple node failures are really to be expected.
> >
> > Is there a way to ensure that the local NSD is strongly preferred for
> > storing data? I.e. that a node failure most probably does not result in
> > unavailable data for the other nodes?
> >
> > Or is there any other recommendation/solution to build shared scratch
> > with GPFS in such a setup? (Including "do not do it".)
> >
> > --
> > Lukáš Hejtmánek
> > _______________________________________________
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> --
> Lukáš Hejtmánek
>
> Linux Administrator only because
> Full Time Multitasking Ninja
> is not an official job title
