Hello, thank you for the insight. Well, the point is that I will get ~60 nodes with 120 NVMe disks in them, each about 2TB in size. That means I will have 240TB of NVMe SSD that could make a nice shared scratch. Moreover, I have no other hardware or place to put these SSDs into; they have to be in the compute nodes.
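To make it concrete, what I have in mind on the GPFS side is something FPO-like: a stanza file with write affinity enabled so that the first data replica stays on the node that writes it, and one failure group per node. Below is a rough, untested sketch loosely following the FPO examples in the docs; node names, devices and failure groups are just placeholders, and dedicating one SSD per node to metadata is only one way to slice it:

  # nvme.stanza -- placeholder names/devices, not a tested configuration
  %pool:
    pool=datapool
    blockSize=1M
    usage=dataOnly
    layoutMap=cluster
    allowWriteAffinity=yes
    writeAffinityDepth=1
    blockGroupFactor=128

  # per node: one SSD for metadata (system pool), one for write-affine data,
  # failure groups encoding the node
  %nsd: nsd=node01_meta device=/dev/nvme0n1 servers=node01 usage=metadataOnly failureGroup=1 pool=system
  %nsd: nsd=node01_data device=/dev/nvme1n1 servers=node01 usage=dataOnly failureGroup=1,0,1 pool=datapool
  # ... repeated for all 60 nodes ...

  mmcrnsd -F nvme.stanza
  mmcrfs scratch -F nvme.stanza -m 2 -M 3 -r 2 -R 3 -T /scratch

(-m/-r 2 would give two copies of everything, -M/-R 3 leaves room to go to three replicas later.) I do realize this still has the problem you describe: with 2 or even 3 replicas spread across 60 failure groups, the wrong combination of node failures will still take data, or worse metadata, offline. That is the part I am still not comfortable with.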
On Tue, Mar 13, 2018 at 10:48:21AM -0700, Alex Chekholko wrote:
> I would like to discourage you from building a large distributed
> clustered filesystem made of many unreliable components. You will need
> to overprovision your interconnect and will also spend a lot of time in
> "healing" or "degraded" state.
>
> It is typically cheaper to centralize the storage into a subset of
> nodes and configure those to be more highly available. E.g. of your 60
> nodes, take 8 and put all the storage into those and make that a
> dedicated GPFS cluster with no compute jobs on those nodes. Again,
> you'll still need really beefy and reliable interconnect to make this
> work.
>
> Stepping back; what is the actual problem you're trying to solve? I
> have certainly been in that situation before, where the problem is
> more like: "I have a fixed hardware configuration that I can't change,
> and I want to try to shoehorn a parallel filesystem onto that."
>
> I would recommend looking closer at your actual workloads. If this is
> a "scratch" filesystem and file access is mostly from one node at a
> time, it's not very useful to make two additional copies of that data
> on other nodes, and it will only slow you down.
>
> Regards,
> Alex
>
> On Tue, Mar 13, 2018 at 7:16 AM, Lukas Hejtmanek <xhejt...@ics.muni.cz>
> wrote:
>
> > On Tue, Mar 13, 2018 at 10:37:43AM +0000, John Hearns wrote:
> > > Lukas,
> > > It looks like you are proposing a setup which uses your compute
> > > servers as storage servers also?
> >
> > yes, exactly. I would like to utilise the NVMe SSDs that are in every
> > compute server. Using them as a shared scratch area with GPFS is one
> > of the options.
> >
> > > * I'm thinking about the following setup:
> > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB
> > > interconnected
> > >
> > > There is nothing wrong with this concept, for instance see
> > > https://www.beegfs.io/wiki/BeeOND
> > >
> > > I have an NVMe filesystem which uses 60 drives, but there are 10
> > > servers. You should look at "failure zones" also.
> >
> > you still need the storage servers then, and the local SSDs are used
> > only for caching, do I understand correctly?
> >
> > > From: gpfsug-discuss-boun...@spectrumscale.org
> > > [mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of
> > > Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
> > > Sent: Monday, March 12, 2018 4:14 PM
> > > To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
> > > Subject: Re: [gpfsug-discuss] Preferred NSD
> > >
> > > Hi Lukas,
> > >
> > > Check out FPO mode. That mimics Hadoop's data placement features.
> > > You can have up to 3 replicas of both data and metadata, but the
> > > downside, as you say, is that the wrong node failures will take
> > > your cluster down.
> > >
> > > You might want to check out something like Excelero's NVMesh (note:
> > > not an endorsement since I can't give such things) which can create
> > > logical volumes across all your NVMe drives. The product has
> > > erasure coding on their roadmap. I'm not sure if they've released
> > > that feature yet but in theory it will give better fault tolerance
> > > *and* you'll get more efficient usage of your SSDs.
> > >
> > > I'm sure there are other ways to skin this cat too.
> > >
> > > -Aaron
> > >
> > > On March 12, 2018 at 10:59:35 EDT, Lukas Hejtmanek
> > > <xhejt...@ics.muni.cz> wrote:
> > > Hello,
> > >
> > > I'm thinking about the following setup:
> > > ~ 60 nodes, each with two enterprise NVMe SSDs, FDR IB
> > > interconnected
> > >
> > > I would like to set up a shared scratch area using GPFS and those
> > > NVMe SSDs, each SSD as one NSD.
> > >
> > > I don't think 5 or more data/metadata replicas are practical here.
> > > On the other hand, multiple node failures are something really
> > > expected.
> > >
> > > Is there a way to instrument that the local NSD is strongly
> > > preferred to store data? I.e. a node failure most probably does not
> > > result in unavailable data for the other nodes?
> > >
> > > Or is there any other recommendation/solution to build shared
> > > scratch with GPFS in such a setup? (Including "do not do it".)
> > >
> > > --
> > > Lukáš Hejtmánek
> >
> > --
> > Lukáš Hejtmánek
> >
> > Linux Administrator only because
> > Full Time Multitasking Ninja
> > is not an official job title

--
Lukáš Hejtmánek

Linux Administrator only because
Full Time Multitasking Ninja
is not an official job title
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss