Thanks Paul for jumping in.
 
@David
 
As Paul mentioned, if you have used ESS before (Spectrum Scale Native RAID), ECE extends Native RAID over the network. So when you define a pdisk (an NVMe device in your case) to be part of a vdisk with a certain RAID protection (4+3p for your 4 nodes, or even 6 nodes as Paul suggested), the RAID logical blocks are spread across multiple pdisks (all of them, in your case). Then you create a normal NSD from each vdisk and use it as usual inside a filesystem.
 
X pdisks (block/NVMe devices) <-> Y vdisks (RAID devices) <-> Y NSDs
 
So the vdisk, which is the software RAID, becomes the block device that the NSD uses, instead of a traditional native block device (HDD, LUN, SSD, NVMe, ...).
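To make that mapping a bit more concrete, a minimal sketch with mmvdisk could look like the lines below. The names (rg1, vs1, fs1) and the values for --block-size and --set-size are just placeholders I made up; check the mmvdisk documentation for the options that fit your setup:

    # define a vdisk set with 4+3p protection over recovery group rg1;
    # each member vdisk becomes an NSD when created
    mmvdisk vdiskset define --vdisk-set vs1 --recovery-group rg1 --code 4+3p --block-size 4m --set-size 80%

    # create the vdisks/NSDs and put a filesystem on top of them
    mmvdisk vdiskset create --vdisk-set vs1
    mmvdisk filesystem create --file-system fs1 --vdisk-set vs1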
 
Much of the complexity of this has been taken away by mmvdisk (thanks to the people involved in that!). I would really recommend ECE for this setup; Paul has already highlighted the operational and non-operational advantages. For your hardware, if each node individually passes the checks (memory, CPU, ...) in https://github.com/IBM/SpectrumScale_ECE_OS_READINESS you would be a really good candidate for ECE (it has been GA since last June).
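Running the readiness check is roughly the following on each node (the script name below is what the repository README points to at the time of writing; double check there before running, and run it as root):

    git clone https://github.com/IBM/SpectrumScale_ECE_OS_READINESS.git
    cd SpectrumScale_ECE_OS_READINESS
    # reports whether CPU, memory, NVMe drives and network meet the ECE minimums for this node
    ./mces.py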
 
If you need a contact from IBM in the US to follow up, I am pretty sure we can arrange that too.
 
 
--
Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations

Luis Bolinches
Certified Consultant IT Specialist
Mobile Phone: +358503112585
https://www.youracclaim.com/user/luis-bolinches

"If you always give you will always have" --  Anonymous
 
 
 
----- Original message -----
From: "Sanchez, Paul" <[email protected]>
Sent by: [email protected]
To: gpfsug main discussion list <[email protected]>
Cc:
Subject: [EXTERNAL] Re: [gpfsug-discuss] Building GPFS filesystem system data pool on shared nothing NVMe drives
Date: Tue, Jul 30, 2019 5:36 PM
 

Yes, read the documentation for the mmvdisk command to get a better idea of how this actually works.  There's definitely a small paradigm shift in thinking that you'll encounter between traditional NSDs and ECE.  The mmvdisk command's defaults handle just about everything for you, and the defaults are sane.

 

The short answer is that in a 6-node ECE recovery group, each of the nodes will normally provide access to 2 NSDs, so the system pool would have 12 NSDs in total.  If a node fails, each of its two NSDs will continue to be served by a different one of the remaining servers.  If you've used ESS before, you can think about the RAID/spare/rebuild space similarly: all drives have some spare space on them, and erasure-code stripes are evenly distributed among all of the disks so that they are all used and fill at approximately the same rate.
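For reference, the recovery group itself is created along these lines with mmvdisk (node names and the ece_nc1/ece_rg1 names are placeholders, and options can differ between releases, so treat this as a sketch rather than a recipe):

    # group the NSD servers into an mmvdisk node class
    mmvdisk nodeclass create --node-class ece_nc1 -N node1,node2,node3,node4,node5,node6

    # apply the recommended ECE server settings (restarting one node at a time)
    mmvdisk server configure --node-class ece_nc1 --recycle one

    # create one recovery group spanning all pdisks on those nodes
    mmvdisk recoverygroup create --recovery-group ece_rg1 --node-class ece_nc1

    # sanity check: list the recovery group and its servers
    mmvdisk recoverygroup list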

 

From: [email protected] <[email protected]> On Behalf Of David Johnson
Sent: Tuesday, July 30, 2019 9:04 AM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] Building GPFS filesystem system data pool on shared nothing NVMe drives

 


OK, so the ECE recovery group is the four NSD servers with the System storage pool disks, and somehow I have to read the docs and find out how to define pdisks that spread the replication across the four servers, but three disks at a time. Three pdisks of 7 drives, three I can't do anything with, or are those for rebuilding space?

 

Can you provide me with details of your six-node non-ECE configuration?  Basically, how the NSDs are defined...

 

The remainder of our new filesystem will have a fast pool of 12 nodes of Excelero, and 2 PB of spinning disks, so another possibility would be to license four more nodes and put the system pool under Excelero.

 

 -- ddj



On Jul 30, 2019, at 8:19 AM, Sanchez, Paul <[email protected]> wrote:

 

Hi David,

 

In an ECE configuration, it would be typical to put all of the NVMe disks in all 4 of your servers into a single recovery group.  So in your case, all 24 NVMe drives would be in one recovery group and the 4 servers would be “log group” servers in the recovery group, distributing the I/O load for the NSDs/vdisks that are hosted on the RG.  (The minimum number of disks for a single-RG config is 12, and you meet that easily.)

 

The ECE documentation outlines the recommendations for raidCode protection.  Your configuration (4 nodes) would use vdisks with 4+3P, which gives you a slightly better capacity yield than RAID10 would (see the quick arithmetic below the list), but with much better recovery characteristics:

 

- No single failed node will result in a down system NSD.
- No single drive failure will require a critical-priority rebuild; it can be handled in the background without killing performance.
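(Quick arithmetic on that capacity claim: a 4+3P stripe writes 7 strips for every 4 data strips, so you get roughly 4/7, about 57%, of the raw capacity as usable space before spare space is set aside, versus 50% for RAID10 mirroring and about 33% for 3-way replication in a traditional shared-nothing metadata design.)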

 

So from that perspective, ECE is a win here and avoids a problem with the non-ECE, shared-nothing designs: the manual “mmchdisk <fsname> start -a” operation that is needed to bring any traditional shared-nothing metadata NSD back online after it goes down and to protect against further failures.
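To make that concrete, the recovery in the non-ECE design looks something like this (fs1 is a placeholder filesystem name):

    # show any NSDs that are not both "up" and "ready"
    mmlsdisk fs1 -e

    # bring the down disks back online; GPFS repairs stale replicas as part of this
    mmchdisk fs1 start -a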

 

Despite the operational challenges of the non-ECE design, it can sometimes survive two server failures (if the replication factor is 3 and the filesystem descriptor quorum wasn’t lost by the two failures), which a 4-node ECE cluster cannot.  Given that the world is complex and unexpected things can happen, I’d personally recommend redistributing the 24 disks across 6 servers if you can, so that the design could always survive 2 node failures.  I’ve run this design and it’s fairly robust.
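If you do go the traditional shared-nothing route with 6 servers, spreading the metadata comes down to giving each server's disks a distinct failure group in the NSD stanza file; a rough sketch (device paths, NSD names and node names are made up) would be:

    %nsd: device=/dev/nvme0n1 nsd=md_node1_1 servers=node1 usage=metadataOnly failureGroup=1 pool=system
    %nsd: device=/dev/nvme1n1 nsd=md_node1_2 servers=node1 usage=metadataOnly failureGroup=1 pool=system
    %nsd: device=/dev/nvme0n1 nsd=md_node2_1 servers=node2 usage=metadataOnly failureGroup=2 pool=system
    ... (and so on, one failure group per server, up to node6)

With the filesystem created with 3 metadata replicas (-m 3), GPFS places each replica in a different failure group, i.e. on a different server.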

 

In any event, you should of course test the failure scenarios yourself before going into production to validate them and familiarize yourself with the process.  And a special note on ECE: due to the cooperative nature at the pdisk level, the network between the servers in the RG should be as reliable as possible and any network redundancy should also be tested ahead of time.

 

-Paul

 

From: [email protected] <[email protected]> On Behalf Of David Johnson
Sent: Tuesday, July 30, 2019 7:46 AM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] Building GPFS filesystem system data pool on shared nothing NVMe drives

 


Can we confirm the requirement for disks per RG?  I have 4 RG, but only 6 x 3TB NVMe drives per box.




On Jul 29, 2019, at 1:34 PM, Luis Bolinches <[email protected]> wrote:

 

Hi, from phone so sorry for typos.  

 

I really think you should look into Spectrum Scale Erasure Code Edition (ECE) for this. 

 

Sure, you could do RAID on each node as you mention here, but that sounds like a lot of wasted storage capacity to me. Not to forget that with ECE you get other goodies like end-to-end checksums and rapid rebuilds, among others. 

 

Four servers is the minimum requirement for ECE (4+3p) and, off the top of my head, 12 disks per RG; you are fine with both requirements. 

There is a presentation on ECE on the user group web page, from London May 2019, where we talk about ECE. 

 

-- 

Cheers


El 29 jul 2019, a las 19:06, David Johnson <[email protected]> escribió:

We are planning a 5.0.x upgrade onto new hardware to make use of the new 5.x GPFS features.
The goal is to use up to four NSD nodes for metadata, each one with 6 NVMe drives (to be determined
whether we use Intel VROC for RAID 5 or RAID 1, or just straight disks).  

So questions — 
Has anyone done system pool on shared nothing cluster?  How did you set it up?
With default metadata replication set at 3, can you make use of four NSD nodes effectively?
How would one design the location vectors and failure groups so that the system metadata is
spread evenly across the four servers?

Thanks,
— ddj
Dave Johnson
 

Ellei edellä ole toisin mainittu: / Unless stated otherwise above:
Oy IBM Finland Ab
PL 265, 00101 Helsinki, Finland
Business ID, Y-tunnus: 0195876-3
Registered in Finland

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
