Re: [gpfsug-discuss] Building GPFS filesystem system data pool on shared nothing NVMe drives

Sanchez, Paul Tue, 30 Jul 2019 05:29:25 -0700

Hi David,

In an ECE configuration, it would be typical to put all of the NVMe disks in 
all 4 of your servers into a single recovery group.   So in your case, all 24 
NVMe drives would be in one recovery group and the 4 servers would be “log 
group” servers in the recovery group, distributing the I/O load for the 
NSD/vdisks that are hosted on the RG.  (The minimum number of disks for a 
single-RG configuration is 12, and you meet that easily.)


https://www.ibm.com/support/knowledgecenter/STXKQY_ECE_5.0.3/com.ibm.spectrum.scale.ece.v5r03.doc/b1lece_plan_recommendations.htm
outlines the recommendations for raidCode protection.  Your configuration (4 
nodes) would use vdisks with 4+3P, which gives you a slightly better capacity 
yield than RAID10 would, but with much better recovery characteristics:


  * No single failed node will result in a down system NSD.

  * No single drive failure will require a critical-priority rebuild; it can 
    be handled in the background without killing performance.
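
The capacity comparison between 4+3P and RAID10 mentioned above can be checked 
with simple arithmetic. This is only a rough sketch: it uses the 24 x 3TB 
drives from this thread, and it deliberately ignores spare space, log vdisks 
and other overhead that a real recovery group reserves, so the usable numbers 
in practice will be lower. The function names are made up for illustration.

```python
from fractions import Fraction


def erasure_code_yield(data_strips: int, parity_strips: int) -> Fraction:
    """Fraction of raw capacity holding user data for an N+P erasure code."""
    return Fraction(data_strips, data_strips + parity_strips)


def mirror_yield(copies: int) -> Fraction:
    """Fraction of raw capacity holding user data when every block is mirrored."""
    return Fraction(1, copies)


# Figures from this thread: 4 servers x 6 NVMe drives, 3 TB each.
RAW_DRIVES = 24
DRIVE_TB = 3
raw_tb = RAW_DRIVES * DRIVE_TB

ece = erasure_code_yield(4, 3)   # 4+3P
raid10 = mirror_yield(2)         # two-way mirror

# Assumption: no spare space or other overhead is subtracted here.
for name, y in (("4+3P", ece), ("RAID10", raid10)):
    print(f"{name}: {float(y):.1%} yield, about {float(raw_tb * y):.1f} TB of {raw_tb} TB raw")
```

The yield is 4/7 for 4+3P against 1/2 for RAID10, which is where "slightly 
better" comes from.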

So from that perspective, ECE is a win here and avoids a problem with the 
non-ECE, shared-nothing designs: the manual “mmchdisk <fsname> start -a” 
operation that is needed after any traditional shared-nothing metadata NSD goes 
offline to bring it back and protect against further failures.

Despite the operational challenges of the non-ECE design, it can sometimes 
survive two server failures (if replication factor is 3 and the filesystem 
descriptor quorum wasn’t lost by the two failures) which a 4 node ECE cluster 
cannot.  Given that the world is complex and unexpected things can happen, I’d 
personally recommend redistributing the 24 disks across 6 servers if you can, 
so that the design could always survive 2 node failures.  I’ve run this design 
and it’s fairly robust.
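
One way to see why 6 servers helps is a simplified counting argument. The 
sketch below assumes the 7 strips of a 4+3P stripe are spread as evenly as 
possible across the servers and that the fullest servers fail first. That even 
placement is an assumption made for illustration, not a description of how ECE 
actually places strips, so treat the result as a sanity check and refer to the 
IBM planning page above for the supported fault tolerance.

```python
def strips_per_node(total_strips: int, nodes: int) -> list[int]:
    """Spread strips as evenly as possible; per-node counts, largest first."""
    base, extra = divmod(total_strips, nodes)
    return [base + 1] * extra + [base] * (nodes - extra)


def max_node_failures(data: int, parity: int, nodes: int) -> int:
    """Worst-case number of whole-node failures a stripe survives.

    Assumption: a stripe survives as long as it loses no more strips
    than it has parity strips, and the fullest nodes fail first.
    """
    lost = 0
    tolerated = 0
    for count in strips_per_node(data + parity, nodes):
        lost += count
        if lost > parity:
            break
        tolerated += 1
    return tolerated


for n in (4, 6):
    print(f"{n} servers, 4+3P: survives {max_node_failures(4, 3, n)} node failure(s)")
```

Under this model, with 4 servers at least three of them hold 2 strips each, so 
losing any two of those costs 4 strips, more than the 3 parity strips can 
cover. With 6 servers the worst two-node loss costs 3 strips, which 4+3P can 
still absorb.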

In any event, you should of course test the failure scenarios yourself before 
going into production to validate them and familiarize yourself with the 
process.  And a special note on ECE: due to the cooperative nature at the pdisk 
level, the network between the servers in the RG should be as reliable as 
possible and any network redundancy should also be tested ahead of time.

-Paul

From: [email protected] 
<[email protected]> On Behalf Of David Johnson
Sent: Tuesday, July 30, 2019 7:46 AM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] Building GPFS filesystem system data pool on 
shared nothing NVMe drives


Can we confirm the requirement for disks per RG?  I have 4 RG, but only 6 x 3TB 
NVMe drives per box.


On Jul 29, 2019, at 1:34 PM, Luis Bolinches <[email protected]> wrote:

Hi, from phone so sorry for typos.

I really think you should look into Spectrum Scale Erasure Code Edition (ECE) 
for this.

Sure, you could do RAID on each node as you mention, but that sounds to me 
like a lot of wasted storage capacity. Not to forget that ECE gives you other 
goodies, such as end-to-end checksums and rapid rebuilds, among others.

Four servers is the minimum requirement for ECE (4+3P) and, off the top of my 
head, 12 disks per RG; you are fine with both requirements.

There is a presentation on ECE on the user group web page from London May 2019 
where we talk about ECE.

And the IBM page for the product: 
https://www.ibm.com/support/knowledgecenter/STXKQY_ECE_5.0.3/com.ibm.spectrum.scale.ece.v5r03.doc/b1lece_intro.htm
--
Cheers

On Jul 29, 2019, at 19:06, David Johnson <[email protected]> wrote:
We are planning a 5.0.x upgrade onto new hardware to make use of the new 5.x 
GPFS features.
The goal is to use up to four NSD nodes for metadata, each one with 6 NVMe 
drives (to be determined whether we use Intel VROC for RAID 5 or RAID 1, or 
just straight disks).

So questions —
Has anyone done system pool on shared nothing cluster?  How did you set it up?
With default metadata replication set at 3, can you make use of four NSD nodes 
effectively?
How would one design the location vectors and failure groups so that the system 
metadata is
spread evenly across the four servers?

Thanks,
— ddj
Dave Johnson
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

Ellei edellä ole toisin mainittu: / Unless stated otherwise above:
Oy IBM Finland Ab
PL 265, 00101 Helsinki, Finland
Business ID, Y-tunnus: 0195876-3
Registered in Finland