OK, so the ECE recovery group is the four NSD servers with the system storage pool disks, and now I have to read the docs and work out how to define pdisks so that the protection is spread across the four servers. If each 4+3P stripe is 7 drives wide, three stripes account for 21 of the 24 disks; are the remaining three disks ones I can't do anything with, or are they used as rebuild space?
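From what I can tell from the mmvdisk documentation, you don't lay out the pdisks by hand at all: mmvdisk puts every drive in the server node class into the recovery group's declustered array and spreads the 4+3P stripes across the servers itself. A rough sketch of the flow, with placeholder names (ece_nc, rg1, vs_sys, gpfs1, nsd01..nsd04) and options that would still need to be checked against the Knowledge Center page Paul linked:

  # define the 4 servers as an ECE node class and apply the recommended server settings
  mmvdisk nodeclass create --node-class ece_nc -N nsd01,nsd02,nsd03,nsd04
  mmvdisk server configure --node-class ece_nc --recycle one

  # one recovery group over all 24 NVMe pdisks in the node class
  mmvdisk recoverygroup create --recovery-group rg1 --node-class ece_nc

  # a 4+3P vdisk set carved out of the recovery group, then the NSDs/vdisks and the filesystem
  mmvdisk vdiskset define --vdisk-set vs_sys --recovery-group rg1 --code 4+3p --block-size 4m --set-size 80%
  mmvdisk vdiskset create --vdisk-set vs_sys
  mmvdisk filesystem create --file-system gpfs1 --vdisk-set vs_sys

If that's right, the "three leftover disks" worry goes away, since the declustered array uses all the drives and keeps spare space for rebuilds spread across them.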
Can you provide me details of your six-node non-ECE configuration? Basically how the NSDs are defined... The remainder of our new filesystem will have a fast pool of 12 nodes of Excelero, and 2 PB of spinning disk, so another possibility would be to license four more nodes and put the system pool under Excelero.

 -- ddj

> On Jul 30, 2019, at 8:19 AM, Sanchez, Paul <[email protected]> wrote:
>
> Hi David,
>
> In an ECE configuration, it would be typical to put all of the NVMe disks in
> all 4 of your servers into a single recovery group. So in your case, all 24
> NVMe drives would be in one recovery group and the 4 servers would be "log
> group" servers in the recovery group, distributing the I/O load for the
> NSD/vdisks that are hosted on the RG. (The minimum disk count for a
> single-RG config is 12, and you meet that easily.)
>
> https://www.ibm.com/support/knowledgecenter/STXKQY_ECE_5.0.3/com.ibm.spectrum.scale.ece.v5r03.doc/b1lece_plan_recommendations.htm
> outlines the recommendations for raidCode protection. Your configuration (4
> nodes) would use vdisks with 4+3P, which gives you a slightly better
> capacity yield than RAID10 would, but with much better recovery
> characteristics:
>
> - No single failed node will result in a down system NSD.
> - No single drive failure will require a critical-priority rebuild; it can
>   be handled in the background without killing performance.
>
> So from that perspective, ECE is a win here and avoids a problem with the
> non-ECE, shared-nothing designs: the manual "mmchdisk <fsname> start -a"
> operation that is needed after any traditional shared-nothing metadata NSD
> goes offline, to bring it back and protect against further failures.
>
> Despite the operational challenges of the non-ECE design, it can sometimes
> survive two server failures (if the replication factor is 3 and the
> filesystem descriptor quorum wasn't lost by the two failures), which a
> 4-node ECE cluster cannot. Given that the world is complex and unexpected
> things can happen, I'd personally recommend redistributing the 24 disks
> across 6 servers if you can, so that the design could always survive 2 node
> failures. I've run this design and it's fairly robust.
>
> In any event, you should of course test the failure scenarios yourself
> before going into production, to validate them and familiarize yourself
> with the process. And a special note on ECE: due to the cooperative nature
> at the pdisk level, the network between the servers in the RG should be as
> reliable as possible, and any network redundancy should also be tested
> ahead of time.
>
> -Paul
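Running Paul's capacity point against our numbers (24 x 3 TB drives = 72 TB raw, ignoring GNR spare space, vdisk overhead and TB/TiB differences, so the real figures will be somewhat lower):

  4+3P erasure code:  4/7 of 72 TB, roughly 41 TB usable
  RAID10 / 2-way:     1/2 of 72 TB = 36 TB usable
  3-way replication:  1/3 of 72 TB = 24 TB usable

so the capacity yield argument for ECE over replicated shared-nothing metadata is clear.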
> From: [email protected] On Behalf Of David Johnson
> Sent: Tuesday, July 30, 2019 7:46 AM
> To: gpfsug main discussion list <[email protected]>
> Subject: Re: [gpfsug-discuss] Building GPFS filesystem system data pool on
> shared nothing NVMe drives
>
> Can we confirm the requirement for disks per RG? I have 4 RGs, but only
> 6 x 3TB NVMe drives per box.
>
> On Jul 29, 2019, at 1:34 PM, Luis Bolinches <[email protected]> wrote:
>
> Hi, from phone so sorry for typos.
>
> I really think you should look into Spectrum Scale Erasure Code Edition
> (ECE) for this.
>
> Sure, you could do a RAID on each node as you mention here, but that sounds
> like a lot of wasted storage capacity to me. Not to forget you get other
> goodies like end-to-end checksums and rapid rebuilds with ECE, among others.
>
> Four servers is the minimum requirement for ECE (4+3P) and, off the top of
> my head, 12 disks per RG; you are fine with both requirements.
>
> There is a presentation on the user group web page from London, May 2019,
> where we talk about ECE.
>
> And the IBM page for the product:
> https://www.ibm.com/support/knowledgecenter/STXKQY_ECE_5.0.3/com.ibm.spectrum.scale.ece.v5r03.doc/b1lece_intro.htm
>
> --
> Cheers
>
> On Jul 29, 2019, at 19:06, David Johnson <[email protected]> wrote:
>
> We are planning a 5.0.x upgrade onto new hardware to make use of the new
> 5.x GPFS features. The goal is to use up to four NSD nodes for metadata,
> each one with 6 NVMe drives (to be determined whether we use Intel VROC for
> RAID 5 or RAID 1, or just straight disks).
>
> So, questions:
> Has anyone done a system pool on a shared-nothing cluster? How did you set
> it up?
> With default metadata replication set at 3, can you make use of four NSD
> nodes effectively?
> How would one design the location vectors and failure groups so that the
> system metadata is spread evenly across the four servers?
>
> Thanks,
> -- ddj
> Dave Johnson
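For reference, the non-ECE, shared-nothing layout I originally had in mind would be roughly the stanza file below: one failure group per server and metadata-only NSDs in the system pool, so that "-m 3 -M 3" at mmcrfs time puts the three metadata replicas on three different servers. Device and NSD names here are made up, and this should be sanity-checked against a working config:

  %nsd: device=/dev/nvme0n1 nsd=md_nsd01_1 servers=nsd01 usage=metadataOnly failureGroup=1 pool=system
  %nsd: device=/dev/nvme1n1 nsd=md_nsd01_2 servers=nsd01 usage=metadataOnly failureGroup=1 pool=system
  %nsd: device=/dev/nvme0n1 nsd=md_nsd02_1 servers=nsd02 usage=metadataOnly failureGroup=2 pool=system
  %nsd: device=/dev/nvme1n1 nsd=md_nsd02_2 servers=nsd02 usage=metadataOnly failureGroup=2 pool=system
  (same pattern for the remaining drives on nsd03 / failureGroup=3 and nsd04 / failureGroup=4)

  mmcrnsd -F metadata_nsds.stanza
  mmcrfs gpfs1 -F metadata_nsds.stanza -m 3 -M 3 ...

As I understand it, with four failure groups and replication 3, GPFS rotates replica placement across the groups, so the fourth server still carries its share of metadata rather than sitting idle.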
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
