OK, so the ECE recovery group is the four NSD servers with the system storage pool disks, and now I have to read the docs and work out how to define pdisks so that the protection is spread across the four servers. If each 4+3P stripe is 7 drives wide, three stripes account for 21 of the 24 disks; are the remaining three disks ones I can't do anything with, or are they used as rebuild space?
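From what I can tell from the mmvdisk documentation, you don't lay out the pdisks by hand at all: mmvdisk puts every drive in the server node class into the recovery group's declustered array and spreads the 4+3P stripes across the servers itself. A rough sketch of the flow, with placeholder names (ece_nc, rg1, vs_sys, gpfs1, nsd01..nsd04) and options that would still need to be checked against the Knowledge Center page Paul linked:

  # define the 4 servers as an ECE node class and apply the recommended server settings
  mmvdisk nodeclass create --node-class ece_nc -N nsd01,nsd02,nsd03,nsd04
  mmvdisk server configure --node-class ece_nc --recycle one

  # one recovery group over all 24 NVMe pdisks in the node class
  mmvdisk recoverygroup create --recovery-group rg1 --node-class ece_nc

  # a 4+3P vdisk set carved out of the recovery group, then the NSDs/vdisks and the filesystem
  mmvdisk vdiskset define --vdisk-set vs_sys --recovery-group rg1 --code 4+3p --block-size 4m --set-size 80%
  mmvdisk vdiskset create --vdisk-set vs_sys
  mmvdisk filesystem create --file-system gpfs1 --vdisk-set vs_sys

If that's right, the "three leftover disks" worry goes away, since the declustered array uses all the drives and keeps spare space for rebuilds spread across them.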
Can you provide me details of your six-node non-ECE configuration? Basically how the NSDs are defined... The remainder of our new filesystem will have a fast pool of 12 nodes of Excelero, and 2 PB of spinning disk, so another possibility would be to license four more nodes and put the system pool under Excelero.

 -- ddj

> On Jul 30, 2019, at 8:19 AM, Sanchez, Paul <[email protected]> wrote:
>
> Hi David,
>
> In an ECE configuration, it would be typical to put all of the NVMe disks in
> all 4 of your servers into a single recovery group. So in your case, all 24
> NVMe drives would be in one recovery group and the 4 servers would be "log
> group" servers in the recovery group, distributing the I/O load for the
> NSD/vdisks that are hosted on the RG. (The minimum disk count for a
> single-RG config is 12, and you meet that easily.)
>
> https://www.ibm.com/support/knowledgecenter/STXKQY_ECE_5.0.3/com.ibm.spectrum.scale.ece.v5r03.doc/b1lece_plan_recommendations.htm
> outlines the recommendations for raidCode protection. Your configuration (4
> nodes) would use vdisks with 4+3P, which gives you a slightly better
> capacity yield than RAID10 would, but with much better recovery
> characteristics:
>
> - No single failed node will result in a down system NSD.
> - No single drive failure will require a critical-priority rebuild; it can
>   be handled in the background without killing performance.
>
> So from that perspective, ECE is a win here and avoids a problem with the
> non-ECE, shared-nothing designs: the manual "mmchdisk <fsname> start -a"
> operation that is needed after any traditional shared-nothing metadata NSD
> goes offline, to bring it back and protect against further failures.
>
> Despite the operational challenges of the non-ECE design, it can sometimes
> survive two server failures (if the replication factor is 3 and the
> filesystem descriptor quorum wasn't lost by the two failures), which a
> 4-node ECE cluster cannot. Given that the world is complex and unexpected
> things can happen, I'd personally recommend redistributing the 24 disks
> across 6 servers if you can, so that the design could always survive 2 node
> failures. I've run this design and it's fairly robust.
>
> In any event, you should of course test the failure scenarios yourself
> before going into production, to validate them and familiarize yourself
> with the process. And a special note on ECE: due to the cooperative nature
> at the pdisk level, the network between the servers in the RG should be as
> reliable as possible, and any network redundancy should also be tested
> ahead of time.
>
> -Paul
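Running Paul's capacity point against our numbers (24 x 3 TB drives = 72 TB raw, ignoring GNR spare space, vdisk overhead and TB/TiB differences, so the real figures will be somewhat lower):

  4+3P erasure code:  4/7 of 72 TB, roughly 41 TB usable
  RAID10 / 2-way:     1/2 of 72 TB = 36 TB usable
  3-way replication:  1/3 of 72 TB = 24 TB usable

so the capacity yield argument for ECE over replicated shared-nothing metadata is clear.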
> From: [email protected] On Behalf Of David Johnson
> Sent: Tuesday, July 30, 2019 7:46 AM
> To: gpfsug main discussion list <[email protected]>
> Subject: Re: [gpfsug-discuss] Building GPFS filesystem system data pool on
> shared nothing NVMe drives
>
> Can we confirm the requirement for disks per RG? I have 4 RGs, but only
> 6 x 3TB NVMe drives per box.
>
> On Jul 29, 2019, at 1:34 PM, Luis Bolinches <[email protected]> wrote:
>
> Hi, from phone so sorry for typos.
>
> I really think you should look into Spectrum Scale Erasure Code Edition
> (ECE) for this.
>
> Sure, you could do a RAID on each node as you mention here, but that sounds
> like a lot of wasted storage capacity to me. Not to forget you get other
> goodies like end-to-end checksums and rapid rebuilds with ECE, among others.
>
> Four servers is the minimum requirement for ECE (4+3P) and, off the top of
> my head, 12 disks per RG; you are fine with both requirements.
>
> There is a presentation on the user group web page from London, May 2019,
> where we talk about ECE.
>
> And the IBM page for the product:
> https://www.ibm.com/support/knowledgecenter/STXKQY_ECE_5.0.3/com.ibm.spectrum.scale.ece.v5r03.doc/b1lece_intro.htm
>
> --
> Cheers
>
> On Jul 29, 2019, at 19:06, David Johnson <[email protected]> wrote:
>
> We are planning a 5.0.x upgrade onto new hardware to make use of the new
> 5.x GPFS features. The goal is to use up to four NSD nodes for metadata,
> each one with 6 NVMe drives (to be determined whether we use Intel VROC for
> RAID 5 or RAID 1, or just straight disks).
>
> So, questions:
> Has anyone done a system pool on a shared-nothing cluster? How did you set
> it up?
> With default metadata replication set at 3, can you make use of four NSD
> nodes effectively?
> How would one design the location vectors and failure groups so that the
> system metadata is spread evenly across the four servers?
>
> Thanks,
> -- ddj
> Dave Johnson
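For reference, the non-ECE, shared-nothing layout I originally had in mind would be roughly the stanza file below: one failure group per server and metadata-only NSDs in the system pool, so that "-m 3 -M 3" at mmcrfs time puts the three metadata replicas on three different servers. Device and NSD names here are made up, and this should be sanity-checked against a working config:

  %nsd: device=/dev/nvme0n1 nsd=md_nsd01_1 servers=nsd01 usage=metadataOnly failureGroup=1 pool=system
  %nsd: device=/dev/nvme1n1 nsd=md_nsd01_2 servers=nsd01 usage=metadataOnly failureGroup=1 pool=system
  %nsd: device=/dev/nvme0n1 nsd=md_nsd02_1 servers=nsd02 usage=metadataOnly failureGroup=2 pool=system
  %nsd: device=/dev/nvme1n1 nsd=md_nsd02_2 servers=nsd02 usage=metadataOnly failureGroup=2 pool=system
  (same pattern for the remaining drives on nsd03 / failureGroup=3 and nsd04 / failureGroup=4)

  mmcrnsd -F metadata_nsds.stanza
  mmcrfs gpfs1 -F metadata_nsds.stanza -m 3 -M 3 ...

As I understand it, with four failure groups and replication 3, GPFS rotates replica placement across the groups, so the fourth server still carries its share of metadata rather than sitting idle.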
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
