I thought reads were always round robin (in some form) unless you set readReplicaPolicy, and I thought that fsstruct errors had to be fixed with an offline mmfsck.

Simon
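For reference, a quick sketch of checking or changing that policy and of running the offline check mentioned above (gpfs0 is only a stand-in file system name, and the exact flags are from memory, so verify against the man pages):

mmlsconfig readReplicaPolicy           # show the current read replica policy
mmchconfig readReplicaPolicy=local -i  # e.g. prefer "local" replicas; -i applies it immediately

mmumount gpfs0 -a                      # an offline mmfsck needs the FS unmounted everywhere
mmfsck gpfs0                           # check; add -y to let it repair fsstruct errors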
________________________________________
From: [email protected] [[email protected]] on behalf of Aaron Knister [[email protected]]
Sent: 06 September 2018 18:06
To: [email protected]
Subject: Re: [gpfsug-discuss] RAID type for system pool

Answers inline, based on my recollection of experiences we've had here:

On 9/6/18 12:19 PM, Bryan Banister wrote:
> I have questions about how GPFS metadata replication of 3 works.
>
> 1. Is it basically the same as replication of 2 but with one more
>    copy, making recovery much more likely?

That's my understanding.

> 2. If there is nothing checking that the data was correctly read off
>    of the device (e.g. CRC checking ON READS like the DDNs do, T10-PI
>    or Data Integrity Field), then how does GPFS handle a corrupted
>    read of the data?
>    - unlikely with SSD, but a head could be off on an NL-SAS read: no
>      errors, but you get some garbage instead, plus no auto retries

The inode itself is checksummed:

# /usr/lpp/mmfs/bin/tsdbfs mysuperawesomespacefs
Enter command or null to read next sector.  Type ? for help.
inode 20087366
Inode 20087366 [20087366] snap 0 (index 582 in block 9808):
  Inode address: 30:263275078 32:263264838 size 512 nAddrs 32
  indirectionLevel=3 status=USERFILE
  objectVersion=49352 generation=0x2B519B3 nlink=1
  owner uid=8675309 gid=999 mode=0200100600: -rw-------
  blocksize code=5 (32 subblocks) lastBlockSubblocks=1
  checksum=0xF2EF3427 is Valid
  ...
  Disk pointers [32]:
    0: 31:217629376   1: 30:217632960   2: (null)
    ...
    31: (null)

as are indirect blocks (and I'm sure that's not an exhaustive list of
checksummed metadata structures):

ind 31:217629376
Indirect block starting in sector 31:217629376:
  magic=0x112DF307 generation=0x2B519B3 blockNum=0 inodeNum=20087366
  indirection level=2
  checksum=0x6BDAA92A CalcChecksum(0x5B6DC9FC000, 32768, 20)=0x6BDAA92A
  Data pointers:

> 3. Does GPFS read at least two of the three replicas and compare them
>    to ensure the data is correct?
>    - expensive operation, so very unlikely

I don't know, but I do know it verifies the checksum, and I believe that
if the checksum is wrong it will try another replica.

> 4. If not reading multiple replicas for comparison, are reads round
>    robin across all three copies?

I feel like we see a pretty even distribution of reads across all
replicas of our metadata LUNs, although this is looking overall at the
array level, so it may be a red herring.

> 5. If one replica is corrupted (bad blocks), what does GPFS do to
>    recover this metadata copy? Is this automatic, or does it require
>    a manual `mmrestripefs -c` operation or something?
>    - If not, it seems like a pretty simple idea and maybe an RFE-worthy
>      submission

My experience has been that it will attempt to correct it (and maybe log
an fsstruct error?). That was in the 3.5 days, though.

> 6. Would an option to run "background scrub/verifies" of the
>    data/metadata be worthwhile to ensure there are no hidden bad blocks?
>    - Using QoS this should be relatively painless

If you don't have array-level background scrubbing, this is what I'd
suggest (e.g. mmrestripefs -c --metadata-only); see the sketch after
question 7 below.

> 7. With a drive failure, do you have to delete the NSD from the file
>    system and cluster, recreate the NSD, add it back to the FS, and
>    then run the `mmrestripefs -c` operation again to restore the
>    replication?
>    - As Kevin mentions, this ends up being a FULL file system scan
>      vs. a block-based scan and replication. That could take a long
>      time depending on the number of inodes and type of storage!
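For 6 and 7, a rough sketch of the commands involved (gpfs0, the NSD name, servers and device path are all made up, and the exact option syntax is from memory, so check the man pages before relying on it):

# 6. throttle maintenance traffic with QoS, then scrub/repair metadata replicas
mmchqos gpfs0 --enable pool=system,maintenance=300IOPS,other=unlimited
mmrestripefs gpfs0 -c --metadata-only --qos maintenance

# /tmp/meta07.stanza -- replacement NSD definition
%nsd: nsd=meta_ssd_07 device=/dev/sdx servers=nsd07,nsd08 usage=metadataOnly failureGroup=7 pool=system

# 7. replace a failed metadata drive and restore replication
mmdeldisk gpfs0 meta_ssd_07               # remove the failed disk from the file system
mmdelnsd meta_ssd_07                      # delete the old NSD definition
mmcrnsd -F /tmp/meta07.stanza             # define the replacement NSD
mmadddisk gpfs0 -F /tmp/meta07.stanza     # add it back to the file system
mmrestripefs gpfs0 -r --qos maintenance   # re-replicate ill-replicated blocks (full scan, QoS-throttled)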
>
> Thanks for any insight,
>
> -Bryan
>
> *From:* [email protected]
> <[email protected]> *On Behalf Of* Buterbaugh, Kevin L
> *Sent:* Thursday, September 6, 2018 9:59 AM
> *To:* gpfsug main discussion list <[email protected]>
> *Subject:* Re: [gpfsug-discuss] RAID type for system pool
>
> /Note: External Email/
>
> ------------------------------------------------------------------------
>
> Hi All,
>
> Wow - my query got more responses than I expected, and my sincere
> thanks to all who took the time to respond!
>
> At this point in time we have two GPFS filesystems: one which is
> basically "/home" plus some software installations, and the other
> which is "/scratch" and "/data" (the former backed up, the latter
> not). Both of them have their metadata on SSDs set up as RAID 1
> mirrors and replication set to two. But at this point all of the SSDs
> are in a single storage array (albeit with dual redundant
> controllers), so the storage array itself is my only SPOF.
>
> As part of the hardware purchase we are in the process of making, we
> will be buying a 2nd storage array that can house 2.5" SSDs.
> Therefore, we will be splitting our SSDs between chassis and
> eliminating that last SPOF. Of course, this includes the new SSDs we
> are getting for our new /home filesystem.
>
> Our plan right now is to buy 10 SSDs, which will allow us to test 3
> configurations:
>
> 1) Two 4+1P RAID 5 LUNs split up into a total of 8 LVs (with each of
> my 8 NSD servers as primary for one of those LVs and the other 7 as
> backups) and GPFS metadata replication set to 2.
>
> 2) Four RAID 1 mirrors (which obviously leaves 2 SSDs unused) and GPFS
> metadata replication set to 2. This would mean that only 4 of my 8 NSD
> servers would be a primary.
>
> 3) Nine RAID 0 / bare drives with GPFS metadata replication set to 3
> (which leaves 1 SSD unused). All 8 NSD servers primary for one SSD,
> and 1 serving up two.
>
> The responses I received concerning RAID 5 and performance were not a
> surprise to me. The main advantage that option gives us is the most
> usable storage space for the money (in fact, it gives us far more
> storage space than we currently need), but if it tanks performance,
> then that's a deal breaker.
>
> Personally, I like the four-RAID-1-mirrors config like we've been
> using for years, but it has the disadvantage of giving us the least
> usable storage space. That config would give us the minimum we need
> for right now but doesn't really allow for much future growth.
>
> I have no experience with metadata replication of 3 (but had actually
> thought of that option, so I feel good that others suggested it), so
> option 3 will be a brand new experience for us. It is the most optimal
> in terms of meeting current needs plus allowing for future growth,
> without giving us far more space than we are likely to need. I will be
> curious to see how long it takes GPFS to re-replicate the data when we
> simulate a drive failure, as opposed to how long a RAID rebuild takes.
>
> I am a big believer in Murphy's Law (Sunday I paid off a bill,
> Wednesday my refrigerator died!) and also believe that the definition
> of a pessimist is "someone with experience" <grin>, so we will
> definitely not set GPFS metadata replication to less than two, nor
> will we use non-Enterprise-class SSDs for metadata. But I do still
> appreciate the suggestions.
>
> If there is interest, I will report back on our findings. If anyone
> has any additional thoughts or suggestions, I'd also appreciate
> hearing them.
>
> Again, thank you!
>
> Kevin
>
> --
>
> Kevin Buterbaugh - Senior System Administrator
> Vanderbilt University - Advanced Computing Center for Research and Education
> [email protected] <mailto:[email protected]> - (615) 875-9633
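For what it's worth, the replication-of-3 option described above would look roughly like this in NSD-stanza terms, with each bare SSD in its own failure group so the three metadata copies land on three different drives (NSD, server, device and file system names are invented, and the syntax is from memory rather than tested):

# nsd_meta.stanza -- one bare SSD per NSD, each in its own failure group
%nsd: nsd=meta_ssd_01 device=/dev/sdb servers=nsd01,nsd02 usage=metadataOnly failureGroup=101 pool=system
%nsd: nsd=meta_ssd_02 device=/dev/sdb servers=nsd02,nsd03 usage=metadataOnly failureGroup=102 pool=system
%nsd: nsd=meta_ssd_03 device=/dev/sdb servers=nsd03,nsd04 usage=metadataOnly failureGroup=103 pool=system
# ... remaining SSDs, plus the dataOnly NSDs for the same file system

mmcrnsd -F nsd_meta.stanza             # define the NSDs
mmcrfs gpfs_home -F nsd_meta.stanza -m 3 -M 3 -r 1 -R 2 -T /home
                                       # -m/-M: default/maximum metadata replicas of 3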
--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
