Answers inline based on my recollection of experiences we've had here:
On 9/6/18 12:19 PM, Bryan Banister wrote:
I have questions about how the GPFS metadata replication of 3 works.
1. Is it basically the same as replication of 2 but just have one more
copy, making recovery much more likely?
That's my understanding.
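For reference, a minimal sketch of checking (and raising) the setting,
borrowing the filesystem name from the dump below; flags are from
memory, so verify against your release's docs:

# Show default (-m) and maximum (-M) metadata replicas
mmlsfs mysuperawesomespacefs -m -M

# Raising the default to 3 requires the filesystem to have been
# created with -M 3 (or higher); re-replicate existing metadata after
mmchfs mysuperawesomespacefs -m 3
mmrestripefs mysuperawesomespacefs -R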
2. If there is nothing that is checking that the data was correctly
read off of the device (e.g. CRC checking ON READS like the DDNs do,
T10PI or Data Integrity Field) then how does GPFS handle a corrupted
read of the data?
- unlikely with SSDs, but on an NL-SAS read the head could be
slightly off track and return garbage with no errors and no
automatic retries
The inode itself is checksummed:
# /usr/lpp/mmfs/bin/tsdbfs mysuperawesomespacefs
Enter command or null to read next sector. Type ? for help.
inode 20087366
Inode 20087366 [20087366] snap 0 (index 582 in block 9808):
Inode address: 30:263275078 32:263264838 size 512 nAddrs 32
indirectionLevel=3 status=USERFILE
objectVersion=49352 generation=0x2B519B3 nlink=1
owner uid=8675309 gid=999 mode=0200100600: -rw-------
blocksize code=5 (32 subblocks)
lastBlockSubblocks=1
checksum=0xF2EF3427 is Valid
...
Disk pointers [32]:
0: 31:217629376 1: 30:217632960 2: (null) ...
31: (null)
as are indirect blocks (I'm sure that's not an exhaustive list of
checksummed metadata structures):
ind 31:217629376
Indirect block starting in sector 31:217629376:
magic=0x112DF307 generation=0x2B519B3 blockNum=0 inodeNum=20087366
indirection level=2
checksum=0x6BDAA92A
CalcChecksum(0x5B6DC9FC000, 32768, 20)=0x6BDAA92A
Data pointers:
3. Does GPFS read at least two of the three replicas and compare them
to ensure the data is correct?
- expensive operation, so very unlikely
I don't know, but I do know it verifies the checksum and I believe if
that's wrong it will try another replica.
4. If not reading multiple replicas for comparison, are reads round
robin across all three copies?
I feel like we see a pretty even distribution of reads across all
replicas of our metadata LUNs, although that's measured at the array
level overall, so it may be a red herring.
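If you want to sanity-check the distribution from the GPFS side rather
than the array side, something like this might help (output format
varies by release):

# Dump recent per-I/O history on an NSD server; the disk column shows
# which metadata NSDs are servicing reads
mmdiag --iohist | head -40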
5. If one replica is corrupted (bad blocks) what does GPFS do to
recover this metadata copy? Is this automatic or does this require
a manual `mmrestripefs -c` operation or something?
- If not, seems like a pretty simple idea and maybe an RFE worthy
submission
My experience has been that it will attempt to correct it (and maybe
log an fsstruct error?). That was back in the 3.5 days, though.
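If you want to check whether you've hit any, something like this (log
location is the stock one; mmhealth only exists on newer releases):

# FSSTRUCT errors land in the mmfs log
grep -i fsstruct /var/adm/ras/mmfs.log.latest

# Newer releases also surface them via the health monitor
mmhealth node show FILESYSTEM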
6. Would the idea of an option to run “background scrub/verifies” of
the data/metadata be worthwhile to ensure no hidden bad blocks?
- Using QoS this should be relatively painless
If you don't have array-level background scrubbing, this is what I'd
suggest (e.g. mmrestripefs -c --metadata-only).
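A rough sketch of what that could look like with QoS (the IOPS cap is
illustrative and should be tuned for your arrays; check which release
introduced mmchqos before relying on it):

# Cap maintenance-class I/O so the scrub doesn't starve user traffic
mmchqos mysuperawesomespacefs --enable pool=system,maintenance=500iops

# Then verify checksums/replicas of metadata only
mmrestripefs mysuperawesomespacefs -c --metadata-only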
7. With a drive failure do you have to delete the NSD from the file
system and cluster, recreate the NSD, add it back to the FS, then
again run the `mmrestripefs -c` operation to restore the replication?
- As Kevin mentions, this will end up being a FULL file system scan
rather than a block-based scan and replication. That could take a
long time depending on the number of inodes and type of storage!
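For reference, the sequence described above would look roughly like
this (the NSD name and stanza file are made up):

# Drop the failed disk from the filesystem; GPFS migrates the data
# off using the surviving replicas
mmdeldisk mysuperawesomespacefs md_ssd07

# Remove the NSD definition, replace the drive, recreate the NSD
mmdelnsd md_ssd07
mmcrnsd -F /tmp/md_ssd07.stanza

# Add it back and restore replication/balance -- the inode scan Kevin
# mentions, not a block-level rebuild
mmadddisk mysuperawesomespacefs -F /tmp/md_ssd07.stanza
mmrestripefs mysuperawesomespacefs -r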
Thanks for any insight,
-Bryan
From: [email protected]
<[email protected]> On Behalf Of Buterbaugh, Kevin L
Sent: Thursday, September 6, 2018 9:59 AM
To: gpfsug main discussion list <[email protected]>
Subject: Re: [gpfsug-discuss] RAID type for system pool
------------------------------------------------------------------------
Hi All,
Wow - my query got more responses than I expected and my sincere thanks
to all who took the time to respond!
We currently have two GPFS filesystems … one which is basically
“/home” plus some software installations, and the other which is
“/scratch” and “/data” (the former backed up, the latter not). Both
have their metadata on SSDs set up as RAID 1 mirrors and replication
set to two. But right now all of the SSDs are in a single storage
array (albeit with dual redundant controllers) … so the storage array
itself is my only SPOF.
As part of the hardware purchase we are in the process of making, we
will be buying a second storage array that can house 2.5” SSDs.
Therefore, we will be splitting our SSDs between chassis and
eliminating that last SPOF. Of course, this includes the new SSDs we
are getting for our new /home filesystem.
Our plan right now is to buy 10 SSDs, which will allow us to test 3
configurations:
1) two 4+1P RAID 5 LUNs split up into a total of 8 LVs (with each of
my 8 NSD servers as primary for one of those LVs and the other 7 as
backups) and GPFS metadata replication set to 2.
2) four RAID 1 mirrors (which obviously leaves 2 SSDs unused) and GPFS
metadata replication set to 2. This would mean that only 4 of my 8 NSD
servers would be a primary.
3) nine RAID 0 / bare drives with GPFS metadata replication set to 3
(which leaves 1 SSD unused). All 8 NSD servers would be primary for
one SSD, with one of them serving two.
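For illustration, a hypothetical NSD stanza for option 3 (device
path, NSD name, server names, and failure group are all made up; the
first server listed is the primary):

%nsd: device=/dev/mapper/md_ssd01
  nsd=md_ssd01
  servers=nsd01,nsd02,nsd03,nsd04,nsd05,nsd06,nsd07,nsd08
  usage=metadataOnly
  failureGroup=101
  pool=system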
The responses I received concerning RAID 5 and performance were not a
surprise to me. The main advantage that option gives is the most usable
storage space for the money (in fact, it gives us way more storage space
than we currently need) … but if it tanks performance, then that’s a
deal breaker.
Personally, I like the four RAID 1 mirrors config like we’ve been using
for years, but it has the disadvantage of giving us the least usable
storage space … that config would give us the minimum we need for right
now, but doesn’t really allow for much future growth.
I have no experience with metadata replication of 3 (but had actually
thought of that option, so feel good that others suggested it), so
option 3 will be a brand new experience for us. It is the best fit in
terms of meeting current needs plus allowing for future growth,
without giving us way more space than we are likely to need. I will
be curious
to see how long it takes GPFS to re-replicate the data when we simulate
a drive failure as opposed to how long a RAID rebuild takes.
I am a big believer in Murphy’s Law (Sunday I paid off a bill, Wednesday
my refrigerator died!) … and also believe that the definition of a
pessimist is “someone with experience” <grin> … so we will definitely
not set GPFS metadata replication to less than two, nor will we use
non-Enterprise class SSDs for metadata … but I do still appreciate the
suggestions.
If there is interest, I will report back on our findings. If anyone has
any additional thoughts or suggestions, I’d also appreciate hearing
them. Again, thank you!
Kevin
—
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
[email protected]
<mailto:[email protected]> - (615)875-9633
--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss