It's a good question, Simon. I don't know the answer. At least, when I started composing this e-mail (what, 5 days ago now?) I didn't.

I did a little test using dd to write directly to the NSD (not in production just to be clear...I've got co-workers on this list ;-) ).

Here's a partial dump of the inode prior:
# /usr/lpp/mmfs/bin/tsdbfs fs1 inode 23808
Inode 23808 [23808] snap 0 (index 1280 in block 11):
  Inode address: 1:4207872 2:4207872 size 512 nAddrs 25
  indirectionLevel=INDIRECT status=USERFILE
  objectVersion=103 generation=0x6E256E16 nlink=1
  owner uid=0 gid=0 mode=0200100644: -rw-r--r--
  blocksize code=5 (32 subblocks)
  lastBlockSubblocks=32
  checksum=0xF74A31AA is Valid

This is me writing junk to that sector of the NSD:
# dd if=/dev/urandom bs=512 of=/dev/sda seek=4207872 count=1
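
A word of caution if anyone repeats this: a write through the block device
can sit in the page cache for a bit, so adding oflag=direct makes sure the
junk actually lands on disk right away (assuming a 512-byte logical sector
size):

# dd if=/dev/urandom bs=512 of=/dev/sda seek=4207872 count=1 oflag=direct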

Post-junkifying:
# /usr/lpp/mmfs/bin/tsdbfs fs1 sector 1:4207872
Contents of 1 sector(s) from 1:4207872 = 0x1:403500, width1
0000000000000000: 4FA27C86 5D2076BB 6CD011DE D582F7CE  *O.|.].v.l.......*
0000000000000010: 60A708F1 A3C60FCD 7D796E3D CC97F586  *`.......}yn=....*
0000000000000020: 57B643A7 FABD7235 A2BD9B75 6DDA0771  *W.C...r5...um..q*
0000000000000030: 6A818411 0D59D1D3 2C4C7F39 2B2B529D  *j....Y..,L.9++R.*
0000000000000040: 9AE06C7D A8FB1DC9 7E783DB4 90A9E9E4  *..l}....~x=.....*
0000000000000050: B2D0E9C9 CC7FEBC0 85F23DF8 F18D19C0  *..........=.....*
0000000000000060: DA9C817C D20C0FB2 F30AAF55 C86D4155  *...|.......U.mAU*

Dump of the inode post-junkifying:
# /usr/lpp/mmfs/bin/tsdbfs fs1 inode 23808
Inode 23808 [23808] snap 0 (index 1280 in block 11):
  Inode address: 1:4207872 2:4207872 size 512 nAddrs 0
  indirectionLevel=13 status=4
  objectVersion=5738285791753303739 generation=0x9AE06C7D nlink=3955281023
  owner uid=2121809332 gid=-1867912732 mode=025076616711: prws--s--x
  flags set: exposed illCompressed dataUpdateMissRRPlus metaUpdateMiss
  blocksize code=8 (256 subblocks)
  lastBlockSubblocks=15582
  checksum=0xD582F7CE is INVALID (computed checksum=0x2A2FA283)

Attempts to access the file still succeed, but I get an fsstruct error logged:

# /usr/lpp/mmfs/samples/debugtools/fsstructlx.awk /var/log/messages
09/12@17:38:03 gpfs-adm1 FSSTRUCT fs1 108 FSErrValidate type=inode da=00000001:0000000000403500(1:4207872) sectors=0001 repda=[nVal=2 00000001:0000000000403500(1:4207872) 00000002:0000000000403500(2:4207872)] data=(len=00000200) 4FA27C86 5D2076BB 6CD011DE D582F7CE 60A708F1 A3C60FCD 7D796E3D CC97F586 57B643A7 FABD7235 A2BD9B75 6DDA0771 6A818411 0D59D1D3 2C4C7F39 2B2B529D 9AE06C7D A8FB1DC9 7E783DB4 90A9E9E4 B2D0E9C9 CC7FEBC0 85F23DF8 F18D19C0 DA9C817C D20C0FB2 F30AAF55

It *didn't* automatically repair it, it seems. A read-only restripe did pick it up, though:
# /usr/lpp/mmfs/bin/mmrestripefs fs1 -c --read-only --metadata-only
Scanning file system metadata, phase 1 ...
Inode 0 [fileset 0, snapshot 0 ] has mismatch in replicated disk address 1:4206592 2:4206592
Scan completed successfully.
Scanning file system metadata, phase 2 ...
Scan completed successfully.
Scanning file system metadata, phase 3 ...
Scan completed successfully.
Scanning file system metadata, phase 4 ...
Scan completed successfully.
Scanning user file metadata ...
100.00 % complete on Sun Aug 26 18:10:36 2018 ( 69632 inodes with total 406 MB data processed)
Scan completed successfully.

I ran this to fix it:
# /usr/lpp/mmfs/bin/mmrestripefs fs1 -c --metadata-only

And things appear better afterwards:
# /usr/lpp/mmfs/bin/tsdbfs fs1 inode 23808
Inode 23808 [23808] snap 0 (index 1280 in block 11):
  Inode address: 1:4207872 2:4207872 size 512 nAddrs 25
  indirectionLevel=INDIRECT status=USERFILE
  objectVersion=103 generation=0x6E256E16 nlink=1
  owner uid=0 gid=0 mode=0200100644: -rw-r--r--
  blocksize code=5 (32 subblocks)
  lastBlockSubblocks=32
  checksum=0xF74A31AA is Valid
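
For the truly paranoid, re-dumping the raw sector on both replicas (same
tsdbfs "sector" command as above) would confirm that the on-disk copies
were actually rewritten and not just that the cached inode looks sane:

# /usr/lpp/mmfs/bin/tsdbfs fs1 sector 1:4207872
# /usr/lpp/mmfs/bin/tsdbfs fs1 sector 2:4207872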

This is with 4.2.3-10.

-Aaron

On 9/6/18 1:49 PM, Simon Thompson wrote:
I thought reads were always round robin (in some form) unless you set 
readreplicapolicy.

And I thought with fsstruct you had to use mmfsck offline to fix.

Simon
________________________________________
From: [email protected] 
[[email protected]] on behalf of Aaron Knister 
[[email protected]]
Sent: 06 September 2018 18:06
To: [email protected]
Subject: Re: [gpfsug-discuss] RAID type for system pool

Answers inline based on my recollection of experiences we've had here:

On 9/6/18 12:19 PM, Bryan Banister wrote:
I have questions about how the GPFS metadata replication of 3 works.

  1. Is it basically the same as replication of 2 but with just one more
     copy, making recovery much more likely?

That's my understanding.

  2. If there is nothing that is checking that the data was correctly
     read off of the device (e.g. CRC checking ON READS like the DDNs do,
     T10PI or Data Integrity Field) then how does GPFS handle a corrupted
     read of the data?
     - unlikely with SSD but the head could be off on an NL-SAS read, no
     errors, but you get some garbage instead, plus no auto retries

The inode itself is checksummed:

# /usr/lpp/mmfs/bin/tsdbfs mysuperawesomespacefs
Enter command or null to read next sector.  Type ? for help.
inode 20087366
Inode 20087366 [20087366] snap 0 (index 582 in block 9808):
    Inode address: 30:263275078 32:263264838 size 512 nAddrs 32
    indirectionLevel=3 status=USERFILE
    objectVersion=49352 generation=0x2B519B3 nlink=1
    owner uid=8675309 gid=999 mode=0200100600: -rw-------
    blocksize code=5 (32 subblocks)
    lastBlockSubblocks=1
    checksum=0xF2EF3427 is Valid
...
    Disk pointers [32]:
      0:  31:217629376    1:  30:217632960    2: (null)         ...
     31: (null)

as are indirect blocks (I'm sure that's not an exhaustive list of
checksummed metadata structures):

ind 31:217629376
Indirect block starting in sector 31:217629376:
    magic=0x112DF307 generation=0x2B519B3 blockNum=0 inodeNum=20087366
    indirection level=2
    checksum=0x6BDAA92A
    CalcChecksum(0x5B6DC9FC000, 32768, 20)=0x6BDAA92A
    Data pointers:

  3. Does GPFS read at least two of the three replicas and compares them
     to ensure the data is correct?
     - expensive operation, so very unlikely

I don't know, but I do know it verifies the checksum and I believe if
that's wrong it will try another replica.

  4. If not reading multiple replicas for comparison, are reads round
     robin across all three copies?

I feel like we see pretty even distribution of reads across all replicas
of our metadata LUNs, although this is looking overall at the array
level so it may be a red herring.
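
One way to sanity-check that from the GPFS side rather than the array side
might be mmdiag --iohist on a busy client or NSD server, eyeballing which
metadata NSDs the inode reads land on. The exact column layout varies by
release, so treat this as a sketch:

# /usr/lpp/mmfs/bin/mmdiag --iohist | grep -i inode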

  5. If one replica is corrupted (bad blocks) what does GPFS do to
     recover this metadata copy?  Is this automatic or does this require
     a manual `mmrestripefs -c` operation or something?
     - If not, seems like a pretty simple idea and maybe an RFE worthy
     submission

My experience has been it will attempt to correct it (and maybe log an
fsstruct error?). This was in the 3.5 days, though.

  6. Would the idea of an option to run “background scrub/verifies” of
     the data/metadata be worthwhile to ensure no hidden bad blocks?
     - Using QoS this should be relatively painless

If you don't have array-level background scrubbing, this is what I'd
suggest (e.g. mmrestripefs -c --metadata-only).
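
A QoS-throttled version of that scrub might look roughly like the
following, assuming a filesystem named fs1, a release with QoS support
(4.2.1+, I believe), and an IOPS cap that makes sense for your arrays (the
numbers here are placeholders):

# /usr/lpp/mmfs/bin/mmchqos fs1 --enable pool=system,maintenance=1000IOPS,other=unlimited
# /usr/lpp/mmfs/bin/mmrestripefs fs1 -c --read-only --metadata-only --qos maintenance

The read-only pass just reports mismatches; drop --read-only to have it
repair them.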

  7. With a drive failure do you have to delete the NSD from the file
     system and cluster, recreate the NSD, add it back to the FS, then
     again run the `mmrestripefs -c` operation to restore the replication?
     - As Kevin mentions this will end up being a FULL file system scan
     vs. a block-based scan and replication.  That could take a long time
     depending on number of inodes and type of storage!
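
I'll punt on whether it's strictly required, but the usual dance is
roughly the following (hedging on the exact stanza syntax for your
release; nsd7, /dev/sdX and the server names are made up):

# /usr/lpp/mmfs/bin/mmdeldisk fs1 nsd7
# /usr/lpp/mmfs/bin/mmdelnsd nsd7
# cat newdisk.stanza
%nsd: device=/dev/sdX nsd=nsd7 servers=nsdserver1,nsdserver2 usage=metadataOnly failureGroup=7 pool=system
# /usr/lpp/mmfs/bin/mmcrnsd -F newdisk.stanza
# /usr/lpp/mmfs/bin/mmadddisk fs1 -F newdisk.stanza
# /usr/lpp/mmfs/bin/mmrestripefs fs1 -r

(-r is the restore-replication flavor of the restripe, and yes, as Kevin
mentions, that last step is a scan of the filesystem's metadata rather
than a block-level rebuild.)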

Thanks for any insight,

-Bryan

*From:* [email protected]
<[email protected]> *On Behalf Of *Buterbaugh,
Kevin L
*Sent:* Thursday, September 6, 2018 9:59 AM
*To:* gpfsug main discussion list <[email protected]>
*Subject:* Re: [gpfsug-discuss] RAID type for system pool


Hi All,

Wow - my query got more responses than I expected and my sincere thanks
to all who took the time to respond!

At this point in time we do have two GPFS filesystems … one which is
basically “/home” and some software installations and the other which is
“/scratch” and “/data” (former backed up, latter not).  Both of them
have their metadata on SSDs set up as RAID 1 mirrors and replication set
to two.  But at this point in time all of the SSDs are in a single
storage array (albeit with dual redundant controllers) … so the storage
array itself is my only SPOF.

As part of the hardware purchase we are in the process of making we will
be buying a 2nd storage array that can house 2.5” SSDs.  Therefore, we
will be splitting our SSDs between chassis and eliminating that last
SPOF.  Of course, this includes the new SSDs we are getting for our new
/home filesystem.

Our plan right now is to buy 10 SSDs, which will allow us to test 3
configurations:

1) two 4+1P RAID 5 LUNs split up into a total of 8 LVs (with each of my
8 NSD servers as primary for one of those LVs and the other 7 as
backups) and GPFS metadata replication set to 2.

2) four RAID 1 mirrors (which obviously leaves 2 SSDs unused) and GPFS
metadata replication set to 2.  This would mean that only 4 of my 8 NSD
servers would be a primary.

3) nine RAID 0 / bare drives with GPFS metadata replication set to 3
(which leaves 1 SSD unused).  All 8 NSD servers primary for one SSD and
1 serving up two.
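
For the record, my understanding is that the knob for option 3 is just the
metadata replication setting: -m 3 -M 3 at mmcrfs time for a new
filesystem, or mmchfs -m 3 followed by an mmrestripefs -R on an existing
one (assuming its -M allows it). A sketch, with the filesystem names and
stanza file as placeholders:

# /usr/lpp/mmfs/bin/mmcrfs newfs -F nsd.stanza -m 3 -M 3
# /usr/lpp/mmfs/bin/mmchfs oldfs -m 3
# /usr/lpp/mmfs/bin/mmrestripefs oldfs -R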

The responses I received concerning RAID 5 and performance were not a
surprise to me.  The main advantage that option gives is the most usable
storage space for the money (in fact, it gives us way more storage space
than we currently need) … but if it tanks performance, then that’s a
deal breaker.

Personally, I like the four RAID 1 mirrors config like we’ve been using
for years, but it has the disadvantage of giving us the least usable
storage space … that config would give us the minimum we need for right
now, but doesn’t really allow for much future growth.

I have no experience with metadata replication of 3 (but had actually
thought of that option, so feel good that others suggested it) so option
3 will be a brand new experience for us.  It is the most optimal in
terms of meeting current needs plus allowing for future growth without
giving us way more space than we are likely to need.  I will be curious
to see how long it takes GPFS to re-replicate the data when we simulate
a drive failure as opposed to how long a RAID rebuild takes.

I am a big believer in Murphy’s Law (Sunday I paid off a bill, Wednesday
my refrigerator died!) … and also believe that the definition of a
pessimist is “someone with experience” <grin> … so we will definitely
not set GPFS metadata replication to less than two, nor will we use
non-Enterprise class SSDs for metadata … but I do still appreciate the
suggestions.

If there is interest, I will report back on our findings.  If anyone has
any additional thoughts or suggestions, I’d also appreciate hearing
them.  Again, thank you!

Kevin

—

Kevin Buterbaugh - Senior System Administrator

Vanderbilt University - Advanced Computing Center for Research and Education

[email protected]
<mailto:[email protected]> - (615)875-9633


--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
