So as you already figured out: as long as your 10GbE link is saturated, it will take time to transfer all the data over the wire.

"All the data" is the key here. In older releases / file system versions, we only flagged a "data update miss" or "metadata update miss" in the metadata of a file when something was written to it while some NSDs were missing or down. That means that once all the disks are back "active", we scan the file system metadata to find all files carrying one of these flags and resync their data.

In other words: even if you only changed a few bytes of a 10 TB file, the whole 10 TB file needs to be resynced.

With 4.2 we introduced "rapid repair". (Note: you need to be on that daemon level, and you also need to update your file system version.)

With rapid repair, we not only flag an MD or data update miss, we also set an extra bit for every disk address that has changed. So now, after a site failure (or disk outage), we can still find all files that changed, like before in the good old days, but we also know exactly which disk addresses changed and only need to sync that changed data.

In other words: if you've changed 1 MB of a 10 TB file, only that 1 MB is resynced.

You can check whether rapid repair is in place with the mmlsfs command (enable/disable it with mmchfs).
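A rough sketch of what that check and change might look like; the exact flag output depends on your Scale release, and the file system name gpfs0 is just a placeholder:

```
# show whether rapid repair is enabled on file system "gpfs0"
mmlsfs gpfs0 --rapid-repair

# bring the file system format up to the current daemon level first
# (required before rapid repair can take effect)
mmchfs gpfs0 -V full

# enable (or disable) rapid repair
mmchfs gpfs0 --rapid-repair
mmchfs gpfs0 --norapid-repair
```

These commands need to run on a cluster node with admin authority; check the mmchfs/mmlsfs man pages on your release for the exact option spelling.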

Of course, if everything (every disk address) has changed while your NSDs were down, rapid repair won't help. But depending on your changed-data rate, it will definitely shorten your sync times in the future.
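To put numbers on that, a quick back-of-envelope sketch; the 80% link efficiency and the 1 TB changed-data figure are assumptions for illustration, not measurements:

```python
# Back-of-envelope resync times over a 10 GbE link: full-file resync
# (old behavior) vs. changed-blocks-only (rapid repair).
# link efficiency and changed-data volume are illustrative assumptions.

def resync_hours(data_bytes, link_gbps=10.0, efficiency=0.8):
    """Hours needed to push data_bytes over a link at the given line rate."""
    bytes_per_sec = link_gbps * 1e9 / 8 * efficiency  # usable bytes/second
    return data_bytes / bytes_per_sec / 3600

total_data = 123e12    # ~123 TB replicated at each site (from the thread)
changed_data = 1e12    # assume only ~1 TB actually changed during the outage

print(f"full resync (old behavior):  {resync_hours(total_data):.1f} h")
print(f"rapid repair (changed only): {resync_hours(changed_data):.2f} h")
```

At those assumed numbers, the full resync is a day-plus of saturated 10GbE, while syncing only the changed addresses is well under an hour.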

cheers

Mit freundlichen Grüßen / Kind regards

 
Olaf Weiser

EMEA Storage Competence Center Mainz, Germany / IBM Systems, Storage Platform,
-------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland
IBM Allee 1
71139 Ehningen
Phone: +49-170-579-44-66
E-Mail: [email protected]
-------------------------------------------------------------------------------------------------------------------------------------------




From:        Valdis Kletnieks <[email protected]>
To:        [email protected]
Date:        11/18/2016 08:06 PM
Subject:        [gpfsug-discuss] mmchdisk performance/behavior in a stretch cluster config?
Sent by:        [email protected]




So as a basis for our archive solution, we're using a GPFS cluster
in a stretch configuration, with 2 sites separated by about 20ms worth
of 10G link.  Each end has 2 protocol servers doing NFS and 3 NSD servers.
Identical disk arrays and LTFS/EE at both ends, and all metadata and
userdata are replicated to both sites.

We had a fiber issue for about 8 hours yesterday, and as expected (since there
are only 5 quorum nodes, 3 local and 2 at the far end) the far end fell off the
cluster and down'ed all the NSDs on the remote arrays.

There's about 123T of data at each end, 6 million files in there so far.

So after the fiber came back up after a several-hour downtime, I
did the 'mmchdisk archive start -a'.  That was at 17:45 yesterday.
I'm now 20 hours in, at:

 62.15 % complete on Fri Nov 18 13:52:59 2016  (   4768429 inodes with total  173675926 MB data processed)
 62.17 % complete on Fri Nov 18 13:53:20 2016  (   4769416 inodes with total  173710731 MB data processed)
 62.18 % complete on Fri Nov 18 13:53:40 2016  (   4772481 inodes with total  173762456 MB data processed)

network statistics indicate that the 3 local NSDs are all tossing out
packets at about 400Mbytes/second, which means the 10G pipe is pretty damned
close to totally packed full, and the 3 remotes are sending back ACKs
of all the data.

Rough back-of-envelope calculations indicate that (a) if I'm at 62% after
20 hours, it will take 30 hours to finish and (b) a 10G link takes about
29 hours at full blast to move 123T of data.  So it certainly *looks*
like it's resending everything.

And that's even though at least 100T of that 123T is test data that was
written by one of our users back on Nov 12/13, and thus theoretically *should*
already have been at the remote site.

Any ideas what's going on here?
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss



