Re: [gpfsug-discuss] mmchdisk performance/behavior in a stretch cluster config?

2016-11-18 Thread Olaf Weiser
So, as you already figured out: as long as your 10GbE link is saturated, it will take time to transfer all the data over the wire, and "all the data" is the key phrase here.

In older releases / file system versions, we only flagged a "data update miss" or "metadata update miss" in the metadata of a file when something was written to that file while some NSDs were missing or down. That means that once all the disks are back "active", we scan the file system metadata to find all files with one of these flags and resync their data. So even if you only changed a few bytes of a 10 TB file, the whole 10 TB file has to be resynced.

With 4.2 we introduced "rapid repair" (note: you need to be on that daemon level, and you need to update your file system version). With rapid repair we not only flag the MD or D update miss, we also record an extra bit for every disk address that has changed. So now, after a site failure (or disk outage), we can still find all the files that changed, like before in the good old days, but we also know which disk addresses changed and only need to resync that changed data. In other words: if you changed 1 MB of a 10 TB file, only that 1 MB is resynced.

You can check whether rapid repair is in place with the mmlsfs command (and enable or disable it with mmchfs).
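
Roughly like this, assuming your file system is the "archive" one from your mmchdisk command; please double-check the exact option names against the mmlsfs/mmchfs man pages for your release:

  # show whether rapid repair is enabled for the file system
  mmlsfs archive --rapid-repair

  # if the file system format is still at an older level, upgrade it first
  # (note: mmchfs -V full is irreversible -- older daemons can no longer mount it)
  mmchfs archive -V full

  # then enable rapid repair
  mmchfs archive --rapid-repair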

Of course, if everything (every disk address) has changed while your NSDs were down, rapid repair won't help; but depending on your changed-data rate it will definitely shorten your sync times in the future.

Cheers / Kind regards

Olaf Weiser
EMEA Storage Competence Center Mainz, Germany / IBM Systems, Storage Platform

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] mmchdisk performance/behavior in a stretch cluster config?

2016-11-18 Thread Valdis Kletnieks
So as a basis for our archive solution, we're using a GPFS cluster
in a stretch configuration, with 2 sites separated by about 20ms worth
of 10G link.  Each end has 2 protocol servers doing NFS and 3 NSD servers.
Identical disk arrays and LTFS/EE at both ends, and all metadata and
userdata are replicated to both sites.

We had a fiber issue for about 8 hours yesterday, and as expected (since there
are only 5 quorum nodes, 3 local and 2 at the far end) the far end fell off the
cluster and down'ed all the NSDs on the remote arrays.

There's about 123T of data at each end, 6 million files in there so far.

So after the fiber came back up from the several-hour downtime, I
did the 'mmchdisk archive start -a'.  That was at 17:45 yesterday.
I'm now 20 hours in, at:

  62.15 % complete on Fri Nov 18 13:52:59 2016  (   4768429 inodes with total  173675926 MB data processed)
  62.17 % complete on Fri Nov 18 13:53:20 2016  (   4769416 inodes with total  173710731 MB data processed)
  62.18 % complete on Fri Nov 18 13:53:40 2016  (   4772481 inodes with total  173762456 MB data processed)
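
For context, mmlsdisk gives the per-disk view of the same recovery; a rough sketch, using the 'archive' file system name from above (exact state names can vary a bit by release):

  # list per-disk status and availability for the file system;
  # disks still being caught up generally show availability "recovering"
  # until the mmchdisk ... start completes
  mmlsdisk archive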

Network statistics indicate that the 3 local NSD servers are all tossing out
packets at about 400 Mbytes/second, which means the 10G pipe is pretty damned
close to totally packed full, and the 3 remote ones are sending back ACKs
for all the data.

Rough back-of-envelope calculations indicate that (a) if I'm at 62% after
20 hours, it will take 30 hours to finish and (b) a 10G link takes about
29 hours at full blast to move 123T of data.  So it certainly *looks*
like it's resending everything.
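
For reference, the raw arithmetic behind (b), ignoring protocol overhead and depending on whether that "123T" is decimal or binary terabytes:

  123 TB  ~ 1.23 x 10^14 bytes   (~ 1.35 x 10^14 bytes if it's really TiB)
  10 GbE  = 1.25 x 10^9 bytes/second of raw line rate
  1.23-1.35 x 10^14 bytes / 1.25 x 10^9 bytes/s  ~  98,000-108,000 s  ~  27-30 hours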

And that's even though at least 100T of that 123T is test data that was
written by one of our users back on Nov 12/13, and thus theoretically *should*
already have been at the remote site.

Any ideas what's going on here?
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss