Bjørn,

Thank you for the details. The common consensus seems to be that we need to break up the number of directories/files each node processes/scans. We also seem to need the PROXY NODE mechanism to consolidate access under one node/client, since 5+ nodes will be required to process what is now being attempted through one node.
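For the proxy-node consolidation mentioned above, ISP's GRANT PROXYNODE command lets several agent nodes store data under one target node, so the filer's data stays under a single client name. A sketch of the setup (the node names FILER_TARGET and WORKER1..3 are hypothetical; the share path is the one from the schedule later in this thread):

```
Server side (dsmadmc), allowing the worker nodes to act for the target:

  GRANT PROXYNODE TARGET=FILER_TARGET AGENT=WORKER1,WORKER2,WORKER3

Client side, each worker backing up its slice under the target's name:

  dsmc incremental \\rams.adp.vcu.edu\SOM\TSM\SOMADFS1\* -subdir=yes -asnodename=FILER_TARGET
```

Each worker then needs only its own credentials, while queries and restores all go against FILER_TARGET.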
On Tue, Jul 17, 2018 at 8:05 AM Nachtwey, Bjoern <bjoern.nacht...@gwdg.de> wrote:

> Hi Zoltan,
>
> I will come back to the approach Jonas mentioned (as I'm the author of
> that text: thanks to Jonas for pointing to it ;-) )
>
> The text is in German, of course, but the script has some comments in
> English and should be understandable -- I hope so :-)
>
> The text first describes the problem everybody on this list will know:
> the tree walk takes more time than we have. TSM/ISP has some options to
> speed this up, such as "-incrbydate", but they do not work properly.
>
> So for me the only solution is to parallelize the tree walk and do
> partial incremental backups. I first tried to write it with Bash
> commands, but multithreading was not easy to implement, and it would
> not run on Windows -- yet our largest filers (500 TB - 1.2 PB) need to
> be accessed via CIFS to store the ACL information. My first steps with
> PowerShell on Windows cost a lot of time and were disappointing. Using
> Perl made everything really easy, as it runs on Windows with the
> Strawberry Perl distribution, and within the script only a few
> if-conditions are needed to distinguish between Linux and Windows.
>
> I did some tests on how deep into the file tree to dive: as the
> subfolders are of unequal size, diving just below the mount point and
> parallelizing on the folders of this "first level" mostly does not work
> well; there is (nearly) always one folder taking all the time. On the
> other hand, diving into all levels adds a certain amount of extra time.
>
> I see the best performance using 3 to 4 levels and 4 to 6 parallel
> threads for each node. Because users are separated for accounting, I
> have several nodes on such large file systems, so in total there are
> about 20 to 40 streams in parallel.
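The strategy described above (dive a fixed number of levels below the mount point, then back up each resulting subtree in parallel) can be sketched as follows. This is an illustrative Python sketch, not the actual dsmci tool (which is written in Perl); the `dsmc` invocation and the depth/thread defaults are assumptions taken from the numbers in the mail:

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def partition_tree(root, depth):
    """Collect the directories `depth` levels below `root`.

    Each returned directory becomes one partial-incremental work unit.
    A directory that runs out of subdirectories before `depth` is kept
    as its own work unit so nothing is skipped.
    """
    frontier = [root]
    for _ in range(depth):
        nxt = []
        for d in frontier:
            subdirs = [e.path for e in os.scandir(d)
                       if e.is_dir(follow_symlinks=False)]
            nxt.extend(subdirs if subdirs else [d])
        frontier = nxt
    return frontier

def backup(path):
    # Illustrative: one partial incremental backup per subtree.
    # NOTE: files sitting in directories *above* the chosen depth are not
    # covered by these work units; a production tool must handle them too.
    return subprocess.call(["dsmc", "incremental", path + os.sep, "-subdir=yes"])

def parallel_backup(root, depth=3, threads=6):
    # Threads are fine here: each one mostly waits on a dsmc child process.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(backup, partition_tree(root, depth)))
```

The trade-off Bjørn describes shows up directly in `depth`: too shallow and one huge folder dominates the wallclock time; too deep and the sheer number of `dsmc` start-ups adds overhead.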
>
> Rudi Wüst, mentioned in my text, found that a p520 server running AIX 6
> will support up to 2,000 parallel streams; but, as Grant mentioned, with
> an Isilon system the filer will be the bottleneck.
>
> As mentioned by Del, you may also test a commercial product, "MAGS" by
> general storage; it can address multiple Isilon nodes in parallel.
>
> If there are any questions -- just ask, or have a look at the script:
> https://gitlab.gwdg.de/bnachtw/dsmci
>
> // even if the last commit is about 4 months old, the project is still
> in development ;-)
>
> ==> Maybe I should update and translate the text from the "GWDG News"
> to English? Any interest?
>
> Best
> Bjørn
>
> p.s.
> A result from the wild (weekly backup of a node from a 343 TB Quantum
> StorNext file system):
>
> >>
> Process ID            : 12988
> Path processed        : <removed>
> -------------------------------------------------
> Start time            : 2018-07-14 12:00
> End time              : 2018-07-15 06:07
> total processing time : 3d 15h 59m 23s
> total wallclock time  : 18h 7m 30s
> effective speedup     : 4.855 using 6 parallel threads
> datatransfertime ratio: 3.575 %
> -------------------------------------------------
> Objects inspected     : 92061596
> Objects backed up     : 9774876
> Objects updated       : 0
> Objects deleted       : 0
> Objects expired       : 7696
> Objects failed        : 0
> Bytes inspected       : 52818.242 (GB)
> Bytes transferred     : 5063.620 (GB)
> -------------------------------------------------
> Number of Errors      : 0
> Number of Warnings    : 43
> # of severe Errors    : 0
> # Out-of-Space Errors : 0
> <<
>
> --------------------------------------------------------------------------------------------------
> Bjørn Nachtwey
> Working group "IT-Infrastruktur"
> Tel.: +49 551 201-2181, E-Mail: bjoern.nacht...@gwdg.de
> --------------------------------------------------------------------------------------------------
> Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
> Am Faßberg 11, 37077 Göttingen, URL:
> http://www.gwdg.de
> Tel.: +49 551 201-1510, Fax: +49 551 201-2150, E-Mail: g...@gwdg.de
> Service hotline: Tel.: +49 551 201-1523, E-Mail: supp...@gwdg.de
> Managing Director: Prof. Dr. Ramin Yahyapour
> Chairman of the Supervisory Board: Prof. Dr. Norbert Lossau
> Registered office: Göttingen
> Commercial register: Göttingen, Handelsregister-Nr. B 598
> --------------------------------------------------------------------------------------------------
> Certified according to ISO 9001
> --------------------------------------------------------------------------------------------------
>
> -----Original Message-----
> From: ADSM: Dist Stor Manager <ADSM-L@VM.MARIST.EDU> On Behalf Of Zoltan Forray
> Sent: Wednesday, July 11, 2018 13:50
> To: ADSM-L@VM.MARIST.EDU
> Subject: Re: [ADSM-L] Looking for suggestions to deal with large backups not completing in 24-hours
>
> I will need to translate it to English, but I gather it is talking about
> the RESOURCEUTILIZATION / MAXNUMMP values. While we have increased
> MAXNUMMP to 5 on the server (we will try going higher), I am not sure
> how much good it would do, since the backup schedule uses OBJECTS to
> point to a specific/single mount point/filesystem (see below); but it is
> worth trying to bump the RESOURCEUTILIZATION value on the client even
> higher...
>
> We have checked the dsminstr.log file, and it is spending 92% of the
> time in PROCESS DIRS (no surprise).
>
> 7:46:25 AM SUN : q schedule * ISILON-SOM-SOMADFS1 f=d
>             Policy Domain Name: DFS
>                  Schedule Name: ISILON-SOM-SOMADFS1
>                    Description: ISILON-SOM-SOMADFS1
>                         Action: Incremental
>                      Subaction:
>                        Options: -subdir=yes
>                        Objects: \\rams.adp.vcu.edu\SOM\TSM\SOMADFS1\*
>                       Priority: 5
>                Start Date/Time: 12/05/2017 08:30:00
>                       Duration: 1 Hour(s)
>    Maximum Run Time (Minutes): 0
>                 Schedule Style: Enhanced
>                         Period:
>                    Day of Week: Any
>                          Month: Any
>                   Day of Month: Any
>                  Week of Month: Any
>                     Expiration:
> Last Update by (administrator): ZFORRAY
>          Last Update Date/Time: 01/12/2018 10:30:48
>               Managing profile:
>
> On Tue, Jul 10, 2018 at 4:06 AM Jansen, Jonas <jan...@itc.rwth-aachen.de> wrote:
>
> > It is possible to do a parallel backup of file system parts.
> > https://www.gwdg.de/documents/20182/27257/GN_11-2016_www.pdf (German)
> > -- have a look at page 10.
> >
> > ---
> > Jonas Jansen
> >
> > IT Center
> > Group: Server & Storage
> > Department: Systems & Operations
> > RWTH Aachen University
> > Seffenter Weg 23
> > 52074 Aachen
> > Tel: +49 241 80-28784
> > Fax: +49 241 80-22134
> > jan...@itc.rwth-aachen.de
> > www.itc.rwth-aachen.de
> >
> > -----Original Message-----
> > From: ADSM: Dist Stor Manager <ADSM-L@VM.MARIST.EDU> On Behalf Of Del Hoobler
> > Sent: Monday, July 9, 2018 3:29 PM
> > To: ADSM-L@VM.MARIST.EDU
> > Subject: Re: [ADSM-L] Looking for suggestions to deal with large
> > backups not completing in 24-hours
> >
> > They are a 3rd-party partner that offers an integrated Spectrum
> > Protect solution for large filer backups.
> >
> > Del
> >
> > ----------------------------------------------------
> >
> > "ADSM: Dist Stor Manager" <ADSM-L@VM.MARIST.EDU> wrote on 07/09/2018 09:17:06 AM:
> >
> > > From: Zoltan Forray <zfor...@vcu.edu>
> > > To: ADSM-L@VM.MARIST.EDU
> > > Date: 07/09/2018 09:17 AM
> > > Subject: Re: Looking for suggestions to deal with large backups not
> > > completing in 24-hours
> > > Sent by: "ADSM: Dist Stor Manager" <ADSM-L@VM.MARIST.EDU>
> > >
> > > Thanks, Del. Very interesting. Are they a VAR for IBM?
> > >
> > > I am not sure it would work in the current configuration we are
> > > using to back up ISILON. I have passed the info on.
> > >
> > > BTW, FWIW, when I copied/pasted the info, the Chrome spell-checker
> > > red-flagged "The easy way to incrementally backup billons of
> > > objects" (billions). So if you know anybody at the company, please
> > > pass it on to them.
> > >
> > > On Mon, Jul 9, 2018 at 6:51 AM Del Hoobler <hoob...@us.ibm.com> wrote:
> > >
> > > > Another possible idea is to look at General Storage dsmISI MAGS:
> > > >
> > > > http://www.general-storage.com/PRODUCTS/products.html
> > > >
> > > > Del
> > > >
> > > > "ADSM: Dist Stor Manager" <ADSM-L@VM.MARIST.EDU> wrote on
> > > > 07/05/2018 02:52:27 PM:
> > > >
> > > > > From: Zoltan Forray <zfor...@vcu.edu>
> > > > > To: ADSM-L@VM.MARIST.EDU
> > > > > Date: 07/05/2018 02:53 PM
> > > > > Subject: Looking for suggestions to deal with large backups not
> > > > > completing in 24-hours
> > > > > Sent by: "ADSM: Dist Stor Manager" <ADSM-L@VM.MARIST.EDU>
> > > > >
> > > > > As I have mentioned in the past, we have gone through large
> > > > > migrations to DFS-based storage on EMC ISILON hardware.
As you may recall, we back up
> > > > > these DFS mounts (about 90 at last count) using multiple
> > > > > Windows servers that run multiple ISP nodes (about 30 each),
> > > > > and they access each DFS mount/filesystem via
> > > > > -object=\\rams.adp.vcu.edu\departmentname.
> > > > >
> > > > > This has led to lots of performance issues with backups, and
> > > > > some departments are now complaining that their backups are
> > > > > running into multiple days in some cases.
> > > > >
> > > > > One such case is a department with 2 nodes, each with over
> > > > > 30 million objects. In the past, their backups were able to
> > > > > finish quicker, since they were accessed via dedicated servers
> > > > > and were able to use Journaling to reduce the scan times.
> > > > > Unless things have changed, I believe Journaling is not an
> > > > > option due to how the files are accessed.
> > > > >
> > > > > FWIW, average backups are usually <50k files and <200GB once
> > > > > the scanning finishes...
> > > > >
> > > > > Also, the idea of HSM/SPACEMANAGEMENT has reared its ugly head,
> > > > > since many of these objects haven't been accessed in many
> > > > > years. But as I understand it, that won't work either given our
> > > > > current configuration.
> > > > >
> > > > > Given the current DFS configuration (previously CIFS), what can
> > > > > we do to improve backup performance?
> > > > >
> > > > > So, any and all ideas are up for discussion. There is even
> > > > > discussion of replacing ISP/TSM due to these
> > > > > issues/limitations.
> > > > >
> > > > > --
> > > > > *Zoltan Forray*
> > > > > Spectrum Protect (p.k.a.
TSM) Software & Hardware Administrator
> > > > > Xymon Monitor Administrator | VMware Administrator
> > > > > Virginia Commonwealth University, UCC/Office of Technology Services
> > > > > www.ucc.vcu.edu | zfor...@vcu.edu | 804-828-4807
> > > > > Don't be a phishing victim - VCU and other reputable
> > > > > organizations will never use email to request that you reply
> > > > > with your password, social security number or confidential
> > > > > personal information. For more details visit
> > > > > http://phishing.vcu.edu/

--
*Zoltan Forray*
Spectrum Protect (p.k.a. TSM) Software & Hardware Administrator
Xymon Monitor Administrator | VMware Administrator
Virginia Commonwealth University, UCC/Office of Technology Services
www.ucc.vcu.edu | zfor...@vcu.edu | 804-828-4807
Don't be a phishing victim - VCU and other reputable organizations will never use email to request that you reply with your password, social security number or confidential personal information. For more details visit http://phishing.vcu.edu/
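P.S. on the weekly-backup result Bjørn quoted earlier in this thread: the reported "effective speedup" is simply total processing time divided by wallclock time, and the arithmetic checks out:

```python
# Figures from the quoted report: 3d 15h 59m 23s processing, 18h 7m 30s wallclock.
processing_s = 3 * 86400 + 15 * 3600 + 59 * 60 + 23
wallclock_s = 18 * 3600 + 7 * 60 + 30

speedup = processing_s / wallclock_s   # ~4.855, matching the report
efficiency = speedup / 6               # ~0.81 parallel efficiency with 6 threads
```

An efficiency of roughly 81% with 6 threads fits Bjørn's observation that one oversized subtree (or the filer itself) eventually limits how well the tree walk parallelizes.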