Thank you, Marc. I was just trying to suggest another approach in this email thread.
However, I believe we cannot run mmfind/mmapplypolicy against remote filesystems; they can only be run on the owning cluster? In our clusters, all the GPFS clients are generally in their own compute clusters and mount filesystems from other storage clusters, which I thought was one of the recommended designs. The scripts in the /usr/lpp/mmfs/samples/util folder do work with remote filesystems, and thus on the compute nodes. I was also trying to find something that could be used by ordinary users and not just by the superuser… but I guess none of these tools are meant to be run by a user without superuser privileges.

Regards,
Lohit

On Mar 8, 2019, 3:54 PM -0600, Marc A Kaplan <[email protected]>, wrote:
> Lohit... Any and all of those commands and techniques should still work with newer versions of GPFS.
>
> But mmapplypolicy is the supported command for generating file lists. It uses the GPFS APIs and some parallel processing tricks.
>
> mmfind is a script that makes it easier to write GPFS "policy rules" and runs mmapplypolicy for you.
>
> mmxcp can be used with mmfind (and/or mmapplypolicy) to make it easy to run a cp (or another command) in parallel on those file lists ...
>
> --marc K of GPFS
>
>
> From: [email protected]
> To: gpfsug main discussion list <[email protected]>
> Date: 03/08/2019 10:13 AM
> Subject: Re: [gpfsug-discuss] Follow-up: migrating billions of files
> Sent by: [email protected]
>
> I had to do this twice too. Once I had to copy a 4 PB filesystem as fast as possible, when NSD disk descriptors were corrupted and shutting down GPFS would have meant losing those files forever; the other time was regular maintenance, but I had to copy a similar amount of data in less time.
>
> In both cases, I just used the GPFS-provided util scripts in /usr/lpp/mmfs/samples/util/. These can only be run as root, I believe. I wish I could give them to users to use.
>
> I used a few of those scripts, like tsreaddir, which is really fast at listing all the paths in the directories. It prints the full paths of all files along with their inodes etc. I had modified it to print just the full file paths.
>
> I then take these paths and split them into groups, which get fed as array jobs to the SGE/LSF cluster. Each array job basically uses GNU parallel to run something similar to rsync -avR. The "-R" option creates the directories as given. Of course this worked because I was using the fast private network to transfer between the storage systems. I also know that cp or tar might be faster than rsync, but rsync was convenient and I could always start over without checkpointing or remembering where I had left off.
>
> Similar to what Bill mentioned in the previous email, but I used the GPFS util scripts and basic GNU parallel/rsync, with SGE/LSF to submit jobs to the cluster as superuser. It worked pretty well.
>
> Since then I constantly use parallel and rsync to copy large directories.
>
> Thank you,
> Lohit
>
> On Mar 8, 2019, 7:43 AM -0600, William Abbott <[email protected]>, wrote:
> We had a similar situation and ended up using parsyncfp, which generates multiple parallel rsyncs based on file lists. If they're on the same IB fabric (as ours were) you can use that instead of Ethernet, and it worked pretty well. One caveat is that you need to follow the parallel transfers with a final single rsync, so you can use --delete.
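[As a concrete illustration of the mmapplypolicy-based file listing Marc describes above: a minimal sketch, which does have to be run with root privileges on a node in the owning cluster (exactly the limitation Lohit raises at the top of this message). The policy file name and output prefix are hypothetical, and the exact options can vary by Scale release.]

# list-everything.pol -- trivial policy rules: an external list with an empty
# EXEC so that, together with -I defer, mmapplypolicy only writes the list
cat > /tmp/list-everything.pol <<'EOF'
RULE EXTERNAL LIST 'allfiles' EXEC ''
RULE 'listall' LIST 'allfiles'
EOF

# /gpfs/fs0 is a placeholder filesystem path; -f sets the output file prefix,
# -I defer generates the file list without invoking any external command
mmapplypolicy /gpfs/fs0 -P /tmp/list-everything.pol -f /tmp/flist -I defer

# the result lands in /tmp/flist.list.allfiles, one record per file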
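[And a minimal sketch of the batching that Lohit describes above, and that parsyncfp automates: split a flat list of paths into chunks and run one rsync per chunk in parallel. The file names, destination host, and chunk sizes are placeholders; it assumes GNU parallel, key-based ssh, and a path list such as tsreaddir output or the policy list above trimmed to one absolute path per line. rsync's --files-from is used here instead of passing each path as an argument to rsync -avR.]

# allpaths.txt: one absolute source path per line (hypothetical file name)
split -l 100000 allpaths.txt chunk.

# one rsync per chunk, at most 8 running at a time; with --files-from the
# listed paths are taken relative to the source argument "/", so absolute
# paths copy as-is, and --relative recreates the directory tree on the target
parallel -j 8 rsync -a --relative --files-from={} / target-nsd01:/gpfs/ess1/ ::: chunk.*

# as Bill notes above, follow up with one final plain rsync pass for --delete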
> For the initial transfer you can also use bbcp. It can get very good performance but isn't nearly as convenient as rsync for subsequent transfers. The performance isn't good with small files, but you can use tar on both ends to deal with that, in a similar way to what Uwe suggests below. The bbcp documentation outlines how to do that.
>
> Bill
>
> On 3/6/19 8:13 AM, Uwe Falke wrote:
> Hi, in that case I'd open several tar pipes in parallel, maybe using carefully selected directories, like
>
> tar -c <source_dir> | ssh <target_host> "tar -x"
>
> I am not quite sure whether "-C /" for tar works here ("tar -C / -x"), but something along these lines should be a good, efficient method. The target hosts should be all nodes having the target file system mounted, and you should start those pipes on the nodes with the source file system. It is best to start with the largest directories, and to use some master script to start the tar pipes, controlled by semaphores, so as not to overload anything.
>
>
> Mit freundlichen Grüßen / Kind regards
>
> Dr. Uwe Falke
> IT Specialist
> High Performance Computing Services / Integrated Technology Services / Data Center Services
> IBM Deutschland, Rathausstr. 7, 09111 Chemnitz
> Phone: +49 371 6978 2165 / Mobile: +49 175 575 2877
> E-Mail: [email protected]
> IBM Deutschland Business & Technology Services GmbH / Managing directors: Thomas Wolter, Sven Schooß
> Registered office: Ehningen / Registration court: Amtsgericht Stuttgart, HRB 17122
>
>
> From: "Oesterlin, Robert" <[email protected]>
> To: gpfsug main discussion list <[email protected]>
> Date: 06/03/2019 13:44
> Subject: [gpfsug-discuss] Follow-up: migrating billions of files
> Sent by: [email protected]
>
> Some of you had questions about my original post. More information:
>
> Source:
> - Files are straight GPFS/POSIX - no extended NFSv4 ACLs
> - A solution that requires $'s to be spent on software (i.e., Aspera) isn't a very viable option
> - Both source and target clusters are in the same DC
> - Source is stand-alone NSD servers (bonded 10g-E) and 8gb FC SAN storage
> - Approx 40 file systems, a few large ones with 300M-400M files each, others smaller
> - No independent filesets
> - Migration must pose minimal disruption to existing users
>
> Target architecture is a small number of file systems (2-3) on ESS with independent filesets
> - Target (ESS) will have multiple 40gb-E links on each NSD server (GS4)
>
> My current thinking is AFM with a pre-populate of the file space: switch the clients over to have them pull the data they need (most of the data is older and less active), and then let AFM populate the rest in the background.
> > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Furldefense.proofpoint.com%2Fv2%2Furl%3Fu%3Dhttp-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss%26d%3DDwICAg%26c%3Djf_iaSHvJObTbx-siA1ZOg%26r%3DfTuVGtgq6A14KiNeaGfNZzOOgtHW5Lm4crZU6lJxtB8%26m%3DJ5RpIj-EzFyU_dM9I4P8SrpHMikte_pn9sbllFcOvyM%26s%3DfEwDQyDSL7hvOVPbg_n8o_LDz-cLqSI6lQtSzmhaSoI%26e&data=02%7C01%7Cbabbott%40rutgers.edu%7C8cbda3d651584119393808d6a2358544%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C636874748092821399&sdata=W06i8IWqrxgEmdp3htxad0euiRhA6%2Bexd3YAziSrUhg%3D&reserved=0= > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://nam02.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7Cbabbott%40rutgers.edu%7C8cbda3d651584119393808d6a2358544%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C636874748092821399&sdata=Pjf4RhUchThoFvWI7hLJO4eWhoTXnIYd9m7Mvf809iE%3D&reserved=0 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
