Thank you, Marc. I was just trying to suggest another approach in this email thread.
However, I believe we cannot run mmfind/mmapplypolicy against remote filesystems; they can only be run on the owning cluster? In our clusters, all the GPFS clients are generally in their own compute clusters and mount filesystems from other storage clusters, which I thought was one of the recommended designs. The scripts in the /usr/lpp/mmfs/samples/util folder do work with remote filesystems, and thus on the compute nodes. I was also trying to find something that could be used by ordinary users and not just by the superuser… but I guess none of these tools are meant to be run by a user without superuser privileges.

Regards,
Lohit

On Mar 8, 2019, 3:54 PM -0600, Marc A Kaplan <[email protected]>, wrote:
> Lohit... Any and all of those commands and techniques should still work with newer versions of GPFS.
>
> But mmapplypolicy is the supported command for generating file lists. It uses the GPFS APIs and some parallel processing tricks.
>
> mmfind is a script that makes it easier to write GPFS "policy rules" and runs mmapplypolicy for you.
>
> mmxcp can be used with mmfind (and/or mmapplypolicy) to make it easy to run a cp (or another command) in parallel on those file lists ...
>
> --marc K of GPFS
>
>
> From: [email protected]
> To: gpfsug main discussion list <[email protected]>
> Date: 03/08/2019 10:13 AM
> Subject: Re: [gpfsug-discuss] Follow-up: migrating billions of files
> Sent by: [email protected]
>
> I had to do this twice too. Once I had to copy a 4 PB filesystem as fast as possible, when NSD disk descriptors were corrupted and shutting down GPFS would have meant losing those files forever; the other time was regular maintenance, but I had to copy a similar amount of data in less time.
>
> In both cases, I just used the GPFS-provided util scripts in /usr/lpp/mmfs/samples/util/. These can only be run as root, I believe. I wish I could give them to users to use.
>
> I used a few of those scripts, like tsreaddir, which is really fast at listing all the paths in the directories. It prints the full paths of all files along with their inodes etc. I had modified it to print just the full file paths.
>
> I then take these paths and split them into groups, which get fed as array jobs to the SGE/LSF cluster. Each array job basically uses GNU parallel to run something similar to rsync -avR. The "-R" option creates the directories as given. Of course this worked because I was using the fast private network to transfer between the storage systems. I also know that cp or tar might be faster than rsync, but rsync was convenient and I could always start over without checkpointing or remembering where I had left off.
>
> Similar to what Bill mentioned in the previous email, but I used the GPFS util scripts and basic GNU parallel/rsync, with SGE/LSF to submit jobs to the cluster as superuser. It worked pretty well.
>
> Since then I constantly use parallel and rsync to copy large directories.
>
> Thank you,
> Lohit
>
> On Mar 8, 2019, 7:43 AM -0600, William Abbott <[email protected]>, wrote:
> We had a similar situation and ended up using parsyncfp, which generates multiple parallel rsyncs based on file lists. If they're on the same IB fabric (as ours were) you can use that instead of Ethernet, and it worked pretty well. One caveat is that you need to follow the parallel transfers with a final single rsync, so you can use --delete.
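[As a concrete illustration of the mmapplypolicy-based file listing Marc describes above: a minimal sketch, which does have to be run with root privileges on a node in the owning cluster (exactly the limitation Lohit raises at the top of this message). The policy file name and output prefix are hypothetical, and the exact options can vary by Scale release.]

# list-everything.pol -- trivial policy rules: an external list with an empty
# EXEC so that, together with -I defer, mmapplypolicy only writes the list
cat > /tmp/list-everything.pol <<'EOF'
RULE EXTERNAL LIST 'allfiles' EXEC ''
RULE 'listall' LIST 'allfiles'
EOF

# /gpfs/fs0 is a placeholder filesystem path; -f sets the output file prefix,
# -I defer generates the file list without invoking any external command
mmapplypolicy /gpfs/fs0 -P /tmp/list-everything.pol -f /tmp/flist -I defer

# the result lands in /tmp/flist.list.allfiles, one record per file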
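[And a minimal sketch of the batching that Lohit describes above, and that parsyncfp automates: split a flat list of paths into chunks and run one rsync per chunk in parallel. The file names, destination host, and chunk sizes are placeholders; it assumes GNU parallel, key-based ssh, and a path list such as tsreaddir output or the policy list above trimmed to one absolute path per line. rsync's --files-from is used here instead of passing each path as an argument to rsync -avR.]

# allpaths.txt: one absolute source path per line (hypothetical file name)
split -l 100000 allpaths.txt chunk.

# one rsync per chunk, at most 8 running at a time; with --files-from the
# listed paths are taken relative to the source argument "/", so absolute
# paths copy as-is, and --relative recreates the directory tree on the target
parallel -j 8 rsync -a --relative --files-from={} / target-nsd01:/gpfs/ess1/ ::: chunk.*

# as Bill notes above, follow up with one final plain rsync pass for --delete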
> For the initial transfer you can also use bbcp. It can get very good performance but isn't nearly as convenient as rsync for subsequent transfers. The performance isn't good with small files, but you can use tar on both ends to deal with that, in a similar way to what Uwe suggests below. The bbcp documentation outlines how to do that.
>
> Bill
>
> On 3/6/19 8:13 AM, Uwe Falke wrote:
> Hi, in that case I'd open several tar pipes in parallel, maybe using carefully selected directories, like
>
> tar -c <source_dir> | ssh <target_host> "tar -x"
>
> I am not quite sure whether "-C /" for tar works here ("tar -C / -x"), but something along these lines should be a good, efficient method. The target hosts should be all nodes having the target file system mounted, and you should start those pipes on the nodes with the source file system. It is best to start with the largest directories, and to use some master script to start the tar pipes, controlled by semaphores, so as not to overload anything.
>
>
> Mit freundlichen Grüßen / Kind regards
>
> Dr. Uwe Falke
> IT Specialist
> High Performance Computing Services / Integrated Technology Services / Data Center Services
> IBM Deutschland, Rathausstr. 7, 09111 Chemnitz
> Phone: +49 371 6978 2165 / Mobile: +49 175 575 2877
> E-Mail: [email protected]
> IBM Deutschland Business & Technology Services GmbH / Managing directors: Thomas Wolter, Sven Schooß
> Registered office: Ehningen / Registration court: Amtsgericht Stuttgart, HRB 17122
>
>
> From: "Oesterlin, Robert" <[email protected]>
> To: gpfsug main discussion list <[email protected]>
> Date: 06/03/2019 13:44
> Subject: [gpfsug-discuss] Follow-up: migrating billions of files
> Sent by: [email protected]
>
> Some of you had questions about my original post. More information:
>
> Source:
> - Files are straight GPFS/POSIX - no extended NFSv4 ACLs
> - A solution that requires $'s to be spent on software (i.e., Aspera) isn't a very viable option
> - Both source and target clusters are in the same DC
> - Source is stand-alone NSD servers (bonded 10g-E) and 8gb FC SAN storage
> - Approx 40 file systems, a few large ones with 300M-400M files each, others smaller
> - No independent filesets
> - Migration must pose minimal disruption to existing users
>
> Target architecture is a small number of file systems (2-3) on ESS with independent filesets
> - Target (ESS) will have multiple 40gb-E links on each NSD server (GS4)
>
> My current thinking is AFM with a pre-populate of the file space: switch the clients over to have them pull the data they need (most of the data is older and less active), and then let AFM populate the rest in the background.
> > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Furldefense.proofpoint.com%2Fv2%2Furl%3Fu%3Dhttp-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss%26d%3DDwICAg%26c%3Djf_iaSHvJObTbx-siA1ZOg%26r%3DfTuVGtgq6A14KiNeaGfNZzOOgtHW5Lm4crZU6lJxtB8%26m%3DJ5RpIj-EzFyU_dM9I4P8SrpHMikte_pn9sbllFcOvyM%26s%3DfEwDQyDSL7hvOVPbg_n8o_LDz-cLqSI6lQtSzmhaSoI%26e&data=02%7C01%7Cbabbott%40rutgers.edu%7C8cbda3d651584119393808d6a2358544%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C636874748092821399&sdata=W06i8IWqrxgEmdp3htxad0euiRhA6%2Bexd3YAziSrUhg%3D&reserved=0= > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://nam02.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7Cbabbott%40rutgers.edu%7C8cbda3d651584119393808d6a2358544%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C636874748092821399&sdata=Pjf4RhUchThoFvWI7hLJO4eWhoTXnIYd9m7Mvf809iE%3D&reserved=0 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
