
Re: [gpfsug-discuss] Follow-up: migrating billions of files

2019-03-08 Thread valleru

Thank you Marc. I was just trying to suggest another approach to this email 
thread.

However, I believe we cannot run mmfind/mmapplypolicy against remote 
filesystems, and that they can only be run on the owning cluster? In our 
clusters, all the GPFS clients are generally in their own compute clusters 
and mount filesystems from other storage clusters, which I thought was one of 
the recommended designs.
The scripts in the /usr/lpp/mmfs/samples/util folder do work with remote 
filesystems, and thus on the compute nodes.

I was also trying to find something that could be used by ordinary users and 
not only by the superuser, but I guess none of these tools are meant to be 
run by a user without superuser privileges.

Regards,
Lohit

On Mar 8, 2019, 3:54 PM -0600, Marc A Kaplan wrote:
> Lohit... Any and all of those commands and techniques should still work with 
> newer versions of GPFS.
>
> But mmapplypolicy is the supported command for generating file lists.  It 
> uses the GPFS APIs and some parallel processing tricks.
>
> mmfind is a script that makes it easier to write GPFS "policy rules" and runs 
> mmapplypolicy for you.
>
> mmxcp can be used with mmfind (and/or mmapplypolicy) to make it easy to run a 
> cp (or other command) in parallel on those file lists ...
>
> --marc K of GPFS

Re: [gpfsug-discuss] Follow-up: migrating billions of files

2019-03-08 Thread valleru

I had to do this twice too. Once I had to copy a 4 PB filesystem as fast as 
possible when NSD disk descriptors were corrupted and shutting down GPFS would 
have led to me losing those files forever; the other time was regular 
maintenance, but I had to copy a similar amount of data in less time.

In both cases, I just used the GPFS-provided util scripts in 
/usr/lpp/mmfs/samples/util/. These can be run only as root, I believe. I 
wish I could give them to users to use.

I used a few of those scripts, like tsreaddir, which was really fast at 
listing all the paths in the directories. It prints the full paths of all 
files along with their inodes etc. I modified it to print just the full file 
paths.

I then took these paths and split them into groups, which were fed 
into array jobs on the SGE/LSF cluster.
Each array job basically used GNU parallel to run something similar to 
rsync -avR . The "-R" option basically creates the directories as given.
Of course this worked because I was using the fast private network to transfer 
between the storage systems. Also, I know that cp or tar might be better than 
rsync with respect to speed, but rsync was convenient and I could always start 
over again without checkpointing or remembering where I left off previously.
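
The grouping step described above can be sketched roughly as follows. This is 
only an illustration under assumptions, not the scripts that were actually 
used: the function names split_list and copy_chunk, the chunk file naming, and 
the /gpfs/src and /gpfs/dst paths are all hypothetical, and plain 
rsync --files-from stands in for the GNU parallel wrapper. It assumes the file 
list holds one path per line, relative to the source root, and that no file 
name contains a newline.

```shell
#!/bin/sh
# split_list LIST N PREFIX
# Deal the lines of LIST round-robin into PREFIX.0 ... PREFIX.(N-1),
# so every chunk gets a similar number of paths.
split_list() {
    awk -v n="$2" -v p="$3" '{ print > (p "." ((NR - 1) % n)) }' "$1"
}

# copy_chunk CHUNK SRC_ROOT DST_ROOT
# Paths in CHUNK are relative to SRC_ROOT. rsync's --files-from implies -R,
# so the directory structure is recreated under DST_ROOT.
# Re-running the same chunk is safe: rsync skips files already up to date.
copy_chunk() {
    rsync -a --files-from="$1" "$2/" "$3/"
}

# In an array job, each task would pick its own chunk, for example
# (hypothetical paths; SGE task IDs start at 1, LSF exposes LSB_JOBINDEX):
#   copy_chunk "chunks/list.$((SGE_TASK_ID - 1))" /gpfs/src /gpfs/dst
```

Round-robin dealing keeps the chunks even in file count, though not 
necessarily in bytes, so a few chunks may still run much longer than others.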

This is similar to what Bill mentioned in the previous email, but I used the 
GPFS util scripts and basic GNU parallel/rsync, with SGE/LSF to submit jobs to 
the cluster as superuser. It used to work pretty well.

Since then, I constantly use parallel and rsync to copy large directories.

Thank you,
Lohit


Re: [gpfsug-discuss] Follow-up: migrating billions of files

2019-03-08 Thread William Abbott

We had a similar situation and ended up using parsyncfp, which generates 
multiple parallel rsyncs based on file lists.  If they're on the same IB 
fabric (as ours were) you can use that instead of ethernet, and it 
worked pretty well.  One caveat is that you need to follow the parallel 
transfers with a final single rsync, so you can use --delete.
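
A minimal sketch of that final pass, assuming rsync is installed on the node 
that sees both file systems. The function name final_sync and the DRY_RUN 
switch are hypothetical, added only for illustration. Presumably --delete is 
left to this single pass because each parallel rsync only sees part of the 
tree.

```shell
#!/bin/sh
# final_sync SRC DST
# One last single rsync pass: picks up anything the parallel runs missed and,
# with --delete, removes files from DST that no longer exist in SRC.
# Set DRY_RUN=1 to print the command instead of running it.
final_sync() {
    src=${1%/}/      # trailing slash: copy the contents of SRC, not SRC itself
    dst=${2%/}/
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo rsync -a --delete "$src" "$dst"
    else
        rsync -a --delete "$src" "$dst"
    fi
}
```

Printing the command first is a cheap safeguard, since --delete with the 
source and target swapped would remove data from the wrong side.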

For the initial transfer you can also use bbcp.  It can get very good 
performance but isn't nearly as convenient as rsync for subsequent 
transfers.  The performance isn't good with small files but you can use 
tar on both ends to deal with that, in a similar way to what Uwe 
suggests below.  The bbcp documentation outlines how to do that.

Bill

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

Re: [gpfsug-discuss] Follow-up: migrating billions of files

2019-03-07 Thread Jonathan Buzzard


As it has not been mentioned yet: "dsmc restore", or the equivalent for 
your backup solution.

JAB.

-- 
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG




Re: [gpfsug-discuss] Follow-up: migrating billions of files

2019-03-07 Thread Uwe Falke

As for "making sure a subjob doesn't finish right after you go home 
leaving a slot idle for several hours": 
that's the reason for the master script / control script / whatever. 
There would be a list of directories sorted by decreasing size. 
The master script would keep a counter for each participating source host 
(a semaphore) and start as many parallel copy jobs as the counter allows, 
each with the currently topmost directory in the list, removing that 
directory from the list (ideally moving it to an intermediary "in-work" 
list) and counting down the semaphore on each start, unless it is already 0. 
As soon as a job returns successfully, count the semaphore back up, and if 
it is >0, start the next job, and so on. I suppose you can 
easily run about 8 to 12 such jobs per server (maybe best to use 
dedicated source server - dest server pairs). So, no worries about 
leaving at any time WRT jobs ending and idle job slots. 

Of course, some precautions should be taken to ensure each job succeeds 
and gets repeated if not, and a lot of logging should take place to be 
sure you know what has happened.
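
A rough sketch of such a master script, under stated assumptions. The names 
largest_first and run_queue, the log format, and the use of a FIFO holding one 
token per job slot as the semaphore are all illustrative choices, not anything 
prescribed in this thread. It keeps a single counter rather than one per 
source host, and it only logs failures instead of repeating them. It relies on 
opening a FIFO read/write, which works on Linux.

```shell
#!/bin/sh
# Print the given directories, largest first (sizes from du, in KiB).
largest_first() {
    du -sk "$@" | sort -rn | cut -f2-
}

# run_queue MAX_JOBS LOGFILE COPY_CMD DIR...
# Runs COPY_CMD DIR for every DIR, with at most MAX_JOBS running at once.
run_queue() {
    max_jobs=$1; log=$2; copy_cmd=$3
    shift 3
    sem=$(mktemp -u) && mkfifo "$sem" || return 1
    exec 9<>"$sem"            # open read/write so the open itself never blocks
    rm -f "$sem"
    i=0
    while [ "$i" -lt "$max_jobs" ]; do   # one token per job slot
        echo >&9
        i=$((i + 1))
    done
    for dir in "$@"; do
        read -r token <&9     # take a token; blocks while all slots are busy
        {
            if "$copy_cmd" "$dir"; then
                echo "OK $dir"
            else
                echo "FAIL $dir"         # collect these for a retry pass
            fi >>"$log"
            echo >&9          # give the token back when the job ends
        } &
    done
    wait                      # let the last jobs finish
    exec 9>&-
}
```

Because a new job starts the moment any running job returns its token, no 
slot sits idle while directories remain in the list. The FAIL lines in the 
log can be fed back into run_queue for a second pass.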


 
Kind regards

 
Dr. Uwe Falke
 
IT Specialist
High Performance Computing Services / Integrated Technology Services / 
Data Center Services
---
IBM Deutschland
Rathausstr. 7
09111 Chemnitz
Phone: +49 371 6978 2165
Mobile: +49 175 575 2877
E-Mail: uwefa...@de.ibm.com
---
IBM Deutschland Business & Technology Services GmbH / Management: 
Thomas Wolter, Sven Schooß
Registered office: Ehningen / Registration court: Amtsgericht Stuttgart, 
HRB 17122 





Re: [gpfsug-discuss] Follow-up: migrating billions of files

2019-03-06 Thread Stephen Ulmer

In the case where tar -C doesn’t work, you can always use a subshell (I do this 
regularly):

tar -cf - . | ssh someguy@otherhost "(cd targetdir; tar -xvf - )"

Only use -v on one end. :)

Also, for parallel work that’s not designed that way, don't underestimate the 
-P option to GNU and BSD xargs! With the amount of stuff to be copied, making 
sure a subjob doesn’t finish right after you go home leaving a slot idle for 
several hours is a medium deal.
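
As a sketch of the xargs -P idea, assuming an xargs that supports -0 and -P 
and a find that supports -print0 (GNU and BSD both do). The function name 
parallel_tar_copy is hypothetical, and the copy is local here so it can be 
tried without ssh. It handles only the subdirectories directly under the 
source, not plain files at the top level.

```shell
#!/bin/sh
# parallel_tar_copy SRC DST [NJOBS]
# Copy each top-level subdirectory of SRC into DST, running up to NJOBS
# tar pipes at once (default 4). DST must already exist.
parallel_tar_copy() {
    src=$1
    dst=$2
    njobs=${3:-4}
    # List immediate subdirectories, NUL-separated so odd names survive.
    (cd "$src" && find . -mindepth 1 -maxdepth 1 -type d -print0) |
        xargs -0 -P "$njobs" -I{} sh -c '
            tar -C "$1" -cf - "$3" | tar -C "$2" -xf -
        ' sh "$src" "$dst" {}
}
```

xargs starts the next directory as soon as any of the running pipes exits, 
which is what keeps the slots busy without a hand-written scheduler.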

In Bob’s case, however, treating it like a DR exercise where users "restore" 
their own files by accessing them (using AFM instead of HSM) is probably the 
most convenient.

-- 
Stephen




Re: [gpfsug-discuss] Follow-up: migrating billions of files

2019-03-06 Thread Uwe Falke

Hi, in that case I'd open several tar pipes in parallel, maybe using 
directories carefully selected, like 

  tar -c  | ssh   "tar -x"

I am not quite sure whether "-C /" for tar works here ("tar -C / -x"), but 
something along these lines might be a good, efficient method. target_hosts 
should be all nodes having the target file system mounted, and you should 
start those pipes on the nodes with the source file system. 
It is best to start with the largest directories, and use some 
master script to start the tar pipes, controlled by semaphores so as not to 
overload anything. 
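
One way to check the "-C" question is to try the pipe locally first. A 
minimal sketch, assuming GNU or BSD tar, both of which accept -C to change 
directory before archiving or extracting. The function name tar_pipe and the 
target_host placeholder in the comment are hypothetical.

```shell
#!/bin/sh
# tar_pipe SRC DST
# Copy the contents of directory SRC into the existing directory DST through
# a tar pipe. -C switches directory first, so archived paths stay relative.
tar_pipe() {
    src=$1
    dst=$2
    tar -C "$src" -cf - . | tar -C "$dst" -xpf -
}

# Remote variant (not run here; target_host is a placeholder):
#   tar -C "$src" -cf - . | ssh target_host "tar -C '$dst' -xpf -"
```

Archiving "." relative to the source directory avoids absolute paths in the 
stream, so the receiving side decides where the tree lands.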


 
Kind regards

Dr. Uwe Falke
 

[gpfsug-discuss] Follow-up: migrating billions of files

2019-03-06 Thread Oesterlin, Robert

Some of you had questions about my original post. More information:

Source:
- Files are straight GPFS/POSIX - no extended NFSv4 ACLs
- A solution that requires $'s to be spent on software (i.e., Aspera) isn't a 
very viable option
- Both source and target clusters are in the same DC
- Source is stand-alone NSD servers (bonded 10g-E) and 8gb FC SAN storage
- Approx 40 file systems, a few large ones with 300M-400M files each, others 
smaller
- no independent file sets
- migration must pose minimal disruption to existing users

Target architecture is a small number of file systems (2-3) on ESS with 
independent filesets
- Target (ESS) will have multiple 40gb-E links on each NSD server (GS4)

My current thinking is AFM with a pre-populate of the file space, then switch 
the clients over and have them pull the data they need (most of the data is 
older and less active), and then let AFM populate the rest in the background.


Bob Oesterlin
Sr Principal Storage Engineer, Nuance

