Never mind. Figured it out.
On Fri, Oct 12, 2012 at 3:20 PM, kojie.fu <kojie...@gmail.com> wrote:

> kojie.fu
>
> From: Rita
> Date: 2012-10-13 03:19
> To: common-user
> Subject: Re: distcp question
>
> Thanks for the advice.
>
> Before I push or pull, are there any tests I can run before I do the
> distcp? I am not 100% sure I have my webhdfs set up properly.
>
> On Fri, Oct 12, 2012 at 1:01 PM, J. Rottinghuis <jrottingh...@gmail.com>
> wrote:
>
> > Rita,
> >
> > Are you doing a push from the source cluster or a pull from the target
> > cluster?
> >
> > Doing a pull with distcp using hftp (to accommodate version differences)
> > has the advantage of slightly fewer transfers of blocks over the TORs.
> > Each block is read from exactly the datanode where it is located, and on
> > the target side (where the mappers run) the first write is to the local
> > datanode. With RF=3 each block transfers out of the source TOR, into the
> > target TOR, out of the first target-cluster TOR into a different
> > target-cluster TOR for replicas 2 & 3. Overall 2 times out, and 2 times
> > in.
> >
> > Doing a pull with webhdfs://, the proxy server has to collect all blocks
> > from the source DNs, then they get pulled to the target machine. The
> > situation is similar to the above, with one extra transfer of all data
> > going through the "proxy" server.
> >
> > Doing a push with webhdfs:// on the target cluster side, the mapper has
> > to collect all blocks from one or more files (depending on # mappers
> > used) and send them to the proxy server, which then writes blocks to the
> > target cluster. The advantage on the target cluster is that the blocks of
> > a large multi-block file get spread over different datanodes on the
> > target side. But if I'm counting correctly, you'll have the most data
> > transfer: out of each source DN, through the source cluster mapper DN,
> > through the target proxy server, to the target DN, and out/in again for
> > replicas 2 & 3.
> >
> > So convenience and setup aside, I think the first option would involve
> > the fewest network transfers.
> > Now if your clusters are separated over a WAN, then this may not matter
> > at all.
> >
> > Just something to think about.
> >
> > Cheers,
> >
> > Joep
> >
> > On Fri, Oct 12, 2012 at 8:37 AM, Harsh J <ha...@cloudera.com> wrote:
> >
> > > Rita,
> > >
> > > I believe, per the implementation, that webhdfs:// URIs should work
> > > fine. Please give it a try and let us know.
> > >
> > > On Fri, Oct 12, 2012 at 7:14 PM, Rita <rmorgan...@gmail.com> wrote:
> > > > I have 2 different versions of Hadoop running. I need to copy a
> > > > significant amount of data (100 TB) from one cluster to another. I
> > > > know distcp is the way to do it. On the target cluster I have
> > > > webhdfs running. Would that work?
> > > >
> > > > The DistCp manual says I need to use "HftpFileSystem". Is that
> > > > necessary or will webhdfs do the task?
> > > >
> > > > --
> > > > --- Get your facts first, then you can distort them as you please.--
> > >
> > > --
> > > Harsh J
>
> --
> --- Get your facts first, then you can distort them as you please.--

--
--- Get your facts first, then you can distort them as you please.--
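
For anyone finding this thread later: the question about testing webhdfs before running distcp was never answered on the list. Below is a minimal sketch of one possible approach, not what anyone in the thread actually did. It builds a WebHDFS REST URL (the `/webhdfs/v1/<path>?op=LISTSTATUS` form) that can be fetched with a browser or curl as a smoke test, and assembles a plain `hadoop distcp <src> <dst>` command for the hftp pull described above. The hostnames, the `/data` path, and ports 50070 and 8020 are assumptions for illustration; substitute whatever your NameNodes actually use.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op="LISTSTATUS", port=50070, user=None):
    """Build a WebHDFS REST URL for a read-only check against a NameNode.

    Port 50070 is an assumed NameNode HTTP port; adjust it to your setup.
    """
    params = {"op": op}
    if user:
        # Pseudo-authentication parameter used on clusters without Kerberos.
        params["user.name"] = user
    return "http://{}:{}/webhdfs/v1{}?{}".format(
        host, port, quote(path), urlencode(params)
    )


def distcp_command(src_uri, dst_uri):
    """Return the argument list for a plain `hadoop distcp <src> <dst>` run."""
    return ["hadoop", "distcp", src_uri, dst_uri]


# Hypothetical hostnames and paths, for illustration only.
smoke_test = webhdfs_url("target-nn.example.com", "/tmp")

# Pull run from the target cluster: read over hftp, write to local HDFS.
pull_cmd = distcp_command(
    "hftp://source-nn.example.com:50070/data",
    "hdfs://target-nn.example.com:8020/data",
)

print(smoke_test)
print(" ".join(pull_cmd))
```

If fetching the printed URL returns a JSON directory listing, webhdfs is at least answering on that NameNode. Copying one small directory with the printed distcp command before starting the full 100 TB transfer is a cheap second check.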