Thanks for the advice. Before I push or pull, are there any tests I can run before I do the distcp? I am not 100% sure I have my webhdfs set up properly.
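(As a quick sanity check on the webhdfs setup, something along these lines could work; the hostname and path are placeholders, 50070 assumes the default NameNode HTTP port, and dfs.webhdfs.enabled needs to be true in hdfs-site.xml:)

  # Hit the WebHDFS REST endpoint directly; an HTTP 200 with a JSON
  # FileStatuses payload means webhdfs is answering
  curl -i "http://<target-namenode>:50070/webhdfs/v1/tmp?op=LISTSTATUS"

  # Or go through the filesystem client, which exercises the same path
  # a distcp would use
  hadoop fs -ls webhdfs://<target-namenode>:50070/tmp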
On Fri, Oct 12, 2012 at 1:01 PM, J. Rottinghuis <jrottingh...@gmail.com> wrote:
> Rita,
>
> Are you doing a push from the source cluster or a pull from the target cluster?
>
> Doing a pull with distcp using hftp (to accommodate for version differences) has the advantage of slightly fewer transfers of blocks over the TORs. Each block is read from exactly the datanode where it is located, and on the target side (where the mappers run) the first write is to the local datanode. With RF=3, each block transfers out of the source TOR, into the target TOR, and out of the first target-cluster TOR into a different target-cluster TOR for replicas 2 & 3. Overall, 2 times out and 2 times in.
>
> Doing a pull with webhdfs://, the proxy server has to collect all blocks from the source DNs, and then they get pulled to the target machine. The situation is similar to the above, with the one extra transfer of all data going through the "proxy" server.
>
> Doing a push with webhdfs:// on the target cluster side, the mapper has to collect all blocks from one or more files (depending on the # of mappers used) and send them to the proxy server, which then writes the blocks to the target cluster. The advantage on the target cluster is that the blocks of a large multi-block file get spread over different datanodes on the target side. But if I'm counting correctly, you'll have the most data transfer: out of each source DN, through the source-cluster mapper DN, through the target proxy server, to the target DN, and out/in again for replicas 2 & 3.
>
> So convenience and setup aside, I think the first option would involve the fewest network transfers. Now if your clusters are separated over a WAN, then this may not matter at all.
>
> Just something to think about.
>
> Cheers,
>
> Joep
>
>
> On Fri, Oct 12, 2012 at 8:37 AM, Harsh J <ha...@cloudera.com> wrote:
> >
> > Rita,
> >
> > I believe, per the implementation, that webhdfs:// URIs should work fine. Please give it a try and let us know.
> >
> > On Fri, Oct 12, 2012 at 7:14 PM, Rita <rmorgan...@gmail.com> wrote:
> > > I have 2 different versions of Hadoop running. I need to copy a significant amount of data (100 TB) from one cluster to another. I know distcp is the way to do it. On the target cluster I have webhdfs running. Would that work?
> > >
> > > The DistCp manual says I need to use "HftpFileSystem". Is that necessary, or will webhdfs do the task?
> > >
> > > --
> > > --- Get your facts first, then you can distort them as you please.--
> >
> >
> > --
> > Harsh J


--
--- Get your facts first, then you can distort them as you please.--
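For reference, rough sketches of the two approaches discussed above; the hostnames, paths, and default ports (50070 HTTP, 8020 RPC) are placeholder assumptions, not values from the thread:

  # Joep's first option: a pull run from the target cluster, reading the
  # old cluster over hftp (read-only, version-independent) and writing
  # with the target cluster's native client
  hadoop distcp hftp://<source-namenode>:50070/data/src hdfs://<target-namenode>:8020/data/dst

  # The push variant: run from the source cluster, writing through the
  # webhdfs endpoint on the target cluster
  hadoop distcp hdfs://<source-namenode>:8020/data/src webhdfs://<target-namenode>:50070/data/dst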