Re: distcp question

2012-10-12 Thread J. Rottinghuis
Rita,

Are you doing a push from the source cluster or a pull from the target
cluster?

Doing a pull with distcp using hftp (to accommodate version differences)
has the advantage of slightly fewer transfers of blocks over the TORs
(top-of-rack switches). Each block is read from exactly the datanode where
it is located, and on the target side (where the mappers run) the first
write is to the local datanode. With RF=3, each block transfers out of the
source TOR, into the target TOR, then out of the first target-cluster TOR
into a different target-cluster TOR for replicas 2 & 3. Overall: 2 times
out and 2 times in.

Doing a pull with webhdfs://, the proxy server has to collect all blocks
from the source DNs, then they get pulled to the target machine.
The situation is similar to the one above, with one extra transfer of all
data going through the proxy server.

Doing a push with webhdfs:// on the target cluster side, the mapper has to
collect all blocks from one or more files (depending on the number of
mappers used) and send them to the proxy server, which then writes the
blocks to the target cluster. The advantage on the target cluster is that
the blocks of a large multi-block file get spread over different datanodes
on the target side. But if I'm counting correctly, you'll have the most
data transfer: out of each source DN, through the source-cluster mapper DN,
through the target proxy server, to the target DN, and out/in again for
replicas 2 & 3.
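
The three options above differ mainly in where the job runs and which URI scheme sits on each side. A minimal sketch of the corresponding `hadoop distcp <source> <target>` invocations, assuming hypothetical host names and commonly used default ports (50070 for the NameNode HTTP port, 8020 for NameNode RPC); check both against your own configuration:

```python
# Hypothetical NameNode addresses; substitute your own.
SOURCE_NN = "source-nn.example.com"
TARGET_NN = "target-nn.example.com"
HTTP_PORT = 50070  # assumed NameNode HTTP port (used by hftp and webhdfs)
RPC_PORT = 8020    # assumed NameNode RPC port (used by hdfs)


def distcp_command(option, path):
    """Return the argv list for one of the three copy options."""
    if option == "pull-hftp":
        # Run on the target cluster: read over hftp, write over native hdfs.
        src = f"hftp://{SOURCE_NN}:{HTTP_PORT}{path}"
        dst = f"hdfs://{TARGET_NN}:{RPC_PORT}{path}"
    elif option == "pull-webhdfs":
        # Run on the target cluster: read over webhdfs, write over native hdfs.
        src = f"webhdfs://{SOURCE_NN}:{HTTP_PORT}{path}"
        dst = f"hdfs://{TARGET_NN}:{RPC_PORT}{path}"
    elif option == "push-webhdfs":
        # Run on the source cluster: read over native hdfs, write over webhdfs.
        src = f"hdfs://{SOURCE_NN}:{RPC_PORT}{path}"
        dst = f"webhdfs://{TARGET_NN}:{HTTP_PORT}{path}"
    else:
        raise ValueError(f"unknown option: {option}")
    return ["hadoop", "distcp", src, dst]


if __name__ == "__main__":
    for option in ("pull-hftp", "pull-webhdfs", "push-webhdfs"):
        print(" ".join(distcp_command(option, "/data")))
```

The sketch only builds the command line; it does not run anything. Note that hftp is read-only, so it can only appear on the source side.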

So convenience and setup aside, I think the first option would involve the
fewest network transfers.
Now if your clusters are separated by a WAN, then this may not matter
at all.
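
The counting above can be written down as a small model. This is only a sketch of the reasoning in this message: the hop lists are one reading of the three descriptions, replicas 2 and 3 are counted as a single cross-rack hop (as in the "2 times out, 2 times in" count), and the 100 TB figure is the dataset size mentioned in the original question.

```python
# Network hops taken by one block under each option, as (sender, receiver).
# These lists are an interpretation of the descriptions above, not measurements.
HOPS = {
    "pull-hftp": [
        ("source DN", "target mapper DN"),
        ("target mapper DN", "other target rack (replicas 2 & 3)"),
    ],
    "pull-webhdfs": [
        ("source DN", "source proxy"),
        ("source proxy", "target mapper DN"),
        ("target mapper DN", "other target rack (replicas 2 & 3)"),
    ],
    "push-webhdfs": [
        ("source DN", "source mapper DN"),
        ("source mapper DN", "target proxy"),
        ("target proxy", "target DN"),
        ("target DN", "other target rack (replicas 2 & 3)"),
    ],
}


def transfers_per_block(option):
    """Number of network hops a single block makes under the given option."""
    return len(HOPS[option])


def total_tb_moved(option, dataset_tb):
    """Total data crossing the network, counting every hop once."""
    return transfers_per_block(option) * dataset_tb


if __name__ == "__main__":
    for option in HOPS:
        print(option, transfers_per_block(option), total_tb_moved(option, 100))
```

Under these assumptions the hftp pull moves each block twice, the webhdfs pull three times, and the webhdfs push four times.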

Just something to think about.

Cheers,

Joep


On Fri, Oct 12, 2012 at 8:37 AM, Harsh J ha...@cloudera.com wrote:

 Rita,

 I believe, per the implementation, that webhdfs:// URIs should work
 fine. Please give it a try and let us know.

 On Fri, Oct 12, 2012 at 7:14 PM, Rita rmorgan...@gmail.com wrote:
  I have 2 different versions of Hadoop running. I need to copy a significant
  amount of data (100 TB) from one cluster to another. I know distcp is the
  way to do it. On the target cluster I have webhdfs running. Would that work?
 
  The DistCp manual says I need to use HftpFileSystem. Is that necessary,
  or will webhdfs do the task?
 
 
 
  --
  --- Get your facts first, then you can distort them as you please.--



 --
 Harsh J



Re: distcp question

2012-10-12 Thread Rita
Thanks for the advice.

Before I push or pull, are there any tests I can run before I do the
distcp? I am not 100% sure I have my webhdfs set up properly.
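
One low-risk check is a read-only directory listing through the WebHDFS REST API (`op=LISTSTATUS` under `/webhdfs/v1`). A minimal sketch, assuming a hypothetical host name, the commonly used HTTP port 50070, and a cluster without Kerberos; the helper names are illustrative only:

```python
import json
import urllib.request
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op="LISTSTATUS", user=None):
    """Build a WebHDFS REST URL for the given HDFS path and operation."""
    params = {"op": op}
    if user:
        # user.name is the pseudo-authentication parameter on unsecured clusters.
        params["user.name"] = user
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(params)}"


def list_status(host, port, path, user=None, timeout=10):
    """Fetch a directory listing; raises if the endpoint is unreachable."""
    url = webhdfs_url(host, port, path, user=user)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["FileStatuses"]["FileStatus"]


if __name__ == "__main__":
    # Hypothetical NameNode address; replace with your own before running.
    for entry in list_status("target-nn.example.com", 50070, "/"):
        print(entry["type"], entry["pathSuffix"])
```

If the listing comes back, the endpoint is at least reachable and enabled. Running `hadoop fs -ls` against a `webhdfs://` URI, or a distcp of a single small directory, exercises the same path from the Hadoop client side.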





-- 
--- Get your facts first, then you can distort them as you please.--


Re: Re: distcp question

2012-10-12 Thread kojie . fu





kojie.fu

From: Rita
Date: 2012-10-13 03:19
To: common-user
Subject: Re: distcp question
Thanks for the advice.

Before I push or pull, are there any tests I can run before I do the
distcp? I am not 100% sure I have my webhdfs set up properly.





-- 
--- Get your facts first, then you can distort them as you please.--

Re: Re: distcp question

2012-10-12 Thread Rita
Never mind. Figured it out.




-- 
--- Get your facts first, then you can distort them as you please.--