Re: distcp question
Rita,

Are you doing a push from the source cluster or a pull from the target cluster?

Doing a pull with distcp using hftp (to accommodate version differences) has the advantage of slightly fewer block transfers over the TORs (top-of-rack switches). Each block is read from exactly the datanode where it is located, and on the target side (where the mappers run) the first write is to the local datanode. With RF=3, each block transfers out of the source TOR, into the target TOR, then out of the first target-cluster TOR and into a different target-cluster TOR for replicas 2 and 3. Overall: 2 times out and 2 times in.

Doing a pull with webhdfs://, the proxy server has to collect all blocks from the source DNs, and then they get pulled to the target machine. The situation is similar to the one above, with one extra transfer of all data going through the proxy server.

Doing a push with webhdfs:// on the target cluster side, the mapper has to collect all blocks from one or more files (depending on the number of mappers used) and send them to the proxy server, which then writes the blocks to the target cluster. The advantage on the target cluster is that the blocks of a large multi-block file get spread over different datanodes on the target side. But if I'm counting correctly, you'll have the most data transfer: out of each source DN, through the source-cluster mapper DN, through the target proxy server, to the target DN, and out/in again for replicas 2 and 3.

So convenience and setup aside, I think the first option would involve the fewest network transfers. Now if your clusters are separated by a WAN, then this may not matter at all. Just something to think about.

Cheers,
Joep

On Fri, Oct 12, 2012 at 8:37 AM, Harsh J ha...@cloudera.com wrote:

Rita,

I believe, per the implementation, that webhdfs:// URIs should work fine. Please give it a try and let us know.

On Fri, Oct 12, 2012 at 7:14 PM, Rita rmorgan...@gmail.com wrote:

I have 2 different versions of Hadoop running. I need to copy a significant amount of data (100 TB) from one cluster to another. I know distcp is the way to go. On the target cluster I have webhdfs running. Would that work? The DistCp manual says I need to use HftpFileSystem. Is that necessary, or will webhdfs do the task?

--
--- Get your facts first, then you can distort them as you please.--

--
Harsh J
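To make Joep's counting easier to follow, here is a minimal Python sketch that tallies the per-block transfers for the three options. The hop lists paraphrase his message and are an assumption of this sketch, not a measurement: they treat every hop as a network transfer (a worst case, since a mapper may happen to run on the datanode holding the block), and the names `OPTIONS`, `transfers_per_block` and `total_traffic_tb` are made up for illustration.

```python
# Rough tally of per-block transfers for the three options in the message.
# Assumption: every hop listed here crosses the network (worst case).

# With RF=3 the first target replica is forwarded to two more datanodes.
REPLICA_HOPS = [
    "target DN 1 -> target DN 2",
    "target DN 2 -> target DN 3",
]

OPTIONS = {
    # Mapper runs on the target cluster; its first write is local.
    "pull, hftp": [
        "source DN -> target mapper DN (first replica)",
    ] + REPLICA_HOPS,
    # Same as above, plus one pass through the proxy server.
    "pull, webhdfs": [
        "source DN -> proxy server",
        "proxy server -> target mapper DN (first replica)",
    ] + REPLICA_HOPS,
    # Mapper runs on the source cluster and sends data to the target proxy.
    "push, webhdfs": [
        "source DN -> source mapper DN",
        "source mapper DN -> target proxy server",
        "target proxy server -> target DN 1 (first replica)",
    ] + REPLICA_HOPS,
}


def transfers_per_block(option):
    """Number of hops each block makes under the given option."""
    return len(OPTIONS[option])


def total_traffic_tb(option, dataset_tb=100):
    """Total data moved if every hop carries the full dataset once."""
    return transfers_per_block(option) * dataset_tb


def report():
    """Print the tally for each option, fewest transfers first."""
    for name in sorted(OPTIONS, key=transfers_per_block):
        print(name, transfers_per_block(name), total_traffic_tb(name), "TB")
```

Calling `report()` lists the options in order of hop count. Under these assumptions the hftp pull comes out lowest and the webhdfs push highest, which matches the ranking in the message; the absolute numbers depend entirely on the assumed hop lists.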
Re: distcp question
Thanks for the advice. Before I push or pull, are there any tests I can run before I do the distcp? I am not 100% sure I have my webhdfs set up properly.

--
--- Get your facts first, then you can distort them as you please.--
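One low-risk check before a large copy is to list a directory through the WebHDFS REST API and see whether a sensible answer comes back. The sketch below is only an illustration of that idea: the host name is hypothetical, port 50070 is assumed to be the NameNode HTTP port, and it assumes an unsecured cluster with `dfs.webhdfs.enabled` set to true. It is not necessarily what was done in this thread.

```python
import json
import urllib.parse
import urllib.request


def webhdfs_url(host, path, op, port=50070, user=None):
    """Build a WebHDFS REST URL; host, port and user are caller-supplied."""
    query = {"op": op}
    if user:
        # Simple (non-Kerberos) authentication passes the user as a parameter.
        query["user.name"] = user
    return "http://%s:%d/webhdfs/v1%s?%s" % (
        host,
        port,
        urllib.parse.quote(path),
        urllib.parse.urlencode(query),
    )


def list_status(host, path, **kwargs):
    """Return the entry names under `path` using op=LISTSTATUS."""
    url = webhdfs_url(host, path, "LISTSTATUS", **kwargs)
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = json.load(resp)
    return [entry["pathSuffix"] for entry in body["FileStatuses"]["FileStatus"]]


# Example (hypothetical host):
#   print(list_status("namenode.example.com", "/", user="rita"))
```

If the listing matches what `hadoop fs -ls` shows on the cluster itself, the REST endpoint is at least reachable and answering. Copying a small test directory with distcp before the full 100 TB would be a natural next step.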
Re: Re: distcp question
kojie.fu

From: Rita
Date: 2012-10-13 03:19
To: common-user
Subject: Re: distcp question
Re: Re: distcp question
Never mind. Figured it out.

--
--- Get your facts first, then you can distort them as you please.--
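The pull and push variants discussed earlier in the thread differ mainly in which cluster runs the job and which URI scheme sits on each side. Here is a minimal Python sketch that only assembles the two command lines, without running them. The host names and the `/data` path are hypothetical, and ports 50070 (NameNode HTTP) and 8020 (NameNode RPC) are assumed defaults that may differ on a given cluster.

```python
def uri(scheme, host, port, path):
    """Format a Hadoop filesystem URI such as hftp://host:port/path."""
    return "%s://%s:%d%s" % (scheme, host, port, path)


def distcp_argv(source_uri, target_uri):
    """Argument list for a basic `hadoop distcp <source> <target>` call."""
    return ["hadoop", "distcp", source_uri, target_uri]


# Pull: run on the TARGET cluster. hftp is read-only, so it can only be
# the source side of the copy.
pull = distcp_argv(
    uri("hftp", "source-nn.example.com", 50070, "/data"),
    uri("hdfs", "target-nn.example.com", 8020, "/data"),
)

# Push: run on the SOURCE cluster and write through webhdfs on the target.
push = distcp_argv(
    uri("hdfs", "source-nn.example.com", 8020, "/data"),
    uri("webhdfs", "target-nn.example.com", 50070, "/data"),
)

# To launch one of them: subprocess.run(pull, check=True)
```

Building the argument list separately from running it makes it easy to print the command and review it first, which matters for a 100 TB copy.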