Re: connecting two clusters
Clusters don't really have identities beyond the addresses of their NameNodes and JobTrackers. In your distcp example, nn1 and nn2 are the hostnames of the NameNodes of the source and destination clusters; the 8020 in each address assumes they're listening on the default port.

Hadoop provides no inter-task or inter-job synchronization primitives, on purpose. Even within a cluster, the most you get in terms of synchronization is the ability to join on the status of a running job to determine that it has completed. The model is designed to be as identity-independent as possible to make it more resilient to failure: if individual jobs or tasks could lock common resources, then the intermittent failure of tasks could easily cause deadlock. Using a file as a scoreboard or other communication mechanism between multiple jobs is not something Hadoop was explicitly designed for, and it is likely to end in frustration.

Can you describe the goal you're trying to accomplish? It's likely that there's another, more MapReduce-y way of looking at the job and refactoring the code so that it works more cleanly with the intended programming model.

- Aaron
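P.S. To make the "join on the status of a running job" point concrete, here is a rough, untested sketch against the old (0.19-era) JobClient API; the class name, the sleep interval, and the job setup are placeholders:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class WaitForUpstreamJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WaitForUpstreamJob.class);
    // ... set mapper/reducer classes and input/output paths here ...

    JobClient client = new JobClient(conf);
    RunningJob job = client.submitJob(conf);    // returns immediately
    while (!job.isComplete()) {                 // "join" on the running job's status
      Thread.sleep(10 * 1000);
    }
    if (!job.isSuccessful()) {
      throw new RuntimeException("Upstream job failed: " + job.getJobID());
    }
    // Only now is it safe to launch whatever consumes this job's output.
  }
}

For a simple sequential pipeline, the blocking JobClient.runJob(conf) call does this wait for you.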
Re: connecting two clusters
Aaron,

We hope to achieve a level of pipelining between the two clusters, similar to how pipelining is done when executing RDB queries. You can look at it as the producer-consumer problem: one cluster produces some data and the other cluster consumes it. The issue that has to be dealt with here is the data exchange between the clusters - synchronized interaction between the map-reduce jobs on the two clusters is what I'm hoping to achieve.

Mithila
Re: connecting two clusters
Hi Mithila,

Unfortunately, Hadoop MapReduce jobs determine their inputs as soon as they begin; the inputs for the job are then fixed, so additional files that arrive in the input directory after processing has begun do not participate in the job. And HDFS does not currently support appends to files, so existing files cannot be updated.

A typical way this sort of problem is handled is to do the processing in incremental wavefronts: process A generates some data which goes into an incoming directory for process B; process B starts on a timer every so often, collects the new input files, and works on them. After it's done, it moves the inputs it processed into a done directory. In the meantime, new files may have arrived, and after another time interval another round of process B starts.

The major limitation of this model is that it requires your process to work incrementally, or that process B emits a small enough volume of data each time that subsequent iterations can load a summary table of the previous results into memory. Look into using the DistributedCache to disseminate such files.

Also, why are you using two MapReduce clusters for this, as opposed to one? Is there a common HDFS cluster behind them? You'll probably get much better performance for the overall process if the output data from one job does not need to be transferred to another cluster before it is further processed.

Does this model make sense?

- Aaron
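P.S. A rough, untested sketch of the kind of driver I have in mind for process B; the paths, the timer interval, and the job setup are all placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ProcessBDriver {
  public static void main(String[] args) throws Exception {
    Path incoming = new Path("/pipeline/incoming");          // placeholder paths
    Path done = new Path("/pipeline/done");
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    while (true) {
      FileStatus[] ready = fs.listStatus(incoming);          // snapshot of the new inputs
      if (ready != null && ready.length > 0) {
        JobConf job = new JobConf(conf, ProcessBDriver.class);
        // ... set the mapper/reducer classes for process B here ...
        for (FileStatus f : ready) {
          FileInputFormat.addInputPath(job, f.getPath());
        }
        FileOutputFormat.setOutputPath(job,
            new Path("/pipeline/out-" + System.currentTimeMillis()));
        JobClient.runJob(job);                                // blocks until this wave completes
        for (FileStatus f : ready) {                          // move only what this wave consumed
          fs.rename(f.getPath(), new Path(done, f.getPath().getName()));
        }
      }
      Thread.sleep(60 * 1000);                                // wait before starting the next wave
    }
  }
}

Each pass only renames the files it actually submitted, so anything that arrives while a wave is running gets picked up by the next one.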
Re: connecting two clusters
Hello Aaron,

Yes, it makes a lot of sense! Thank you! :) The incremental wavefront model is another option we are looking at. Currently we have two map/reduce levels, and the upper level has to wait until the lower map/reduce has produced the entire result set. We want to avoid this... We were thinking of using two separate clusters so that these levels can run on them, hoping to achieve better resource utilization. We were hoping to connect the two clusters in some way so that the processes can interact, but it seems like Hadoop is limited in that sense.

I was wondering how a common HDFS system can be set up for this purpose. I tried looking for material on synchronization between two map-reduce clusters - there is limited/no information available out on the Web! If we stick to the incremental wavefront model, then we could probably work with one cluster.

Mithila
Re: connecting two clusters
Hi,

I'm not sure if you have looked at this option, but instead of having two HDFS instances, you can have one HDFS and two map-red clusters (both pointing to the same HDFS) and then do the sync mechanisms on top of that.

-Sagar
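P.S. The wiring for that (an untested sketch; the NameNode and JobTracker hostnames and ports are made up) is just two job configurations that share fs.default.name but point at different mapred.job.tracker addresses:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoClusterPipeline {
  public static void main(String[] args) throws Exception {
    // Producer job: submitted to the first map-red cluster's JobTracker.
    JobConf producer = new JobConf(TwoClusterPipeline.class);
    producer.set("fs.default.name", "hdfs://nn1:8020");      // the one shared HDFS
    producer.set("mapred.job.tracker", "jt1:9001");          // JobTracker of cluster 1 (made-up host/port)
    // ... set the producer's mapper/reducer and input/output paths ...

    // Consumer job: same HDFS, but the second cluster's JobTracker.
    JobConf consumer = new JobConf(TwoClusterPipeline.class);
    consumer.set("fs.default.name", "hdfs://nn1:8020");
    consumer.set("mapred.job.tracker", "jt2:9001");          // JobTracker of cluster 2 (made-up host/port)
    // ... the consumer's input path is the producer's output path ...

    JobClient.runJob(producer);   // runJob blocks until the producer finishes
    JobClient.runJob(consumer);   // then the consumer runs on the other cluster
  }
}

Both jobs read and write the same HDFS, so the consumer can pick up the producer's output directory without any distcp step.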
connecting two clusters
Hey all,

I'm trying to connect two separate Hadoop clusters. Is it possible to do so? I need data to be shuttled back and forth between the two clusters. Any suggestions?

Thank you!
Mithila Nagendra
Arizona State University
Re: connecting two clusters
DistCp is the standard way to copy data between clusters. What it does is run a mapreduce job to copy data between a source cluster and a destination cluster. See http://hadoop.apache.org/core/docs/r0.19.1/distcp.html
Re: connecting two clusters
On Apr 6, 2009, at 9:49 PM, Mithila Nagendra wrote:

Hey all, I'm trying to connect two separate Hadoop clusters. Is it possible to do so? I need data to be shuttled back and forth between the two clusters. Any suggestions?

You should use hadoop distcp. It is a map/reduce program that copies data, typically from one cluster to another. If you have the hftp interface enabled, you can use that to copy between hdfs clusters that are different versions:

hadoop distcp hftp://namenode1:1234/foo/bar hdfs://foo/bar

-- Owen
Re: connecting two clusters
Thanks! I was looking at the link sent by Philip. The copy is done with the following command:

hadoop distcp hdfs://nn1:8020/foo/bar \
    hdfs://nn2:8020/bar/foo

I was wondering if nn1 and nn2 are the names of the clusters or the names of the masters on each cluster.

I wanted map/reduce tasks running on each of the two clusters to communicate with each other. I don't know if Hadoop provides for synchronization between two map/reduce tasks. The tasks run simultaneously, and they need to access a common file - something like a map/reduce task at a higher level utilizing the data produced by the map/reduce at the lower level.

Mithila