Aaron,

We hope to achieve a level of pipelining between the two clusters, similar to how pipelining is done when executing RDB queries. You can look at it as the producer-consumer problem: one cluster produces some data and the other cluster consumes it. The issue that has to be dealt with here is the data exchange between the clusters - synchronized interaction between the map-reduce jobs on the two clusters is what I'm hoping to achieve.
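Roughly, the driver I have in mind would look something like the sketch below (old mapred API; nn1/nn2, jt1/jt2 and the paths are just placeholders, and the mapper/reducer setup is left out). Step 2, the hand-off between the clusters, is the part I am unsure about:

    // Rough sketch of the two-cluster pipeline (Hadoop 0.19-era mapred API).
    // Hostnames, ports and paths are placeholders, not a real setup.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class TwoClusterPipeline {
      public static void main(String[] args) throws Exception {
        // 1. Producer job, pointed at cluster A's NameNode and JobTracker.
        JobConf producer = new JobConf(TwoClusterPipeline.class);
        producer.setJobName("producer");
        producer.set("fs.default.name", "hdfs://nn1:8020");
        producer.set("mapred.job.tracker", "jt1:8021");
        // ... setMapperClass/setReducerClass etc. omitted ...
        FileInputFormat.setInputPaths(producer, new Path("/input"));
        FileOutputFormat.setOutputPath(producer, new Path("/staging/out"));
        JobClient.runJob(producer);   // blocks until the producer finishes

        // 2. Hand the output to cluster B, e.g. with
        //    hadoop distcp hdfs://nn1:8020/staging/out hdfs://nn2:8020/staging/in
        //    run between the two jobs.

        // 3. Consumer job on cluster B reads what the producer wrote.
        JobConf consumer = new JobConf(TwoClusterPipeline.class);
        consumer.setJobName("consumer");
        consumer.set("fs.default.name", "hdfs://nn2:8020");
        consumer.set("mapred.job.tracker", "jt2:8021");
        // ... setMapperClass/setReducerClass etc. omitted ...
        FileInputFormat.setInputPaths(consumer, new Path("/staging/in"));
        FileOutputFormat.setOutputPath(consumer, new Path("/final/out"));
        JobClient.runJob(consumer);   // blocks until the consumer finishes
      }
    }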
Mithila

On Tue, Apr 7, 2009 at 10:10 AM, Aaron Kimball <[email protected]> wrote:

> Clusters don't really have identities beyond the addresses of the NameNodes
> and JobTrackers. In the example below, nn1 and nn2 are the hostnames of the
> namenodes of the source and destination clusters. The 8020 in each address
> assumes that they're on the default port.
>
> Hadoop provides no inter-task or inter-job synchronization primitives, on
> purpose (even within a cluster, the most you get in terms of synchronization
> is the ability to "join" on the status of a running job to determine that
> it's completed). The model is designed to be as identity-independent as
> possible to make it more resilient to failure. If individual jobs/tasks
> could lock common resources, then the intermittent failure of tasks could
> easily cause deadlock.
>
> Using a file as a "scoreboard" or other communication mechanism between
> multiple jobs is not something explicitly designed for, and likely to end
> in frustration. Can you describe the goal you're trying to accomplish? It's
> likely that there's another, more MapReduce-y way of looking at the job and
> refactoring the code to make it work more cleanly with the intended
> programming model.
>
> - Aaron
>
> On Mon, Apr 6, 2009 at 10:08 PM, Mithila Nagendra <[email protected]> wrote:
>
> > Thanks! I was looking at the link sent by Philip. The copy is done with
> > the following command:
> >
> >     hadoop distcp hdfs://nn1:8020/foo/bar \
> >         hdfs://nn2:8020/bar/foo
> >
> > I was wondering if nn1 and nn2 are the names of the clusters or the names
> > of the masters on each cluster.
> >
> > I wanted map/reduce tasks running on each of the two clusters to
> > communicate with each other. I don't know if Hadoop provides for
> > synchronization between two map/reduce tasks. The tasks run
> > simultaneously, and they need to access a common file - something like a
> > map/reduce task at a higher level utilizing the data produced by the
> > map/reduce at the lower level.
> >
> > Mithila
> >
> > On Tue, Apr 7, 2009 at 7:57 AM, Owen O'Malley <[email protected]> wrote:
> >
> > > On Apr 6, 2009, at 9:49 PM, Mithila Nagendra wrote:
> > >
> > > > Hey all
> > > > I'm trying to connect two separate Hadoop clusters. Is it possible to
> > > > do so? I need data to be shuttled back and forth between the two
> > > > clusters. Any suggestions?
> > >
> > > You should use hadoop distcp. It is a map/reduce program that copies
> > > data, typically from one cluster to another. If you have the hftp
> > > interface enabled, you can use that to copy between hdfs clusters that
> > > are different versions.
> > >
> > >     hadoop distcp hftp://namenode1:1234/foo/bar hdfs://foo/bar
> > >
> > > -- Owen
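P.S. The "join" Aaron mentions seems to come down to blocking on a submitted job's status; something like this minimal sketch (same old mapred API, placeholder JobTracker address, job setup omitted) is what I would try before kicking off the copy and the consumer job:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class WaitForProducer {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        conf.set("mapred.job.tracker", "jt1:8021");  // placeholder JobTracker
        // ... input/output/mapper/reducer setup omitted ...
        JobClient client = new JobClient(conf);

        RunningJob job = client.submitJob(conf);  // returns without waiting
        job.waitForCompletion();                  // the "join": block until done
        if (job.isSuccessful()) {
          // safe to start distcp and the consumer job on the other cluster
        }
      }
    }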
