Aaron,
We hope to achieve a level of pipelining between two clusters - similar to
how pipelining is done when executing RDB queries. You can look at it as
the producer-consumer problem: one cluster produces some data and the other
cluster consumes it. The issue that has to be dealt with here is the data
exchange between the clusters - synchronized interaction between the
map-reduce jobs on the two clusters is what I'm hoping to achieve.
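
The closest thing I can do today is a strictly sequential chain like the
sketch below (all of the hostnames, paths, and the driver class are just
placeholders, and the copy step in the middle would be distcp or something
similar); what we would like is for the two jobs to overlap rather than run
one after the other:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class PipelineDriver {
  public static void main(String[] args) throws Exception {
    // 1. Producer job, submitted to cluster 1 (namenode nn1, jobtracker jt1).
    JobConf producer = new JobConf(PipelineDriver.class);
    producer.set("fs.default.name", "hdfs://nn1:8020");
    producer.set("mapred.job.tracker", "jt1:8021");
    FileInputFormat.setInputPaths(producer, new Path("/input"));
    FileOutputFormat.setOutputPath(producer, new Path("/stage1"));
    JobClient.runJob(producer);   // blocks until the producer job completes

    // 2. Copy the intermediate data over to cluster 2, e.g. by running
    //    hadoop distcp hdfs://nn1:8020/stage1 hdfs://nn2:8020/stage1
    //    (by hand or from a wrapper script; omitted here).

    // 3. Consumer job, submitted to cluster 2, reading the copied data.
    JobConf consumer = new JobConf(PipelineDriver.class);
    consumer.set("fs.default.name", "hdfs://nn2:8020");
    consumer.set("mapred.job.tracker", "jt2:8021");
    FileInputFormat.setInputPaths(consumer, new Path("/stage1"));
    FileOutputFormat.setOutputPath(consumer, new Path("/stage2"));
    JobClient.runJob(consumer);
  }
}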

Mithila

On Tue, Apr 7, 2009 at 10:10 AM, Aaron Kimball <[email protected]> wrote:

> Clusters don't really have identities beyond the addresses of the NameNodes
> and JobTrackers. In the example below, nn1 and nn2 are the hostnames of the
> namenodes of the source and destination clusters. The 8020 in each address
> assumes that they're on the default port.
>
> Hadoop provides no inter-task or inter-job synchronization primitives, on
> purpose (even within a cluster, the most you get in terms of
> synchronization is the ability to "join" on the status of a running job to
> determine that it's completed). The model is designed to be as
> identity-independent as possible to make it more resilient to failure. If
> individual jobs/tasks could lock common resources, then the intermittent
> failure of tasks could easily cause deadlock.
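>
> Concretely, that "join" is nothing more than a blocking wait on a
> submitted job. A rough sketch with the old mapred API (the actual job
> configuration is elided here):
>
> import org.apache.hadoop.mapred.JobClient;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.RunningJob;
>
> public class WaitThenChain {
>   public static void main(String[] args) throws Exception {
>     JobConf first = new JobConf(WaitThenChain.class);
>     // ... set input/output paths, mapper and reducer on 'first' ...
>     RunningJob running = new JobClient(first).submitJob(first);
>     running.waitForCompletion();  // the "join": block until the job is done
>     if (running.isSuccessful()) {
>       // only then kick off the follow-on work (another job, a distcp, ...)
>     }
>   }
> }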
>
> Using a file as a "scoreboard" or other communication mechanism between
> multiple jobs is not something Hadoop was explicitly designed for, and it
> is likely to end in frustration. Can you describe the goal you're trying
> to accomplish? It's likely that there's another, more MapReduce-y way of
> looking at the job and refactoring the code to make it work more cleanly
> with the intended programming model.
>
> - Aaron
>
> On Mon, Apr 6, 2009 at 10:08 PM, Mithila Nagendra <[email protected]> wrote:
>
> > Thanks! I was looking at the link sent by Philip. The copy is done with
> > the following command:
> > hadoop distcp hdfs://nn1:8020/foo/bar \
> >                    hdfs://nn2:8020/bar/foo
> >
> > I was wondering if nn1 and nn2 are the names of the clusters or the
> > names of the masters on each cluster.
> >
> > I wanted map/reduce tasks running on each of the two clusters to
> > communicate with each other. I don't know if Hadoop provides for
> > synchronization between two map/reduce tasks. The tasks run
> > simultaneously, and they need to access a common file - something like a
> > map/reduce task at a higher level utilizing the data produced by the
> > map/reduce task at the lower level.
> >
> > Mithila
> >
> > On Tue, Apr 7, 2009 at 7:57 AM, Owen O'Malley <[email protected]> wrote:
> >
> > >
> > > On Apr 6, 2009, at 9:49 PM, Mithila Nagendra wrote:
> > >
> > >  Hey all
> > >> I'm trying to connect two separate Hadoop clusters. Is it possible
> > >> to do so? I need data to be shuttled back and forth between the two
> > >> clusters. Any suggestions?
> > >>
> > >
> > > You should use hadoop distcp. It is a map/reduce program that copies
> > > data, typically from one cluster to another. If you have the hftp
> > > interface enabled, you can use that to copy between hdfs clusters that
> > > are different versions.
> > >
> > > hadoop distcp hftp://namenode1:1234/foo/bar hdfs://foo/bar
> > >
> > > -- Owen
> > >
> >
>
