Clusters don't really have identities beyond the addresses of their NameNodes and JobTrackers. In the example below, nn1 and nn2 are the hostnames of the NameNodes of the source and destination clusters, and the 8020 in each address assumes that those NameNodes are listening on the default port.
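For example, if the NameNodes were running on a non-default port, the same copy might look like this (the hostnames and port below are made up for illustration):

    hadoop distcp hdfs://nn1.example.com:9000/foo/bar \
        hdfs://nn2.example.com:9000/bar/foo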
Hadoop provides no inter-task or inter-job synchronization primitives, on purpose (even within a cluster, the most you get in terms of synchronization is the ability to "join" on the status of a running job to determine that it's completed). The model is designed to be as identity-independent as possible to make it more resilient to failure. If individual jobs/tasks could lock common resources, then the intermittent failure of tasks could easily cause deadlock. Using a file as a "scoreboard" or other communication mechanism between multiple jobs is not something explicitly designed for, and is likely to end in frustration.

Can you describe the goal you're trying to accomplish? It's likely that there's another, more MapReduce-y way of looking at the job and refactoring the code to make it work more cleanly with the intended programming model.

- Aaron

On Mon, Apr 6, 2009 at 10:08 PM, Mithila Nagendra <[email protected]> wrote:
> Thanks! I was looking at the link sent by Philip. The copy is done with the
> following command:
> hadoop distcp hdfs://nn1:8020/foo/bar \
>     hdfs://nn2:8020/bar/foo
>
> I was wondering if nn1 and nn2 are the names of the clusters or the names
> of the masters on each cluster.
>
> I wanted map/reduce tasks running on each of the two clusters to
> communicate with each other. I don't know if Hadoop provides for
> synchronization between two map/reduce tasks. The tasks run simultaneously,
> and they need to access a common file - something like a map/reduce task at
> a higher level utilizing the data produced by the map/reduce at the lower
> level.
>
> Mithila
>
> On Tue, Apr 7, 2009 at 7:57 AM, Owen O'Malley <[email protected]> wrote:
> >
> > On Apr 6, 2009, at 9:49 PM, Mithila Nagendra wrote:
> >
> >> Hey all
> >> I'm trying to connect two separate Hadoop clusters. Is it possible to
> >> do so? I need data to be shuttled back and forth between the two
> >> clusters. Any suggestions?
> >
> > You should use hadoop distcp. It is a map/reduce program that copies
> > data, typically from one cluster to another. If you have the hftp
> > interface enabled, you can use that to copy between hdfs clusters that
> > are different versions.
> >
> > hadoop distcp hftp://namenode1:1234/foo/bar hdfs://foo/bar
> >
> > -- Owen
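To make the "join on job status" point above concrete: the usual way to have a higher-level job consume the output of a lower-level one is to run them sequentially from a single driver, starting the second only once the first has completed. The following is only a rough sketch against the old org.apache.hadoop.mapred API; the class name, paths, and job names are hypothetical, and a real driver would also set mapper/reducer classes and input/output formats:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class ChainedJobsDriver {
      public static void main(String[] args) throws Exception {
        // "Lower level" job: writes its results to an intermediate directory.
        JobConf first = new JobConf(ChainedJobsDriver.class);
        first.setJobName("lower-level");
        FileInputFormat.setInputPaths(first, new Path("/input"));
        FileOutputFormat.setOutputPath(first, new Path("/intermediate"));

        // runJob() blocks until the job finishes; this is the "join" on the
        // running job's status.
        RunningJob lower = JobClient.runJob(first);

        if (lower.isSuccessful()) {
          // "Higher level" job: reads the completed output of the first job.
          JobConf second = new JobConf(ChainedJobsDriver.class);
          second.setJobName("higher-level");
          FileInputFormat.setInputPaths(second, new Path("/intermediate"));
          FileOutputFormat.setOutputPath(second, new Path("/final"));
          JobClient.runJob(second);
        }
      }
    }

If the two jobs ran on different clusters, the driver could also invoke distcp between the two steps to move the intermediate directory across; the ordering guarantee still comes from waiting on the first job's completion rather than from any shared file.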
