Re: connecting two clusters

2009-04-07 Thread Aaron Kimball
Clusters don't really have identities beyond the addresses of the NameNodes
and JobTrackers. In the example below, nn1 and nn2 are the hostnames of the
namenodes of the source and destination clusters. The 8020 in each address
assumes that they're on the default port.
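
For example, here is a rough sketch of how a client addresses each cluster
purely by its namenode URI (the class name below is a placeholder; the paths
are the same example paths as above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TwoClusterCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // A fully-qualified path carries the namenode address, so the same
    // client can reach either cluster.
    Path onSource = new Path("hdfs://nn1:8020/foo/bar");
    Path onDest = new Path("hdfs://nn2:8020/bar/foo");
    for (Path p : new Path[] { onSource, onDest }) {
      FileSystem fs = p.getFileSystem(conf);  // resolves to that cluster's namenode
      System.out.println(p + " exists: " + fs.exists(p));
    }
  }
}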

Hadoop provides no inter-task or inter-job synchronization primitives, on
purpose (even within a cluster, the most you get in terms of synchronization
is the ability to join on the status of a running job to determine that
it's completed). The model is designed to be as identity-independent as
possible to make it more resilient to failure. If individual jobs/tasks
could lock common resources, then the intermittent failure of tasks could
easily cause deadlock.
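
To make that concrete, here is a rough sketch of the join-on-completion
pattern using the 0.19-era mapred API (the job names and paths are made up,
and the identity mapper/reducer stand in for real map/reduce logic); the
second job is only submitted once the first one has finished:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    // The "lower level" job.
    JobConf lower = new JobConf(ChainedJobs.class);
    lower.setJobName("lower-level");
    lower.setMapperClass(IdentityMapper.class);    // real map logic goes here
    lower.setReducerClass(IdentityReducer.class);  // real reduce logic goes here
    FileInputFormat.setInputPaths(lower, new Path("/input"));
    FileOutputFormat.setOutputPath(lower, new Path("/intermediate"));

    // runJob() blocks until the job completes (and throws IOException if it
    // fails); joining on completion is the only synchronization you get.
    JobClient.runJob(lower);

    // The "upper level" job consumes the lower job's output.
    JobConf upper = new JobConf(ChainedJobs.class);
    upper.setJobName("upper-level");
    upper.setMapperClass(IdentityMapper.class);
    upper.setReducerClass(IdentityReducer.class);
    FileInputFormat.setInputPaths(upper, new Path("/intermediate"));
    FileOutputFormat.setOutputPath(upper, new Path("/final"));
    JobClient.runJob(upper);
  }
}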

Using a file as a scoreboard or other communication mechanism between
multiple jobs is not something Hadoop was designed for, and it is likely to
end in frustration. Can you describe the goal you're trying to accomplish? It's
likely that there's another, more MapReduce-y way of looking at the job and
refactoring the code to make it work more cleanly with the intended
programming model.

- Aaron

On Mon, Apr 6, 2009 at 10:08 PM, Mithila Nagendra mnage...@asu.edu wrote:

 Thanks! I was looking at the link sent by Philip. The copy is done with the
 following command:
 hadoop distcp hdfs://nn1:8020/foo/bar \
    hdfs://nn2:8020/bar/foo

 I was wondering if nn1 and nn2 are the names of the clusters or the names of
 the masters on each cluster.

 I wanted map/reduce tasks running on each of the two clusters to communicate
 with each other. I don't know if Hadoop provides for synchronization between
 two map/reduce tasks. The tasks run simultaneously, and they need to access a
 common file - something like a map/reduce task at a higher level utilizing
 the data produced by the map/reduce at the lower level.

 Mithila



Re: connecting two clusters

2009-04-07 Thread Mithila Nagendra
Aaron,
We hope to achieve a level of pipelining between two clusters - similar to
how pipelining is done in executing RDB queries. You can look at it as the
producer-consumer problem: one cluster produces some data and the other
cluster consumes it. The issue that has to be dealt with here is the data
exchange between the clusters - synchronized interaction between the
map-reduce jobs on the two clusters is what I'm hoping to achieve.

Mithila

On Tue, Apr 7, 2009 at 10:10 AM, Aaron Kimball aa...@cloudera.com wrote:

 Clusters don't really have identities beyond the addresses of the NameNodes
 and JobTrackers. In the example below, nn1 and nn2 are the hostnames of the
 namenodes of the source and destination clusters. The 8020 in each address
 assumes that they're on the default port.

 Hadoop provides no inter-task or inter-job synchronization primitives, on
 purpose (even within a cluster, the most you get in terms of synchronization
 is the ability to join on the status of a running job to determine that
 it's completed). The model is designed to be as identity-independent as
 possible to make it more resilient to failure. If individual jobs/tasks
 could lock common resources, then the intermittent failure of tasks could
 easily cause deadlock.

 Using a file as a scoreboard or other communication mechanism between
 multiple jobs is not something Hadoop was designed for, and it is likely to
 end in frustration. Can you describe the goal you're trying to accomplish?
 It's likely that there's another, more MapReduce-y way of looking at the job
 and refactoring the code to make it work more cleanly with the intended
 programming model.

 - Aaron




Re: connecting two clusters

2009-04-07 Thread Aaron Kimball
Hi Mithila,

Unfortunately, Hadoop MapReduce jobs determine their inputs as soon as they
begin; the inputs for the job are then fixed. So additional files that
arrive in the input directory after processing has begun do not
participate in the job.

And HDFS does not currently support appends to files, so existing files
cannot be updated.

A typical way in which this sort of problem is handled is to do processing
in incremental wavefronts; process A generates some data which goes in an
incoming directory for process B; process B starts on a timer every so
often and collects the new input files and works on them. After it's done,
it moves the inputs it has processed into a done directory. In the meantime,
new files may have arrived. After another time interval, another
round of process B starts. The major limitation of this model is that it
requires that your process work incrementally, or that you are emitting a
small enough volume of data each time in process B that subsequent
iterations can load into memory a summary table of results from previous
iterations. Look into using the DistributedCache to disseminate such files.
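
For instance, one round of process B might look roughly like the sketch below
(the directory names are made up, and the identity mapper/reducer stand in
for your real processing):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class WavefrontRound {
  public static void runOneRound(Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    Path incoming = new Path("/incoming");   // where process A drops new files
    Path done = new Path("/done");           // where consumed inputs are parked

    FileStatus[] newFiles = fs.listStatus(incoming);
    if (newFiles == null || newFiles.length == 0) {
      return;                                // nothing arrived since last round
    }

    JobConf job = new JobConf(conf, WavefrontRound.class);
    job.setJobName("process-B-round");
    job.setMapperClass(IdentityMapper.class);     // real processing goes here
    job.setReducerClass(IdentityReducer.class);
    for (FileStatus f : newFiles) {
      FileInputFormat.addInputPath(job, f.getPath());
    }
    // A small summary file from a previous iteration could be shipped to
    // every task via the DistributedCache:
    DistributedCache.addCacheFile(new Path("/summary/previous").toUri(), job);
    FileOutputFormat.setOutputPath(job,
        new Path("/output/round-" + System.currentTimeMillis()));
    JobClient.runJob(job);                   // blocks until this round finishes

    // Only after the job succeeds, move the consumed inputs out of the way.
    for (FileStatus f : newFiles) {
      fs.rename(f.getPath(), new Path(done, f.getPath().getName()));
    }
  }
}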

Also, why are you using two MapReduce clusters for this, as opposed to one?
Is there a common HDFS cluster behind them?  You'll probably get much better
performance for the overall process if the output data from one job does not
need to be transferred to another cluster before it is further processed.

Does this model make sense?
- Aaron

On Tue, Apr 7, 2009 at 1:06 AM, Mithila Nagendra mnage...@asu.edu wrote:

 Aaron,
 We hope to achieve a level of pipelining between two clusters - similar to
 how pipelining is done in executing RDB queries. You can look at it as the
 producer-consumer problem: one cluster produces some data and the other
 cluster consumes it. The issue that has to be dealt with here is the data
 exchange between the clusters - synchronized interaction between the
 map-reduce jobs on the two clusters is what I'm hoping to achieve.

 Mithila




Re: connecting two clusters

2009-04-07 Thread Mithila Nagendra
Hello Aaron
Yes, it makes a lot of sense! Thank you! :)

The incremental wavefront model is another option we are looking at.
Currently we have two map/reduce levels; the upper level has to wait until
the lower map/reduce has produced the entire result set. We want to avoid
this... We were thinking of using two separate clusters so that these levels
can run on them - hoping to achieve better resource utilization. We were
hoping to connect the two clusters in some way so that the processes can
interact - but it seems like Hadoop is limited in that sense. I was
wondering how a common HDFS system can be set up for this purpose.

I tried looking for material on synchronization between two map-reduce
clusters - there is little to no information available on the Web! If we stick to
the incremental wavefront model, then we could probably work with one
cluster.

Mithila

On Tue, Apr 7, 2009 at 7:05 PM, Aaron Kimball aa...@cloudera.com wrote:

 Hi Mithila,

 Unfortunately, Hadoop MapReduce jobs determine their inputs as soon as they
 begin; the inputs for the job are then fixed. So additional files that
 arrive in the input directory after processing has begun do not
 participate in the job.

 And HDFS does not currently support appends to files, so existing files
 cannot be updated.

 A typical way in which this sort of problem is handled is to do processing
 in incremental wavefronts; process A generates some data which goes in an
 incoming directory for process B; process B starts on a timer every so
 often and collects the new input files and works on them. After it's done,
 it moves the inputs it has processed into a done directory. In the meantime,
 new files may have arrived. After another time interval, another
 round of process B starts. The major limitation of this model is that it
 requires that your process work incrementally, or that you are emitting a
 small enough volume of data each time in process B that subsequent
 iterations can load into memory a summary table of results from previous
 iterations. Look into using the DistributedCache to disseminate such files.

 Also, why are you using two MapReduce clusters for this, as opposed to one?
 Is there a common HDFS cluster behind them? You'll probably get much better
 performance for the overall process if the output data from one job does not
 need to be transferred to another cluster before it is further processed.

 Does this model make sense?
 - Aaron


Re: connecting two clusters

2009-04-07 Thread Sagar Naik

Hi,
I'm not sure if you have looked at this option, but instead of having two
HDFS instances, you can have one HDFS and two map-reduce clusters (both
pointing to the same HDFS) and then handle the sync mechanisms on top of that.
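
Something like this rough sketch, for example (nn1, jt1, jt2 and the port
numbers are placeholders for your actual namenode and jobtracker addresses):

import org.apache.hadoop.mapred.JobConf;

public class SharedHdfsJobs {
  // Both MapReduce clusters read and write the same HDFS (one namenode),
  // but each job is submitted to a different JobTracker.
  public static JobConf forProducerCluster() {
    JobConf conf = new JobConf();
    conf.set("fs.default.name", "hdfs://nn1:8020");   // the shared HDFS
    conf.set("mapred.job.tracker", "jt1:9001");       // cluster A's JobTracker
    return conf;
  }

  public static JobConf forConsumerCluster() {
    JobConf conf = new JobConf();
    conf.set("fs.default.name", "hdfs://nn1:8020");   // same shared HDFS
    conf.set("mapred.job.tracker", "jt2:9001");       // cluster B's JobTracker
    return conf;
  }
}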


-Sagar

Mithila Nagendra wrote:

Hello Aaron
Yes, it makes a lot of sense! Thank you! :)

The incremental wavefront model is another option we are looking at.
Currently we have two map/reduce levels; the upper level has to wait until
the lower map/reduce has produced the entire result set. We want to avoid
this... We were thinking of using two separate clusters so that these levels
can run on them - hoping to achieve better resource utilization. We were
hoping to connect the two clusters in some way so that the processes can
interact - but it seems like Hadoop is limited in that sense. I was
wondering how a common HDFS system can be set up for this purpose.

I tried looking for material on synchronization between two map-reduce
clusters - there is little to no information available on the Web! If we stick to
the incremental wavefront model, then we could probably work with one
cluster.

Mithila

connecting two clusters

2009-04-06 Thread Mithila Nagendra
Hey all
I'm trying to connect two separate Hadoop clusters. Is it possible to do so?
I need data to be shuttled back and forth between the two clusters. Any
suggestions?

Thank you!
Mithila Nagendra
Arizona State University


Re: connecting two clusters

2009-04-06 Thread Philip Zeyliger
DistCp is the standard way to copy data between clusters. It runs a MapReduce
job to copy data from a source cluster to a destination cluster. See
http://hadoop.apache.org/core/docs/r0.19.1/distcp.html

On Mon, Apr 6, 2009 at 9:49 PM, Mithila Nagendra mnage...@asu.edu wrote:

 Hey all
 I'm trying to connect two separate Hadoop clusters. Is it possible to do
 so?
 I need data to be shuttled back and forth between the two clusters. Any
 suggestions?

 Thank you!
 Mithila Nagendra
 Arizona State University



Re: connecting two clusters

2009-04-06 Thread Owen O'Malley


On Apr 6, 2009, at 9:49 PM, Mithila Nagendra wrote:


Hey all
I'm trying to connect two separate Hadoop clusters. Is it possible to do so?
I need data to be shuttled back and forth between the two clusters. Any
suggestions?


You should use hadoop distcp. It is a map/reduce program that copies data,
typically from one cluster to another. If you have the hftp interface
enabled, you can use that to copy between hdfs clusters that are different
versions.


hadoop distcp hftp://namenode1:1234/foo/bar hdfs://foo/bar

-- Owen


Re: connecting two clusters

2009-04-06 Thread Mithila Nagendra
Thanks! I was looking at the link sent by Philip. The copy is done with the
following command:
hadoop distcp hdfs://nn1:8020/foo/bar \
hdfs://nn2:8020/bar/foo

I was wondering if nn1 and nn2 are the names of the clusters or the names of
the masters on each cluster.

I wanted map/reduce tasks running on each of the two clusters to communicate
with each other. I don't know if Hadoop provides for synchronization between
two map/reduce tasks. The tasks run simultaneously, and they need to access a
common file - something like a map/reduce task at a higher level utilizing
the data produced by the map/reduce at the lower level.

Mithila

On Tue, Apr 7, 2009 at 7:57 AM, Owen O'Malley omal...@apache.org wrote:



 You should use hadoop distcp. It is a map/reduce program that copies data,
 typically from one cluster to another. If you have the hftp interface
 enabled, you can use that to copy between hdfs clusters that are different
 versions.

 hadoop distcp hftp://namenode1:1234/foo/bar hdfs://foo/bar

 -- Owen