If your workload is only batch processing (MR), you don't need to separate the 
clusters in the first place. So, you don't have the problem of moving large 
amounts of data between clusters.
Having a common HDFS cluster and running HBase RegionServers (RS) on part of 
the nodes and Hadoop TaskTrackers (TTs) on the rest doesn't solve the problem 
of moving data from the HBase RS to the tasks you'll run as part of your MR 
jobs if HBase is your source/sink. You will still be reading/writing over the 
network.
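For example (just a sketch, with a made-up table name), an MR job that uses 
HBase as its source looks roughly like the RowCounter that ships with HBase: 
each map task pulls its rows from a RegionServer over RPC, so unless the task 
happens to be scheduled on that RS's node, every row crosses the network.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class HBaseSourceSketch {

  // Rows arrive from the RegionServer serving each region, not from local disk.
  static class RowCounterMapper
      extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable row, Result values, Context context)
        throws IOException {
      context.getCounter("sketch", "ROWS").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hbase-source-sketch");   // MRv1-era Job constructor
    job.setJarByClass(HBaseSourceSketch.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fewer RPC round trips per map task
    scan.setCacheBlocks(false);  // a full scan shouldn't churn the RS block cache

    TableMapReduceUtil.initTableMapperJob(
        "mytable",               // hypothetical table name
        scan,
        RowCounterMapper.class,
        ImmutableBytesWritable.class,
        Result.class,
        job);

    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}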

On the other hand, if your workload is 'realtime' random reads/writes, the 
amount of data you access per request is small and therefore not expensive to 
move. Moreover, it's going to be accessed from a client application of some 
sort that is not an MR job.
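The kind of access I mean is just the plain client API, e.g. (sketch only, 
hypothetical table/column names):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomAccessSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // picks up hbase-site.xml
    HTable table = new HTable(conf, "mytable");        // hypothetical table
    try {
      // Single-row write: one small RPC to the RegionServer hosting the row.
      Put put = new Put(Bytes.toBytes("user42"));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("lastLogin"),
          Bytes.toBytes(System.currentTimeMillis()));
      table.put(put);

      // Single-row read: again a targeted RPC, not a bulk transfer.
      Get get = new Get(Bytes.toBytes("user42"));
      Result result = table.get(get);
      System.out.println(Bytes.toLong(
          result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("lastLogin"))));
    } finally {
      table.close();
    }
  }
}

Each call touches one row, so the bytes that actually move between the client 
and the RS are tiny compared to a full-table MR scan.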


On Wednesday, June 6, 2012 at 12:23 PM, Atif Khan wrote:

> This is beginning to sound like a catch-22 problem. I think I personally
> would lean towards a single HDFS (high performing) cluster that can be
> shared between various types of applications (realtime vs analytics). Then
> control/balance resource requirements for each application. This would work
> for scenarios where I can predict the different types of
> applications/workloads beforehand. However, if for some reason the nature
> of the workload shifts, that could potentially throw off the whole resource
> equilibrium.
> 
> Are there any additional Hadoop specific monitoring tools that can be
> deployed to predict resource/performance bottlenecks in advance (in addition
> to regular BMC type tools)?
> 
> --
> View this message in context: 
> http://apache-hbase.679495.n3.nabble.com/Shared-HDFS-for-HBase-and-MapReduce-tp4018856p4018881.html
> Sent from the HBase - Developer mailing list archive at Nabble.com.
> 
> 

