Hi Ninad,

I think the answer depends on the anticipated scale of the deployment. 

For small clusters (up to a few racks, ~40 servers per rack) I don't think
there is any significant performance hit from separating storage and
computation. Presumably all servers will share the same large GigE switch --
or maybe a redundant L2 pair via bonded interfaces for failover, or a few
switches stacked with high-speed interconnects. This would relieve the storage
nodes of the RAM and CPU burden of the computational tasks, as you are
thinking, providing more headroom in exchange for a fairly modest performance
penalty. (And if your computational load is high enough that co-located nodes
are overburdened and unstable, there is no alternative anyway...) In the
future this consideration might change if DFS clients are given some
capability to find blocks on local disk via an optimized I/O path.

In a large cluster there may well be a significant performance impact. In a
common deployment scenario there is a rack-local switched fabric plus another
switched fabric for uplinks from the racks. So a rack would have a switched
GigE backplane or similar, but inter-rack connections might be single GigE
uplinks (with ~40 servers pushing GigE each, a ~40-to-1 reduction in capacity
worst case) or maybe 10 GigE uplinks (a ~4-to-1 reduction). Therefore it would
be desirable to distribute the computation into the racks where the data is
located.
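
For either DFS replica placement or MapReduce task scheduling to be rack
aware, the cluster has to be told its topology, usually via a script named by
topology.script.file.name, or by plugging a Java class in through
topology.node.switch.mapping.impl. A minimal sketch of the latter, assuming
the DNSToSwitchMapping interface in your Hadoop version and a made-up hostname
convention:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

// Maps hostnames to rack paths based on a naming convention, e.g.
// "dn-rack3-07" -> "/rack3". The convention is made up; adapt it to your own.
public class SimpleRackMapping implements DNSToSwitchMapping {
  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>(names.size());
    for (String name : names) {
      int i = name.indexOf("rack");
      if (i >= 0) {
        int end = name.indexOf('-', i);
        racks.add("/" + (end > i ? name.substring(i, end) : name.substring(i)));
      } else {
        racks.add("/default-rack");
      }
    }
    return racks;
  }
}

With the topology known, the namenode spreads replicas across racks and the
jobtracker prefers node-local and rack-local map tasks, which is what makes
moving the computation to the data pay off.
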
When a region is deployed to a region server, the underlying blocks on DFS are
not immediately migrated; but after a compaction -- a rewrite -- the underlying
blocks will be available on rack-local datanodes, according to my understanding
of how DFS places replicas on write (the first replica lands on the writer's
local datanode). So, after a split, the daughter regions will have their blocks
appropriately located in a timely manner.

Beyond that, I wonder if it would be beneficial to schedule major compactions
more frequently than the 24 hour default for datacenter-scale deployments,
something like every 8 hours, and you might also consider triggering a major
compaction on important tables after a cluster (re)init. Region deployment in a
system at steady state should have relatively little churn, so this will have
the effect of optimizing block placement for region store access.
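
To make that concrete: the major compaction interval is controlled by
hbase.hregion.majorcompaction (in milliseconds) in hbase-site.xml, so every 8
hours would be 28800000. For kicking off major compactions on important tables
after a (re)init, a small client along these lines would do it -- just a
sketch, assuming your HBase version's HBaseAdmin exposes majorCompact (table
names are placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Request a major compaction of a few important tables, e.g. right after a
// cluster (re)init. Table names are placeholders.
public class CompactAfterInit {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
    String[] importantTables = { "content", "index" };  // placeholders
    for (String table : importantTables) {
      // Asynchronous request; the region servers rewrite the store files
      // in the background.
      admin.majorCompact(table);
    }
  }
}

The same can be done from the hbase shell with major_compact 'tablename', if
your version has it.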

Submitted for your consideration,

    - Andy






________________________________
From: Ninad Raut <[email protected]>
To: hbase-user <[email protected]>
Cc: Ranjit Nair <[email protected]>
Sent: Thursday, May 14, 2009 2:56:04 AM
Subject: Keeping Compute Nodes separate from the region server node -- pros and
cons

Hi,
I want to get a design perspective here as to what the advantages would be of
separating region servers and compute nodes (to run MapReduce tasks).
Will separating datanodes from compute nodes reduce the load on the servers
and avoid swapping problems?
Will this separation make MapReduce tasks less efficient, since we are
giving up data locality?
Regards,
Ninad



      
