Hi Andy,

Thanks for the tip. I have an EC2 cluster with 6 nodes, each a server-grade
large instance, with the MapReduce daemons and region servers running on all
the nodes. Our deployment will not go beyond 20 nodes in the near future.
What would you suggest for us: scenario 1 or 2, as you mentioned?

On Thu, May 14, 2009 at 10:44 PM, Andrew Purtell <[email protected]> wrote:

> Hi Ninad,
>
> I think the answer depends on the anticipated scale of the deployment.
>
> For small clusters (up to a few racks, ~40 servers per rack) I don't think
> there is any significant performance hit to separating storage and
> computation. Presumably all servers will share the same large GigE switch
> -- or maybe a redundant L2 pair via bonded interfaces for failover -- or a
> few of them stacked with high-speed interconnects. This would relieve the
> storage nodes of RAM and CPU burden related to the computational tasks, as
> you are thinking, providing more headroom in exchange for a quite modest
> performance penalty. (However, if your computation load is so high that
> the nodes are overburdened and unstable, there is no alternative...) In
> the future this consideration might change if DFS clients are given some
> capability to find blocks on local disk via an optimized I/O path.
>
> In a large cluster there might well be significant performance impact. In
> a common deployment scenario there are rack-local switched fabrics and
> another switched fabric for uplinks from the racks. So a rack would have a
> switched GigE backplane or similar, but inter-rack connections might be
> single GigE uplinks -- a ~40-to-1 reduction in capacity worst case -- or
> maybe 10 GigE uplinks, a ~4-to-1 reduction. Therefore it would be
> desirable to distribute the computation into the racks where the data is
> located. When a region is deployed to a region server, the underlying
> blocks on DFS are not immediately migrated, but after a compaction -- a
> rewrite -- the underlying blocks will be available on rack-local
> datanodes, according to my understanding of how DFS places replicas upon
> write. So, after a split, daughter regions will have their blocks
> appropriately located in a timely manner. For the rest, I wonder if it
> would be beneficial to schedule major compactions more frequently than the
> 24-hour default for datacenter-scale deployments -- something like 8
> hours -- and you might also consider triggering a major compaction on
> important tables after cluster (re)init. Region deployment in a system in
> steady state should have relatively little churn, so this will have the
> effect of optimizing block placement for region store access.
>
> Submitted for your consideration,
>
>    - Andy
>
>
> ________________________________
> From: Ninad Raut <[email protected]>
> To: hbase-user <[email protected]>
> Cc: Ranjit Nair <[email protected]>
> Sent: Thursday, May 14, 2009 2:56:04 AM
> Subject: Keeping Compute Nodes separate from the region server nodes --
> pros and cons
>
> Hi,
> I want to get a design perspective here as to what the advantages would be
> of separating region servers from compute nodes (which run the MapReduce
> tasks).
> Will separating datanodes from compute nodes reduce the load on the
> servers and avoid swapping problems?
> Will this separation make MapReduce tasks less efficient, since we are
> giving up data locality?
> Regards,
> Ninad
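
To make the compaction tuning Andy suggests above concrete, here is a
minimal sketch against the 0.19/0.20-era HBase client API. It assumes a
release whose HBaseAdmin exposes majorCompact(); the table name is a
placeholder, and the interval would normally be set cluster-wide in
hbase-site.xml rather than from client code:

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class CompactionTuning {
    public static void main(String[] args) throws Exception {
      HBaseConfiguration conf = new HBaseConfiguration();

      // The major compaction interval is controlled by
      // hbase.hregion.majorcompaction, in milliseconds (default 24 hours).
      // Eight hours, per the suggestion above. This setting normally
      // belongs in hbase-site.xml so region servers pick it up; setting it
      // on a client-side Configuration is shown only for illustration.
      conf.setLong("hbase.hregion.majorcompaction", 8L * 60 * 60 * 1000);

      // Kick off a major compaction of an important table after cluster
      // (re)init, so store files are rewritten onto rack-local datanodes.
      HBaseAdmin admin = new HBaseAdmin(conf);
      admin.majorCompact("my_important_table"); // hypothetical table name
    }
  }

Newer shells also expose this interactively via a major_compact command, so
the post-(re)init trigger does not have to live in application code.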

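One caveat on the rack-local replica placement described above: HDFS only
places replicas rack-locally if it actually knows the topology, which it
does not by default. Below is a hypothetical sketch of a custom mapping
against the Hadoop DNSToSwitchMapping interface of that era; the hostname
convention is invented for illustration, and the simpler alternative is an
external script configured via topology.script.file.name.

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.net.DNSToSwitchMapping;

  // Register via the topology.node.switch.mapping.impl property so the
  // namenode can resolve datanodes to racks when placing replicas.
  public class SimpleRackMapping implements DNSToSwitchMapping {

    // Resolve each host to a rack path. Assumes hostnames like
    // "node-<rack>-<n>" -- a purely hypothetical convention.
    public List<String> resolve(List<String> names) {
      List<String> racks = new ArrayList<String>(names.size());
      for (String name : names) {
        String[] parts = name.split("-");
        racks.add(parts.length >= 2 ? "/rack-" + parts[1] : "/default-rack");
      }
      return racks;
    }
  }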