Hi Andy,
I am using an EC2 cluster with large server-grade machines. Hence, availability cannot be determined, as the cluster nodes can change IP over time. Yes, I am interested in replication. What would be an ideal design in this case?

On Fri, May 15, 2009 at 9:33 PM, Andrew Purtell <[email protected]> wrote:

> Hi Ninad,
>
> I think scenario 1 is fine for your case, < 20 nodes up on EC2.
>
> Are you planning to deploy Hadoop+HBase clusters in more than one
> availability zone? Interested in or implementing replication between?
>
> Best regards,
>
> - Andy
>
> ________________________________
> From: Ninad Raut <[email protected]>
> To: [email protected]
> Cc: Ranjit Nair <[email protected]>
> Sent: Thursday, May 14, 2009 11:30:05 PM
> Subject: Re: Keeping Compute Nodes seperate from the region server node-- pros and cons
>
> Hi Andy,
> Thanks for the tip.
> I have a EC2 cluster with 6 nodes. Each a server grade large instance. I
> have the mapred & regionservers running on all the nodes. Our deployment
> will not go beyond 20 clusters in the near future. What would you suggest
> me to have? Scenario 1 or 2 as u mentioned ?
>
> On Thu, May 14, 2009 at 10:44 PM, Andrew Purtell <[email protected]> wrote:
>
> > Hi Ninad,
> >
> > I think the answer depends on the anticipated scale of the deployment.
> >
> > For small clusters (up to a few racks, ~40 servers per rack) I don't
> > think there is any significant performance hit to separate storage and
> > computation. Presumably all servers will share the same large GigE
> > switch -- or maybe a redundant L2 pair via bonded interfaces for fail
> > over -- or a few of them stacked with high speed interconnects. This
> > would relieve the storage nodes of RAM and CPU burden related to the
> > computational tasks as you are thinking, providing more headroom in
> > exchange for some quite modest performance penalty. (However, if your
> > computation load is high and therefore the nodes are overburdened and
> > are not stable, there is no alternative...) In the future this
> > consideration might change if DFS clients are given some capability to
> > find blocks on local disk via some optimized I/O path.
> >
> > In a large cluster there might well be significant performance impact.
> > In a common deployment scenario, there are rack-local switched fabrics
> > and another switched fabric for uplinks from the racks. So, a rack
> > would have a switched GigE backplane or similar, but inter-rack
> > connections might be single GigE uplinks, a ~40-to-1 reduction in
> > capacity worst case; or maybe 10 GigE uplinks, a ~10-1 reduction.
> > Therefore it would be desirable to distribute the computation into the
> > racks where the data is located. When a region is deployed to a region
> > server the underlying blocks on DFS are not immediately migrated, but
> > always after a compaction -- a rewrite -- the underlying blocks will be
> > available on rack local data nodes, according to my understanding of
> > how DFS places replicas upon write. So, after a split, daughter regions
> > will have their blocks appropriately located in a timely manner. For
> > the rest I wonder if it would be beneficial to consider scheduling
> > major compaction more frequently than the 24 hour default for
> > datacenter scale deployments, something like 8 hours, and you might
> > also consider triggering a major compaction on important tables after
> > cluster (re)init. Region deployment in a system in steady state should
> > have relatively little churn so this will have the effect of optimizing
> > block placement for region store access.
> >
> > Submitted for your consideration,
> >
> > - Andy
> >
> > ________________________________
> > From: Ninad Raut <[email protected]>
> > To: hbase-user <[email protected]>
> > Cc: Ranjit Nair <[email protected]>
> > Sent: Thursday, May 14, 2009 2:56:04 AM
> > Subject: Keeping Compute Nodes seperate from the region server node-- pros and cons
> >
> > Hi,
> > I want to get a design perspective here as to what will be the
> > advantages of seperating region servers and compute node(to run
> > mapreduce tasks)
> > Will seperating datanodes from computes node reduce the load on the
> > servers and avoid swapping problems?
> > Will this seperation make map reduce tasks less efficient , since we
> > are doing away with localization issues?
> > Regards,
> > Ninad
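
For reference, the rack-uplink and compaction-interval figures quoted in the thread can be checked with a little arithmetic. The following is only a rough sketch: the function names are hypothetical, the inputs (40 servers per rack, 1 Gb/s per server link, one uplink per rack) are assumptions taken from the discussion above, and the millisecond conversion assumes the major compaction interval is set in milliseconds through the hbase.hregion.majorcompaction property.

```python
def oversubscription(servers_per_rack, server_link_gbps, uplink_gbps, uplinks=1):
    """Ratio of aggregate in-rack server bandwidth to inter-rack uplink bandwidth.

    Hypothetical helper: all inputs are illustrative assumptions, not measurements.
    """
    return (servers_per_rack * server_link_gbps) / (uplink_gbps * uplinks)


def hours_to_millis(hours):
    """Convert an interval in hours to milliseconds."""
    return int(hours * 60 * 60 * 1000)


if __name__ == "__main__":
    # 40 servers on GigE sharing a single GigE uplink (worst case in the thread).
    print("single GigE uplink:   ", oversubscription(40, 1, 1))
    # The same rack sharing a single 10 GigE uplink.
    print("single 10 GigE uplink:", oversubscription(40, 1, 10))
    # The 24 hour default and the suggested 8 hour interval, in milliseconds.
    print("24 hours in ms:", hours_to_millis(24))
    print("8 hours in ms: ", hours_to_millis(8))
```

With these inputs the single GigE uplink case matches the ~40-to-1 figure in the thread. A single 10 GigE uplink works out closer to 4-to-1, so the ~10-1 figure presumably assumes a different rack size or uplink layout.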
