Are these the good links for the Yahoo Benchmarks?
http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf
http://research.*yahoo*.com/files/ycsb.pdf

Kevin


On Sat, May 8, 2010 at 3:00 PM, Ryan Rawson <ryano...@gmail.com> wrote:

> hey,
>
> HBase currently uses region count to load balance.  Regions are
> assigned in a semi-randomish order to other regionservers.
>
> The paper is somewhat correct in that we are not moving data around
> aggressively, because then people would write in complaining we move
> data around too much :-)
>
> So a few notes, HBase is not a key-value store, its a tabluar data
> store, which maintains key order, and allows the easy construction of
> left-match key indexes.
>
> One other thing... if you are using a DHT (eg: cassandra), when a node
> fails the load moves to the other servers in the ring-segment. For
> example if you have N=3 and you lose a node in a segment, the load of
> a server would move to 2 other servers.  Your monitoring system should
> probably be tied into the DHT topology since if a second node fails in
> the same ring you probably want to take action. Ironically nodes in
> cassandra are special (unlike the publicly stated info) and they
> "belong" to a particular ring segment and cannot be used to store
> other data. There are tools to do node swap in, but you want your
> cluster management to be as automated as possible.
>
> Compared to a bigtable architecture, the load of a failed regionserver
> is evenly spread across the entire rest of the cluster.  No node has a
> special role in HDFS and HBase, any data can be hosted and served from
> any node.  As nodes fail, as long as you have enough nodes to serve
> the load you are in good shape. The HDFS missing block report lets you
> know when you have lost too many nodes. Nodes have no special role and
> can host and hold any data.
>
> In the future we want to add a load balancing based on
> requests/second.  We have all the requisite data and architecture, but
> other things are up more important right now.  Pure region count load
> balancing tends to work fairly well in practice.
>
> 2010/5/8 MauMau <maumau...@gmail.com>:
> > Hello,
> >
> > I got the following error when I sent the mail.
> >
> > Technical details of permanent failure:
> > Google tried to deliver your message, but it was rejected by the
> recipient
> > domain. We recommend contacting the other email provider for further
> > information about the cause of this error. The error that the other
> server
> > returned was: 552 552 spam score (5.2) exceeded threshold (state 18).
> >
> > The original mail might have been too long, so let me split it and send
> > again.
> >
> >
> > I'm comparing HBase and Cassandra, which I think are the most promising
> > distributed key-value stores, to determine which one to choose for the
> > future OLTP and data analysis.
> > I found the following benchmark report by Yahoo! Research which evalutes
> > HBase, Cassandra, PNUTS, and sharded MySQL.
> >
> > http://wiki.apache.org/hadoop/Hbase/DesignOverview
> >
> > The above report refers to HBase 0.20.3.
> > Reading this and HBase's documentation, two questions about load
> balancing
> > and replication have risen. Could anyone give me any information to help
> > solve these questions?
> >
> > [Q1] Load balancing
> > Does HBase move regions to a newly added region server (logically, not
> > physically on storage) immediately? If not immediately, what timing?
> > On what criteria does the master unassign and assign regions among region
> > servers? CPU load, read/write request rates, or just the number of
> regions
> > the region servers are handling?
> >
> > According the HBase design overview on the page below, the master
> monitors
> > the load of each region server and moves regions.
> >
> > http://wiki.apache.org/hadoop/Hbase/DesignOverview
> >
> > The related part is the following:
> >
> > ----------------------------------------
> > HMaster duties:
> > Assigning/unassigning regions to/from HRegionServers (unassigning is for
> > load balance)
> > Monitor the health and load of each HRegionServer
> > ...
> > If HMaster detects overloaded or low loaded H!RegionServer, it will
> unassign
> > (close) some regions from most loaded H!RegionServer. Unassigned regions
> > will be assigned to low loaded servers.
> > ----------------------------------------
> >
> > When I read the above, I thought that the master checks the load of
> region
> > servers periodically (once a few minutes or so) and performs load
> balancing.
> > And I thought that the master unassigns some regions from the existing
> > loaded region servers to a newly added one immediately when the new
> server
> > joins the cluster and contacts the master.
> > However, the benchmark report by Yahoo! Research describes as follows.
> This
> > says that HBase does not move regions until compaction, so I cannot get
> the
> > effect of adding new servers immediately even if I added the new server
> to
> > solve the overload problem.
> > What's the fact?
> >
> > ----------------------------------------
> > 6.7 Elastic Speedup
> > As the figure shows, the read latency spikes initially after
> > the sixth server is added, before the latency stabilizes at a
> > value slightly lower than the latency for five servers. This result
> > indicates that HBase is able to shift read and write load
> > to the new server, resulting in lower latency. HBase does
> > not move existing data to the new server until compactions
> > occur2. The result is less latency variance compared to Cassandra
> > since there is no repartitioning process competing
> > for the disk. However, the new server is underutilized, since
> > existing data is served off the old servers.
> > ...
> > 2 It is possible to run the HDFS load balancer to force data
> > to the new servers, but this greatly disrupts HBase’s ability
> > to serve data partitions from the same servers on which they
> > are stored.
> > ----------------------------------------
> >
> > MauMau
> >
> >
> >
>

Reply via email to