Thank you, Ryan.
I look forward to the future advancement in the load balancing strategy:
distributing "hot" regions among lightly loaded region servers based on
request count.
Thanks for the note that HBase is not a key-value store. Yes, some people
say the same as you, while others say that HBase is one of the key-value
stores. I think they call HBase a key-value store in the sense that
users access data by keys (row key, column key).
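For example, reading a cell by row key and column key with the HBase
client API looks roughly like this (a minimal sketch; the table name
"mytable" and the column names are made up, and details vary by HBase
version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetByKey {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable"); // "mytable" is a made-up name
        Get get = new Get(Bytes.toBytes("row1"));   // row key
        get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("qual")); // column key
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("qual"));
        System.out.println(Bytes.toString(value));
        table.close();
    }
}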
> the same ring you probably want to take action. Ironically nodes in
> cassandra are special (unlike the publicly stated info) and they
> "belong" to a particular ring segment and cannot be used to store
> other data. There are tools to swap a node in, but you want your
Do you mean "preference list" by "ring segment"? If all nodes in the
preference list for particular data get down before another node takes over
the data, the data cannot be read. But I thought writes can be accepted by
the mechanism called hinted handoff.
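As I understand it, hinted handoff means a write destined for a down
replica is stored on another node together with a hint naming the
intended replica, and handed back when that replica recovers. A toy
sketch of the idea (Node and its methods are made-up names, not
Cassandra's actual code):

import java.util.Iterator;
import java.util.List;

// Toy sketch of hinted handoff; not Cassandra's actual implementation.
interface Node {
    boolean isAlive();
    void write(byte[] key, byte[] value);
    // Store data on behalf of a down replica, to hand back on recovery.
    void storeHint(Node intendedReplica, byte[] key, byte[] value);
}

class HintedHandoffSketch {
    static void write(byte[] key, byte[] value,
                      List<Node> preferenceList, List<Node> fallbacks) {
        Iterator<Node> spares = fallbacks.iterator();
        for (Node replica : preferenceList) {
            if (replica.isAlive()) {
                replica.write(key, value);          // normal replica write
            } else if (spares.hasNext()) {
                // The write is still accepted even though a replica is down.
                spares.next().storeHint(replica, key, value);
            }
        }
    }
}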
However, the probability that all nodes in the preference list crash at the
same time is very low; with three replicas and an independent per-node
failure probability p, it is roughly p^3. The same holds for HBase: if all
the data nodes holding particular HBase data go down, that data cannot be
accessed.
The original concern is whether I can answer "yes" in the following
situation:
Customer: The system is slow. How can we improve the responsiveness?
Me: The response time of the data store IS high, and all region servers are
busy. You can improve the situation by adding some region servers.
Customer: Is just adding new servers OK? Can we see the improvement soon?
Maumau
----- Original Message -----
From: "Ryan Rawson" <ryano...@gmail.com>
To: <hbase-user@hadoop.apache.org>
Sent: Saturday, May 08, 2010 6:30 PM
Subject: Re: How does HBase perform load balancing?
hey,
HBase currently uses region count to load balance. Regions are
assigned in a semi-randomish order to other regionservers.
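(A toy sketch of region-count balancing with made-up names, not HBase's
actual balancer code: take regions off servers above the average count
and hand them out in a shuffled order to servers below it.)

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy sketch of region-count balancing; not HBase's actual balancer.
class Region {}

class RegionServer {
    final List<Region> regions = new ArrayList<Region>();
}

class CountBalancerSketch {
    static void balance(List<RegionServer> servers) {
        int total = 0;
        for (RegionServer s : servers) total += s.regions.size();
        int avg = (int) Math.ceil((double) total / servers.size());

        // Collect regions from servers holding more than the average...
        List<Region> toMove = new ArrayList<Region>();
        for (RegionServer s : servers) {
            while (s.regions.size() > avg) {
                toMove.add(s.regions.remove(s.regions.size() - 1));
            }
        }
        // ...and reassign them in a shuffled ("semi-randomish") order
        // to servers holding fewer than the average.
        Collections.shuffle(toMove);
        for (RegionServer s : servers) {
            while (s.regions.size() < avg && !toMove.isEmpty()) {
                s.regions.add(toMove.remove(toMove.size() - 1));
            }
        }
    }
}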
The paper is somewhat correct in that we are not moving data around
aggressively, because then people would write in complaining we move
data around too much :-)
So a few notes: HBase is not a key-value store, it's a tabular data
store, which maintains key order and allows the easy construction of
left-match key indexes.
One other thing... if you are using a DHT (e.g. cassandra), when a node
fails the load moves to the other servers in the ring segment. For
example, if you have N=3 and you lose a node in a segment, the load of
that server moves to the 2 other servers. Your monitoring system should
probably be tied into the DHT topology since if a second node fails in
the same ring you probably want to take action. Ironically nodes in
cassandra are special (unlike the publicly stated info) and they
"belong" to a particular ring segment and cannot be used to store
other data. There are tools to swap a node in, but you want your
cluster management to be as automated as possible.
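(To illustrate the ring-segment point with a toy consistent-hash ring;
made-up names, not Cassandra's code. With N=3 each key is held by the
three nodes following its position on the ring, so a failed node's load
lands only on its segment neighbors.)

import java.util.Arrays;
import java.util.Iterator;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy consistent-hash ring with N=3 replication; made-up names,
// not Cassandra's actual implementation.
class RingSketch {
    private final SortedMap<Integer, String> ring = new TreeMap<Integer, String>();
    private static final int N = 3; // replication factor

    void addNode(String node, int token) { ring.put(token, node); }
    void removeNode(int token)           { ring.remove(token); }

    // A key's replicas are the N nodes at or after its position on the ring.
    String[] replicasFor(int keyHash) {
        String[] replicas = new String[N];
        Iterator<String> it = ring.tailMap(keyHash).values().iterator();
        for (int i = 0; i < N; i++) {
            if (!it.hasNext()) it = ring.values().iterator(); // wrap around
            replicas[i] = it.next();
        }
        return replicas;
    }

    public static void main(String[] args) {
        RingSketch ring = new RingSketch();
        ring.addNode("A", 0);   ring.addNode("B", 100);
        ring.addNode("C", 200); ring.addNode("D", 300);
        System.out.println(Arrays.toString(ring.replicasFor(50))); // [B, C, D]
        ring.removeNode(100);   // node B fails
        System.out.println(Arrays.toString(ring.replicasFor(50))); // [C, D, A]
        // B's load moved only to the other nodes of its ring segment.
    }
}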
In a bigtable architecture, by comparison, the load of a failed
regionserver is evenly spread across the entire rest of the cluster. No
node has a special role in HDFS and HBase; any data can be hosted and
served from any node. As nodes fail, as long as you have enough nodes to
serve the load you are in good shape, and the HDFS missing block report
lets you know when you have lost too many nodes.
In the future we want to add load balancing based on
requests/second. We have all the requisite data and architecture, but
other things are more important right now. Pure region-count load
balancing tends to work fairly well in practice.
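(A toy sketch of what requests/second balancing could look like, with
made-up names; just the idea, not HBase code: pick donors by request
rate instead of region count.)

import java.util.List;

// Toy sketch of requests/second balancing; made-up names, not HBase code.
class ServerLoad {
    double requestsPerSecond;
}

class RateBalancerSketch {
    // Shift load from the busiest server toward the idlest one when the
    // gap is large; a real balancer would pick individual hot regions.
    static void balance(List<ServerLoad> servers) {
        ServerLoad busiest = servers.get(0), idlest = servers.get(0);
        for (ServerLoad s : servers) {
            if (s.requestsPerSecond > busiest.requestsPerSecond) busiest = s;
            if (s.requestsPerSecond < idlest.requestsPerSecond) idlest = s;
        }
        if (busiest.requestsPerSecond > 2 * idlest.requestsPerSecond) {
            moveHottestRegion(busiest, idlest); // hypothetical helper
        }
    }

    static void moveHottestRegion(ServerLoad from, ServerLoad to) {
        // Placeholder for this sketch: reassign the most-requested
        // region from 'from' to 'to'.
    }
}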