[
https://issues.apache.org/jira/browse/HBASE-10070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849978#comment-13849978
]
Enis Soztutar commented on HBASE-10070:
---------------------------------------
Update: as discussed, we have put up a proof-of-concept implementation of a
working end-to-end scenario, and would like to share it to get some early
reviews and feedback. If you are interested in the technical side of the
changes, please check out the patch/branch. Please note that the patches and
the branch are far from clean and complete, but they are clean enough to
convey the scope of the changes and the areas that are touched. This also
contains the end-to-end client-side APIs (except for execution policies). We
will continue to work on the patches to get them into a more mature state, and
will recreate and clean them up for review, but reviews and comments are
welcome at any stage. We will keep pushing changes to this repo/branch until
the patches are in a more stable state, at which point we will clean up and
reshuffle the patches to be more consumable for review.
The code is in the GitHub repo https://github.com/enis/hbase.git, on the
branch hbase-10070-demo. This repository is based on 0.96.0 for now. I'll also
attach a patch which contains all the changes if you want to take a closer
look.
This can be built with:
{code}
git clone [email protected]:enis/hbase.git
cd hbase
git checkout hbase-10070-demo
MAVEN_OPTS="-Xmx2g" mvn clean install package assembly:single \
  -Dhadoop.profile=2.0 -DskipTests -Dmaven.javadoc.skip=true \
  -Dhadoop-two.version=2.2.0
{code}
The generated tarball will be hbase-assembly/target/hbase-0.96.0-bin.tar.gz.
The Hadoop version that should be used for real cluster testing is 2.2.0.
What's in the repository:
1. Client (Shell) changes
The shell has been modified so that tables with more than one replica per
region can be created:
create 't1', 'f1', {REGION_REPLICATION => 2}
One can also 'describe' a table, and the response string will include the
replica configuration:
describe 't1'
One can do a 'get' with the eventual-consistency flag set (we haven't
implemented the consistency semantics for the 'scan' family in this drop):
get 't2','r6',{"EVENTUAL_CONSISTENCY" => true}
[Note the quotes around the EVENTUAL_CONSISTENCY string. We will fix this soon
so it works without the quotes.]
Outside the shell, the API for declaring a willingness to tolerate eventual
consistency is Get.setConsistency, and the returned Result can be checked for
staleness via Result.isStale.
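To illustrate the intended client-side semantics, here is a small, self-contained Java model. This is not the actual HBase API: the Consistency enum values, the fallback logic, and all class names below are hypothetical stand-ins for Get.setConsistency and Result.isStale. The idea it sketches: a read is first served by the primary; if the primary is unavailable and the caller has opted into eventual consistency, a secondary replica serves the read and the result is flagged as stale.

```java
// Hypothetical model of the client-side semantics, NOT the actual HBase API.
import java.util.Optional;

public class EventualGetModel {
    // Stand-in for the value passed to Get.setConsistency (assumed names).
    enum Consistency { STRONG, EVENTUAL }

    // Stand-in for Result; 'stale' models Result.isStale.
    record Result(String value, boolean stale) {}

    static Optional<Result> get(String row, Consistency consistency,
                                boolean primaryUp, String replicaValue) {
        if (primaryUp) {
            // Primary is healthy: read is fresh regardless of the flag.
            return Optional.of(new Result(replicaValue, false));
        }
        if (consistency == Consistency.EVENTUAL) {
            // Primary is down but the caller tolerates staleness:
            // a secondary serves a possibly-stale value, flagged as stale.
            return Optional.of(new Result(replicaValue, true));
        }
        // Strong read fails while the primary is unavailable.
        return Optional.empty();
    }

    public static void main(String[] args) {
        Result fresh = get("r6", Consistency.STRONG, true, "v1").orElseThrow();
        System.out.println(fresh.stale());

        Result maybeStale = get("r6", Consistency.EVENTUAL, false, "v1").orElseThrow();
        System.out.println(maybeStale.stale());
    }
}
```

The key contract is that staleness is surfaced to the caller via the result rather than hidden, so applications decide for themselves whether a stale value is acceptable.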
2. Master changes
The one main change here is the creation and management of replica
HRegionInfo objects. The other change is to make the StochasticLoadBalancer
aware of the replicas. During the assignment process, the AssignmentManager
consults the balancer for an assignment plan; here the StochasticLoadBalancer
ensures that the plan takes into account the constraint that a primary and its
secondaries are not assigned to the same server, nor to the same rack (if more
than one rack is configured).
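The placement constraint above can be sketched as a simple validity check. This is a hypothetical model, not the actual StochasticLoadBalancer cost-function code; ServerName here is a minimal stand-in for HBase's server identity.

```java
// Illustrative check of the replica placement constraint, not HBase code.
import java.util.List;

public class ReplicaPlacementCheck {
    // Minimal stand-in for a region server's identity and rack.
    record ServerName(String host, String rack) {}

    // Returns true iff no two replicas of a region share a server, and,
    // when more than one rack is configured, no two share a rack either.
    static boolean isValidPlacement(List<ServerName> replicas, int rackCount) {
        for (int i = 0; i < replicas.size(); i++) {
            for (int j = i + 1; j < replicas.size(); j++) {
                if (replicas.get(i).host().equals(replicas.get(j).host())) {
                    return false; // primary and secondary on the same server
                }
                if (rackCount > 1
                        && replicas.get(i).rack().equals(replicas.get(j).rack())) {
                    return false; // same rack, and multiple racks exist
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        var ok = List.of(new ServerName("rs1", "rackA"), new ServerName("rs2", "rackB"));
        var sameHost = List.of(new ServerName("rs1", "rackA"), new ServerName("rs1", "rackA"));
        System.out.println(isValidPlacement(ok, 2));
        System.out.println(isValidPlacement(sameHost, 2));
    }
}
```

Note the rack condition is only enforced when more than one rack is configured, matching the constraint as described above: on a single-rack cluster, replicas on distinct servers are still a valid placement.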
3. RegionServer changes
The one main change here is the ability to open regions in read-only mode. The
other change is the periodic refresh of store files. The configuration that
sets this up is (disabled by default):
{code}
<property>
  <name>hbase.regionserver.storefile.refresh.period</name>
  <value>2000</value>
</property>
{code}
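A rough sketch of what such a periodic refresh chore could look like, illustrative only and not HBase's actual implementation (all class and method names here are made up): every refresh period, a read-only region re-lists its store directory and atomically swaps in the new view, so files flushed by the primary become visible to reads on the secondary.

```java
// Illustrative sketch of a periodic store-file refresh, not HBase code.
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class StoreFileRefresher {
    // Current read view of the region's store files; volatile so readers
    // always see a complete, consistent snapshot.
    private volatile List<String> storeFiles = List.of();

    void refresh(List<String> filesOnDisk) {
        storeFiles = List.copyOf(filesOnDisk); // atomic swap of the view
    }

    List<String> currentFiles() {
        return storeFiles;
    }

    // Stand-in for listing the region's store directory on the filesystem.
    static List<String> listStoreDir() {
        return List.of("hfile-0001", "hfile-0002");
    }

    public static void main(String[] args) throws Exception {
        StoreFileRefresher secondary = new StoreFileRefresher();
        ScheduledExecutorService chore = Executors.newSingleThreadScheduledExecutor();
        // 2000 ms period, mirroring hbase.regionserver.storefile.refresh.period.
        chore.scheduleAtFixedRate(
            () -> secondary.refresh(listStoreDir()), 0, 2000, TimeUnit.MILLISECONDS);
        Thread.sleep(100);
        System.out.println(secondary.currentFiles());
        chore.shutdownNow();
    }
}
```

This is also why reads from a secondary are eventually consistent: until the next refresh tick, the secondary serves data only from the store files it last picked up.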
4. UI changes
The UIs corresponding to the tables' status and the regions' status have been
modified to indicate whether they have replicas.
There are unit tests - TestMasterReplicaRegions, TestRegionReplicas,
TestReplicasClient, TestBaseLoadBalancer, and some others.
There is also a manual test scenario to test out reads coming from the
secondary replica:
1. create an (at least) two-node cluster.
2. create a table with a replica count of 2. From the HBase shell:
create 't1', 'f1', {REGION_REPLICATION => 2}
3. arrange the regions so that the primary region is not co-located with the
meta or namespace regions. You can use move commands in the HBase shell for
that:
move '4392870ae8ef482406c272eec0312a02', '192.168.0.106,60020,1387069812919'
4. from the shell, do a couple of puts and then 'flush' the table:
{code}
hbase(main):005:0> for i in 1..100
hbase(main):006:1> put 't1', "r#{i}", 'f1:c1', i
hbase(main):007:1> end
hbase(main):009:0> flush 't1'
{code}
5. suspend the region server hosting the primary region replica by sending a
kill -STOP signal from bash:
{code}
kill -STOP <pid_of_region_server>
{code}
6. get a row from the table with the eventual-consistency flag set to true:
{code}
get 't1','r6',{"EVENTUAL_CONSISTENCY" => true}
{code}
7. a put to the table should fail.
Steps (6) and (7) should be done quickly enough, otherwise the master will
recover the region! (The default ZK session timeout is 90 seconds.)
> HBase read high-availability using eventually consistent region replicas
> ------------------------------------------------------------------------
>
> Key: HBASE-10070
> URL: https://issues.apache.org/jira/browse/HBASE-10070
> Project: HBase
> Issue Type: New Feature
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
> Attachments: HighAvailabilityDesignforreadsApachedoc.pdf
>
>
> In the present HBase architecture, it is hard, probably impossible, to
> satisfy constraints like the 99th percentile of reads being served under 10
> ms. One of the major factors affecting this is the MTTR for regions. There
> are three phases in the MTTR process - detection, assignment, and recovery.
> Of these, detection is usually the longest and is presently on the order
> of 20-30 seconds. During this time, clients are not able to read the
> region data.
> However, some clients will be better served if regions are available for
> eventually consistent reads during recovery. This will help satisfy
> low-latency guarantees for the class of applications that can work with
> stale reads.
> For improving read availability, we propose a replicated read-only region
> serving design, also referred to as secondary regions, or region shadows.
> Extending the current model of a region being opened for reads and writes in
> a single region server, the region will also be opened for reading in other
> region servers. The region server which hosts the region for reads and
> writes (as in the current case) will be declared PRIMARY, while 0 or more
> region servers might host the region as SECONDARY. There may be more than
> one secondary (replica count > 2).
> Will attach a design doc shortly which contains most of the details and some
> thoughts about development approaches. Reviews are more than welcome.
> We also have a proof of concept patch, which includes the master and regions
> server side of changes. Client side changes will be coming soon as well.
--
This message was sent by Atlassian JIRA
(v6.1.4#6159)