[
https://issues.apache.org/jira/browse/HBASE-10070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875478#comment-13875478
]
Enis Soztutar commented on HBASE-10070:
---------------------------------------
Thanks Stack for looking into this. Valid feedback.
I think we can model the desired region model with something like this:
{code}
Region { table, startKey, endKey, regionId, encodedName}
RegionReplica {region, replicaId}
RegionReplicaState {offline, split}
RegionReplicaLocation {server, seqId}
RegionLineage { [splitDaughter], [mergeParent] }
{code}
I think the challenge is that currently HRI, the meta layout, and the
client-level APIs would need to be redesigned from scratch if we want to fully
switch to this model. Then, for example, table.getRegions() would return only
Regions, but table.getRegionReplicas() would return something different. Region
assignment and everything else would have to switch to being RegionReplica
based rather than HRI based.
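To make the distinction concrete, here is a minimal sketch of the Region / RegionReplica split and the two accessors. Everything in it is an assumption for illustration: Region and RegionReplica follow the pseudo-definitions above, while TableView, addRegion, and the fixed replication count are hypothetical and not existing HBase classes or client APIs. Keys are Strings rather than byte[] only to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch only: these are not existing HBase classes or client APIs.

/** Key-range identity of a region, shared by every replica of it. */
final class Region {
    final String table;
    final String startKey; // byte[] in practice; String keeps the sketch short
    final String endKey;
    final long regionId;

    Region(String table, String startKey, String endKey, long regionId) {
        this.table = table;
        this.startKey = startKey;
        this.endKey = endKey;
        this.regionId = regionId;
    }
}

/** One copy of a Region; the unit that would be assigned to a server. */
final class RegionReplica {
    final Region region;
    final int replicaId;

    RegionReplica(Region region, int replicaId) {
        this.region = region;
        this.replicaId = replicaId;
    }
}

/** Hypothetical table view exposing both levels of the model. */
final class TableView {
    private final List<Region> regions = new ArrayList<>();
    private final int replication; // assumed to be the same for every region

    TableView(int replication) {
        this.replication = replication;
    }

    void addRegion(Region region) {
        regions.add(region);
    }

    /** One entry per key range, regardless of how many replicas exist. */
    List<Region> getRegions() {
        return Collections.unmodifiableList(regions);
    }

    /** One entry per (region, replicaId) pair. */
    List<RegionReplica> getRegionReplicas() {
        List<RegionReplica> replicas = new ArrayList<>();
        for (Region region : regions) {
            for (int id = 0; id < replication; id++) {
                replicas.add(new RegionReplica(region, id));
            }
        }
        return replicas;
    }
}
```

With two regions and a replication count of 3, getRegions() has two entries and getRegionReplicas() has six, which is the difference callers would have to become aware of.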
If we keep HRI ~= Region + RegionReplicaState, then the region location APIs
and assignment have to be managed via RegionReplica objects, and the APIs have
to be overloaded (since there won't be HRI -> location anymore). Right now what
we have is HRI ~= RegionReplica + RegionState. I guess we can spend some time
to see whether this is possible without refactoring major portions of the code
base, but I fear the answer might be what my intuition says.
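As a rough illustration of what "overloaded" location APIs could look like under the first option, the sketch below keys locations by (region, replicaId) and keeps a region-only overload. The class name ReplicaLocations, its methods, and the choice to resolve the region-only overload to replica 0 are all assumptions, not existing HBase APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: class and method names are illustrative, not HBase APIs.
final class ReplicaLocations {
    // Keyed by "<encodedName>#<replicaId>" to avoid equals/hashCode boilerplate.
    private final Map<String, String> serverByReplica = new HashMap<>();

    private static String key(String encodedName, int replicaId) {
        return encodedName + "#" + replicaId;
    }

    void assign(String encodedName, int replicaId, String server) {
        serverByReplica.put(key(encodedName, replicaId), server);
    }

    /** Replica-aware lookup; returns null if that replica is not assigned. */
    String locate(String encodedName, int replicaId) {
        return serverByReplica.get(key(encodedName, replicaId));
    }

    /** Overload for region-only callers; assumed here to mean replica 0. */
    String locate(String encodedName) {
        return locate(encodedName, 0);
    }
}
```

The point of the sketch is only that a region alone no longer identifies a single location, so every lookup either names a replica or relies on a default.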
For the Paxos / quorum case, I think we can keep the special treatment of
replicaId == 0 => primary for now. If we later change the write model, then we
can have a leader definition, but the leader would not necessarily be the
replica with replicaId == 0. Even in that case, we have to differentiate
between servers hosting specific replicas, which still requires a static
replicaId or similar. The special case where we do not add the replicaId to the
string form of HRI exists to avoid requiring a meta + HDFS regioninfo rewrite.
I guess we can add it there, but add special case handling for parsing back.
Would that work?
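One possible reading of that special case, as a sketch: the replica id is appended only for non-primary replicas, and a name with no suffix parses back as replica 0, so names already stored in meta and HDFS stay valid. The "_<replicaId>" suffix, the ReplicaNames class, and both method names are hypothetical and chosen only for illustration; this is not HBase's actual region name format.

```java
// Hypothetical sketch: the "_<replicaId>" suffix is illustrative, not HBase's format.
final class ReplicaNames {
    static final int PRIMARY_REPLICA_ID = 0;

    /** Appends the replica id only for non-primary replicas. */
    static String format(String regionName, int replicaId) {
        if (replicaId == PRIMARY_REPLICA_ID) {
            return regionName; // unchanged, so existing stored names remain valid
        }
        return regionName + "_" + replicaId;
    }

    /** Parses the replica id back; a name without a numeric suffix is the primary. */
    static int parseReplicaId(String name) {
        int idx = name.lastIndexOf('_');
        if (idx < 0 || idx == name.length() - 1) {
            return PRIMARY_REPLICA_ID;
        }
        String suffix = name.substring(idx + 1);
        if (suffix.length() > 9) {
            return PRIMARY_REPLICA_ID; // too long to be a replica id in this sketch
        }
        for (int i = 0; i < suffix.length(); i++) {
            if (!Character.isDigit(suffix.charAt(i))) {
                return PRIMARY_REPLICA_ID;
            }
        }
        return Integer.parseInt(suffix);
    }
}
```

A real format would need a delimiter that cannot occur at the end of an existing region name; with this naive suffix, a legacy name that happens to end in "_<digits>" would be misread as a replica.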
> HBase read high-availability using eventually consistent region replicas
> ------------------------------------------------------------------------
>
> Key: HBASE-10070
> URL: https://issues.apache.org/jira/browse/HBASE-10070
> Project: HBase
> Issue Type: New Feature
> Reporter: Enis Soztutar
> Assignee: Enis Soztutar
> Attachments: HighAvailabilityDesignforreadsApachedoc.pdf
>
>
> In the present HBase architecture, it is hard, probably impossible, to
> satisfy constraints like 99th percentile of the reads will be served under 10
> ms. One of the major factors that affects this is the MTTR for regions. There
> are three phases in the MTTR process - detection, assignment, and recovery.
> Of these, the detection is usually the longest and is presently on the order
> of 20-30 seconds. During this time, the clients would not be able to read the
> region data.
> However, some clients would be better served if regions remained available
> for eventually consistent reads during recovery. This would help satisfy
> low-latency guarantees for the class of applications that can work with
> stale reads.
> To improve read availability, we propose a replicated read-only region
> serving design, also referred to as secondary regions or region shadows.
> Extending the current model of a region being opened for reads and writes in
> a single region server, the region will also be opened for reading in other
> region servers. The region server which hosts the region for reads and writes (as in
> current case) will be declared as PRIMARY, while 0 or more region servers
> might be hosting the region as SECONDARY. There may be more than one
> secondary (replica count > 2).
> Will attach a design doc shortly which contains most of the details and some
> thoughts about development approaches. Reviews are more than welcome.
> We also have a proof of concept patch, which includes the master and region
> server side changes. Client side changes will be coming soon as well.