Andrew Kyle Purtell created HBASE-30139:
-------------------------------------------

             Summary: RAFT-Based Promotable Region Replicas
                 Key: HBASE-30139
                 URL: https://issues.apache.org/jira/browse/HBASE-30139
             Project: HBase
          Issue Type: New Feature
          Components: meta replicas, read replicas
            Reporter: Andrew Kyle Purtell
            Assignee: Andrew Kyle Purtell


HBase supports configuring a table with multiple region replicas. When a table 
has replicas, each region exists as a primary copy and one or more read-only 
copies hosted on different RegionServers. The primary handles all client writes 
and serves the default read path. The read-only replicas are opened on other 
RegionServers, share the primary's HFiles on HDFS, and receive memstore 
updates through an asynchronous replication pipeline. Clients may read from 
replicas using timeline-consistent reads. Replicas cannot accept writes and 
cannot be promoted to primary. This model improves read availability for 
stale-data-tolerant workloads, but it does nothing for write availability or 
fast failover. When the primary's RegionServer dies, the region becomes 
unavailable for writes. Read-only replicas can still serve timeline-consistent 
reads, but with increasingly stale data. Replicas can be arbitrarily far behind 
the primary, so even their stale-read utility degrades under replication lag. 
There is no protocol to determine which replica is most current or to 
coordinate a handoff.
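
For reference, this is how the existing replica model is exercised with the 
standard HBase 2.x client API. A minimal sketch: the {{admin}} and {{table}} 
handles are assumed to already be open, and the table and family names are 
illustrative.

{code:java}
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

// Create a table with region replication 3: one primary plus two
// read-only replicas per region.
TableDescriptor td = TableDescriptorBuilder
    .newBuilder(TableName.valueOf("t1"))
    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("f"))
    .setRegionReplication(3)
    .build();
admin.createTable(td);

// Opt in to timeline-consistent reads. The get may be served by any
// replica; the result is flagged stale if it did not come from the primary.
Get get = new Get(Bytes.toBytes("row1"));
get.setConsistency(Consistency.TIMELINE);
Result result = table.get(get);
if (result.isStale()) {
  // Served by a secondary; the data may lag the primary by an unbounded amount.
}
{code}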


This design replaces the asynchronous WAL replication pipeline with RAFT 
consensus groups at the region level. Each set of replicas for a region forms a 
RAFT group. The primary region acts as the RAFT leader, and the read-only 
replica regions act as RAFT followers. The leader replicates edits 
synchronously through RAFT to keep follower memstores warm and consistent, 
replacing the best-effort async pipeline with an ordered, majority-committed 
log.
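
To make the shape of this concrete, here is a hypothetical sketch of the 
leader-side write path. None of these interfaces exist in HBase today; 
{{RegionRaftGroup}}, {{propose}}, and the other names are illustrative only.

{code:java}
import java.util.concurrent.CompletableFuture;
import org.apache.hadoop.hbase.wal.WALEdit;

// Hypothetical per-region consensus handle; names are illustrative.
interface RegionRaftGroup {
  /**
   * Propose a batch of edits to the group's RAFT log. The future completes
   * once a majority of replicas (leader included) have durably appended the
   * entry, i.e. once it is committed and safe to apply to memstores.
   */
  CompletableFuture<Long> propose(WALEdit edit);

  /** True while this replica holds RAFT leadership for the region. */
  boolean isLeader();

  /** The current RAFT term, used by the master to validate promotions. */
  long currentTerm();
}
{code}

In this sketch the primary blocks on {{propose}} before acknowledging a write 
to the client, and followers apply committed entries in log order, so follower 
memstores are never more than the uncommitted tail behind the leader.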

The key improvement is {*}promotability{*}. When the primary fails, the 
surviving followers already hold a warm, consistent memstore. They elect a new 
RAFT leader among themselves, and the elected leader reports the election 
result to the master. The master's AssignmentManager remains the sole arbiter 
of which replica is primary. It validates the RAFT election term, updates META 
to record the new primary location, and returns confirmation to the 
RegionServer. Only after receiving this confirmation does the promoted replica 
complete its local state transitions and begin serving writes. Even with this 
round trip to the master, promotion is fast compared to today's WAL-mediated 
recovery pathway: there is no WAL splitting and no recovered-edits replay. 
Failover completes in sub-second to low single-digit seconds.
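
The promotion handshake might look like the following. This is a hypothetical 
sketch of the sequencing only: {{PromoteReplicaResponse}}, {{masterRpc}}, 
{{reportReplicaPromotion}}, {{stepDown}}, and {{transitionToPrimary}} are all 
invented names, not an existing API.

{code:java}
// Hypothetical failover sequence on a follower that has just won a
// RAFT election.
void onElectedLeader(long newTerm) throws IOException {
  // 1. Report the election result to the master. The AssignmentManager
  //    remains the sole arbiter of which replica is primary.
  PromoteReplicaResponse resp =
      masterRpc.reportReplicaPromotion(regionInfo, newTerm);

  // 2. The master validates the reported term, records the new primary
  //    location in META, and confirms (or rejects) the promotion.
  if (!resp.isConfirmed()) {
    stepDown(); // stale or competing election; remain a follower
    return;
  }

  // 3. Only after confirmation does the replica complete its local state
  //    transition and open for writes. No WAL splitting, no recovered-edits
  //    replay: the memstore is already warm and consistent up to the RAFT
  //    commit index.
  transitionToPrimary();
}
{code}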

Some of you may remember Facebook's ancient "HydraBase". This is NOT HydraBase 
redux and does not repeat its design errors.

Design document: 
[https://github.com/apurtell/hbase/blob/WORK-raft-replicas/RAFT_REGION_REPLICAS.md]

{{hbase-consensus}} proof-of-concept: 
[https://github.com/apurtell/hbase/blob/WORK-raft-replicas/hbase-consensus/] 

Currently this is at the "science project" stage. When that changes I will 
update this part of the summary with strikethrough.


