Andrew Kyle Purtell created HBASE-24439:
-------------------------------------------

             Summary: Replication queue recovery tool for rescuing deep queues
                 Key: HBASE-24439
                 URL: https://issues.apache.org/jira/browse/HBASE-24439
             Project: HBase
          Issue Type: Improvement
          Components: Replication
            Reporter: Andrew Kyle Purtell


In HBase cross site replication, on the source side, every regionserver places 
its WALs into a replication queue and then drains the queue to the remote sink 
cluster. At the source cluster every regionserver participates as a source. At 
the sink cluster, a configurable subset of regionservers volunteer to process 
inbound replication RPC. 

When data is highly skewed we can take certain steps to mitigate, such as 
pre-splitting, or manual splitting, and rebalancing. This can most effectively 
be done at the sink, because replication RPCs are randomly distributed over the 
set of receiving regionservers, and splitting on the sink side can effectively 
redistribute resulting writes there. On the source side we are more limited. 

If writes are deeply unbalanced, a regionserver's source replication queue may 
become very deep. Hotspotting can happen, despite mitigations. Unlike on the 
sink side, once hotspotting has happened at the source, it is not possible to 
increase parallelism or redistribute work among sources once WALs have already 
been enqueued. Increasing parallelism on the sink side will not help if there 
is a big rock at the source. Source side mitigations like splitting and 
redistribute cannot help deep queues already accumulated.

Can we redistribute source work? Yes and no. If a source regionserver fails, 
its queues will be recovered by other regionservers. However the other rs must 
still serve the recovered queue as an atomic entity. We can move a deep queue, 
but we can't break it up. 

Where time is of the essence, and ordering semantics can be allowed to break, 
operators should have available to them a recovery tool that rescues their 
production from the consequences of  deep source queues. A very large 
replication queue can be split into many smaller queues. Perhaps even one new 
queue for each WAL file. Then, these new synthetic queues can be distributed to 
any/all source regionservers through the normal recovery queue assignment 
protocol. This increases parallelism at the source.

Of course this would break serial replication semantics, and sync replication 
semantics, and even in branch-1 which does not have these features would highly 
increase the probability of reordering of edits. That is an unavoidable 
consequence of breaking up the queue for more parallelism, but as long as this 
is done by a separate tool, invoked by operators, it is a valid option for 
emergency drain if backed up replication queues. Every cell in the WAL entries 
carries a timestamp assigned at the source, and will be applied on the sink 
with this timestamp. When the queue is drained and all edits have been 
persisted at the target, there will be a complete and correct temporal data 
ordering at that time. An operator will be and must be prepared to handle 
intermediate mis-/re-ordered states if they intend to invoke this tool. In many 
use cases the interim states are not important. The final state after all edits 
have transferred cross cluster and persisted at this sink, after invocation of 
the recovery tool, is the point where the operator would transition back into 
service.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to