Andrew Kyle Purtell created HBASE-24439:
-------------------------------------------
Summary: Replication queue recovery tool for rescuing deep queues
Key: HBASE-24439
URL: https://issues.apache.org/jira/browse/HBASE-24439
Project: HBase
Issue Type: Improvement
Components: Replication
Reporter: Andrew Kyle Purtell
In HBase cross-site replication, on the source side, every regionserver places
its WALs into a replication queue and then drains the queue to the remote sink
cluster. At the source cluster, every regionserver participates as a source. At
the sink cluster, a configurable subset of regionservers volunteer to process
inbound replication RPCs.
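For context, on branch-1 these queues live in ZooKeeper under
/hbase/replication/rs/<regionserver>/<peerId>/<wal> (assuming the default
/hbase base znode), with each WAL znode holding the current replication
offset. A minimal sketch of inspecting per-server queue depth at that layout;
the quorum address and peer id here are illustrative:

{code:java}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ReplicationQueueDepths {
  // Print how many WALs each source regionserver has queued for a peer.
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, null);
    String peerId = args.length > 0 ? args[0] : "1";
    for (String rs : zk.getChildren("/hbase/replication/rs", false)) {
      String queue = "/hbase/replication/rs/" + rs + "/" + peerId;
      if (zk.exists(queue, false) != null) {
        List<String> wals = zk.getChildren(queue, false);
        System.out.println(rs + ": " + wals.size() + " WALs queued");
      }
    }
    zk.close();
  }
}
{code}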
When data is highly skewed we can take certain steps to mitigate, such as
pre-splitting, manual splitting, and rebalancing. This can most effectively
be done at the sink, because replication RPCs are randomly distributed over the
set of receiving regionservers, and splitting on the sink side can effectively
redistribute the resulting writes there. On the source side we are more limited.
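For example, at the sink an operator can split a hot region at an explicit key
with the public Admin API, spreading the replicated writes over more regions
and therefore more regionservers. A minimal sketch; the table name and split
key are hypothetical:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class SplitHotRegion {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Split the hot table at an explicit key so replicated writes spread
      // over more regions (and therefore more sink regionservers).
      admin.split(TableName.valueOf("usertable"), Bytes.toBytes("user5000"));
    }
  }
}
{code}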
If writes are deeply unbalanced, a regionserver's source replication queue may
become very deep. Hotspotting can happen despite mitigations. Unlike on the
sink side, once hotspotting has happened at the source, it is not possible to
increase parallelism or redistribute work among sources, because the WALs have
already been enqueued. Increasing parallelism on the sink side will not help if
there is a big rock at the source. Source side mitigations like splitting and
redistribution cannot help drain queues that have already accumulated.
Can we redistribute source work? Yes and no. If a source regionserver fails,
its queues will be recovered by other regionservers. However, the recovering
regionserver must still serve each recovered queue as an atomic unit. We can
move a deep queue, but we cannot break it up.
Where time is of the essence, and ordering semantics can be allowed to break,
operators should have available to them a recovery tool that rescues their
production from the consequences of deep source queues. A very large
replication queue could be split into many smaller queues, perhaps even one new
queue per WAL file. These new synthetic queues could then be distributed to
any or all source regionservers through the normal recovered-queue assignment
protocol. This increases parallelism at the source.
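A rough sketch of how such a tool might carve up a queue at the znode level,
again assuming the branch-1 layout. This is illustrative, not an existing
HBase API: the synthetic "dead server" names are invented, and actually
getting live sources to claim the synthetic queues would need additional work
to trigger the normal failover path.

{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class SplitDeepQueue {
  // Break one deep queue into one synthetic queue per WAL. Each synthetic
  // queue is parked under a fake "dead" regionserver znode, in the hope that
  // the recovered-queue assignment protocol distributes them across the live
  // sources.
  public static void splitQueue(ZooKeeper zk, String rs, String peerId)
      throws Exception {
    String queue = "/hbase/replication/rs/" + rs + "/" + peerId;
    int i = 0;
    for (String wal : zk.getChildren(queue, false)) {
      // Preserve the replication offset recorded for this WAL.
      byte[] offset = zk.getData(queue + "/" + wal, false, null);
      String fakeRs = "/hbase/replication/rs/" + rs + "-split-" + (i++);
      // Recovered queues are conventionally named <peerId>-<deadServerName>.
      String fakeQueue = fakeRs + "/" + peerId + "-" + rs;
      zk.create(fakeRs, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
      zk.create(fakeQueue, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
      zk.create(fakeQueue + "/" + wal, offset, Ids.OPEN_ACL_UNSAFE,
          CreateMode.PERSISTENT);
      zk.delete(queue + "/" + wal, -1);
    }
  }
}
{code}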
Of course this would break serial replication semantics, and sync replication
semantics, and even on branch-1, which has neither of these features, it would
greatly increase the probability of reordering of edits. That is an unavoidable
consequence of breaking up the queue for more parallelism, but as long as this
is done by a separate tool, invoked by operators, it is a valid option for
emergency draining of backed-up replication queues. Every cell in the WAL
entries carries a timestamp assigned at the source, and will be applied on the
sink with this timestamp. When the queue is drained and all edits have been
persisted at the target, there will be a complete and correct temporal data
ordering at that time. An operator must be prepared to handle intermediate
mis- or re-ordered states if they intend to invoke this tool. In many use cases
the interim states are not important. The final state, after all edits have
been transferred cross-cluster and persisted at the sink following invocation
of the recovery tool, is the point where the operator would transition back
into service.
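A small illustration, with a made-up table and values, of why the final state
converges even when edits arrive at the sink out of order:

{code:java}
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampConvergence {
  public static void main(String[] args) throws Exception {
    byte[] row = Bytes.toBytes("row1");
    byte[] cf = Bytes.toBytes("f");
    byte[] q = Bytes.toBytes("q");
    try (Connection conn =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("t"))) {
      Put older = new Put(row).addColumn(cf, q, 1000L, Bytes.toBytes("old"));
      Put newer = new Put(row).addColumn(cf, q, 2000L, Bytes.toBytes("new"));
      // Apply out of order, as a reordered emergency drain might:
      table.put(newer);
      table.put(older);
      // A subsequent read still returns "new": the cell with the higher
      // source-assigned timestamp wins regardless of arrival order.
    }
  }
}
{code}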
--
This message was sent by Atlassian Jira
(v8.3.4#803005)