[
https://issues.apache.org/jira/browse/HBASE-20426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462011#comment-16462011
]
Zheng Hu commented on HBASE-20426:
----------------------------------
The RB is too slow, so I'll leave a comment here:
{code}
@@ -851,9 +910,32 @@ public class ReplicationSourceManager implements ReplicationListener {
       peer = replicationPeers.getPeer(src.getPeerId());
       if (peer == null || !isOldPeer(src.getPeerId(), peer)) {
         src.terminate("Recovered queue doesn't belong to any current peer");
-        removeRecoveredSource(src);
+        deleteQueue(queueId);
         continue;
       }
+      // Do not setup recovered queue if a sync replication peer is in standby state
+      if (peer.getPeerConfig().isSyncReplication()) {
+        Pair<SyncReplicationState, SyncReplicationState> stateAndNewState =
+          peer.getSyncReplicationStateAndNewState();
+        if (stateAndNewState.getFirst().equals(SyncReplicationState.STANDBY) ||
+            stateAndNewState.getSecond().equals(SyncReplicationState.STANDBY)) {
+          src.terminate("Sync replication peer is in STANDBY state");
+          deleteQueue(queueId);
+          continue;
+        }
+      }
{code}
Why do we need to terminate the recovered source in NodeFailoverWorker when the peer is in S state? If the cluster is in S state and is transiting to DA state, and one RS crashes while replaying the remote WALs, I don't think we can just abandon the WALs from the crashed RS, because we need those WALs to replicate back to the other cluster...
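To make the concern concrete, here is a minimal standalone sketch of the STANDBY check from the diff above. The class and enum names are simplified stand-ins, not HBase's actual classes; it only shows that the condition fires not just for a peer sitting in S state, but also for a peer whose state pair indicates an in-flight transition touching STANDBY (e.g. S -> DA), which is exactly the case where the recovered queue's WALs may still be needed.

```java
// Hypothetical sketch, not HBase code: models the patch's condition
// "first == STANDBY || second == STANDBY" on the (state, newState) pair.
public class StandbyCheckSketch {

  public enum SyncReplicationState { ACTIVE, DOWNGRADE_ACTIVE, STANDBY, NONE }

  // Mirrors the check added in the patch: the recovered queue is
  // terminated and deleted if either the current state or the pending
  // (new) state is STANDBY.
  public static boolean shouldDropRecoveredQueue(SyncReplicationState current,
                                                 SyncReplicationState pending) {
    return current == SyncReplicationState.STANDBY
        || pending == SyncReplicationState.STANDBY;
  }

  public static void main(String[] args) {
    // Steady S state: dropping the recovered queue is the intended behavior.
    System.out.println(shouldDropRecoveredQueue(
        SyncReplicationState.STANDBY, SyncReplicationState.NONE));
    // Mid-transition S -> DA: the check still fires, so WALs from an RS that
    // crashed while replaying remote WALs would be deleted -- the concern
    // raised in this comment.
    System.out.println(shouldDropRecoveredQueue(
        SyncReplicationState.STANDBY, SyncReplicationState.DOWNGRADE_ACTIVE));
  }
}
```

Both calls print {{true}}, i.e. under this reading the patch cannot distinguish a stable STANDBY peer from one that is transiting out of STANDBY.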
> Give up replicating anything in S state
> ---------------------------------------
>
> Key: HBASE-20426
> URL: https://issues.apache.org/jira/browse/HBASE-20426
> Project: HBase
> Issue Type: Sub-task
> Components: Replication
> Reporter: Duo Zhang
> Assignee: Duo Zhang
> Priority: Major
> Fix For: HBASE-19064
>
> Attachments: HBASE-20426-HBASE-19064-v1.patch,
> HBASE-20426-HBASE-19064-v1.patch, HBASE-20426-HBASE-19064-v1.patch,
> HBASE-20426-HBASE-19064-v1.patch, HBASE-20426-HBASE-19064-v1.patch,
> HBASE-20426-HBASE-19064-v2.patch, HBASE-20426-HBASE-19064-v3.patch,
> HBASE-20426-HBASE-19064.patch, HBASE-20426-HBASE-19064.patch,
> HBASE-20426-HBASE-19064.patch, HBASE-20426-UT.patch
>
>
> When we transit the remote S cluster to DA, and then transit the old A
> cluster to S, it is possible that we still have some entries which have not
> been replicated yet for the old A cluster, and then the async replication
> will be blocked.
> And this may also lead to data inconsistency after we transit it back to DA
> later, as these entries will be replicated again, but the new data which are
> replicated from the remote cluster will not be replicated back, which
> introduces a hole in the replication.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)