[ https://issues.apache.org/jira/browse/HBASE-20426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462011#comment-16462011 ]

Zheng Hu commented on HBASE-20426:
----------------------------------

The RB is too slow, so I'll leave a comment here:
{code}
@@ -851,9 +910,32 @@ public class ReplicationSourceManager implements ReplicationListener {
             peer = replicationPeers.getPeer(src.getPeerId());
             if (peer == null || !isOldPeer(src.getPeerId(), peer)) {
               src.terminate("Recovered queue doesn't belong to any current peer");
-              removeRecoveredSource(src);
+              deleteQueue(queueId);
               continue;
             }
+            // Do not setup recovered queue if a sync replication peer is in standby state
+            if (peer.getPeerConfig().isSyncReplication()) {
+              Pair<SyncReplicationState, SyncReplicationState> stateAndNewState =
+                peer.getSyncReplicationStateAndNewState();
+              if (stateAndNewState.getFirst().equals(SyncReplicationState.STANDBY) ||
+                stateAndNewState.getSecond().equals(SyncReplicationState.STANDBY)) {
+                src.terminate("Sync replication peer is in STANDBY state");
+                deleteQueue(queueId);
+                continue;
+              }
+            }
{code}

Why do we need to terminate the recovered source when in S state in 
NodeFailoverWorker? If the cluster is in S state and is transiting to DA 
state, and one RS crashed while replaying the remote WALs, I don't think we 
can just abandon the WALs from the crashed RS, because we need those WALs to 
replicate back to the other cluster ... 
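To make the concern concrete, here is a minimal self-contained sketch of the check from the diff above (the enum and class here are simplified stand-ins, not HBase's actual classes). It shows that the `first == STANDBY || second == STANDBY` condition fires not only in steady STANDBY, but also while the peer is transiting from S to DA, which is the case where the crashed RS's WALs may still be needed:

```java
// Simplified stand-in for HBase's SyncReplicationState enum.
enum SyncReplicationState { ACTIVE, DOWNGRADE_ACTIVE, STANDBY, NONE }

class StandbyCheckSketch {
    // Mirrors the condition in the diff: give up the recovered queue if
    // either the current state or the new (target) state is STANDBY.
    static boolean shouldGiveUp(SyncReplicationState state,
                                SyncReplicationState newState) {
        return state == SyncReplicationState.STANDBY
            || newState == SyncReplicationState.STANDBY;
    }

    public static void main(String[] args) {
        // Steady STANDBY: dropping the recovered queue is the intended behavior.
        System.out.println(shouldGiveUp(SyncReplicationState.STANDBY,
                                        SyncReplicationState.NONE)); // true
        // S -> DA transition: the check fires here too, so WALs from a
        // crashed RS would be dropped mid-transition -- the case
        // questioned above.
        System.out.println(shouldGiveUp(SyncReplicationState.STANDBY,
                                        SyncReplicationState.DOWNGRADE_ACTIVE)); // true
    }
}
```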

> Give up replicating anything in S state
> ---------------------------------------
>
>                 Key: HBASE-20426
>                 URL: https://issues.apache.org/jira/browse/HBASE-20426
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Replication
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>            Priority: Major
>             Fix For: HBASE-19064
>
>         Attachments: HBASE-20426-HBASE-19064-v1.patch, 
> HBASE-20426-HBASE-19064-v1.patch, HBASE-20426-HBASE-19064-v1.patch, 
> HBASE-20426-HBASE-19064-v1.patch, HBASE-20426-HBASE-19064-v1.patch, 
> HBASE-20426-HBASE-19064-v2.patch, HBASE-20426-HBASE-19064-v3.patch, 
> HBASE-20426-HBASE-19064.patch, HBASE-20426-HBASE-19064.patch, 
> HBASE-20426-HBASE-19064.patch, HBASE-20426-UT.patch
>
>
> When we transit the remote S cluster to DA, and then transit the old A 
> cluster to S, it is possible that we still have some entries which have not 
> been replicated yet for the old A cluster, and then the async replication 
> will be blocked.
> And this may also lead to data inconsistency after we transit it back to DA 
> later, as these entries will be replicated again, but the new data which 
> were replicated from the remote cluster will not be replicated back, which 
> introduces a hole in the replication.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
