[ 
https://issues.apache.org/jira/browse/HBASE-15867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523108#comment-17523108
 ] 

Duo Zhang commented on HBASE-15867:
-----------------------------------

OK, claimQueue is now driven by the master: we have a 
ClaimReplicationQueuesProcedure, which is the last step of SCP.

Then I think a possible solution is to also list all the WAL files of the dead 
region server, so we know which replication queues these WAL files belong to, 
and check whether those queues exist in ReplicationQueueStorage. If not, we 
insert the initial replication offsets into the queue, so a region server can 
claim the queue.
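
The check described above could be sketched roughly as follows. This is only 
an illustration in plain Java: the in-memory map standing in for 
ReplicationQueueStorage, the queue key layout, the "<walGroup>.<timestamp>" 
file naming, the choice of offset 0 as the initial offset, and all class and 
method names are assumptions, not the actual HBase API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; none of these names exist in HBase.
final class QueueSeedingSketch {

  /** Assumed shape of a stored offset: a WAL file plus a position inside it. */
  static final class Offset {
    final String walFile;
    final long position;

    Offset(String walFile, long position) {
      this.walFile = walFile;
      this.position = position;
    }
  }

  /** Assumed queue identity: one queue per (peer, dead server, WAL group). */
  static String queueKey(String peerId, String deadServer, String walGroup) {
    return peerId + "|" + deadServer + "|" + walGroup;
  }

  /** Assumption: a WAL file is named "<walGroup>.<timestamp>". */
  static String walGroupOf(String walFile) {
    return walFile.substring(0, walFile.lastIndexOf('.'));
  }

  /** Assumption: the suffix after the last dot is a numeric timestamp. */
  static long timestampOf(String walFile) {
    return Long.parseLong(walFile.substring(walFile.lastIndexOf('.') + 1));
  }

  /**
   * For every peer and every WAL group found on the dead server, insert an
   * initial offset (start of the oldest WAL in the group) when the storage has
   * no queue yet. Existing queues are left untouched.
   * Returns the keys of the queues that were added.
   */
  static List<String> seedMissingQueues(Map<String, Offset> storage, String deadServer,
      List<String> walFiles, List<String> peerIds) {
    // Find the oldest WAL file of each WAL group.
    Map<String, String> oldestWalByGroup = new TreeMap<>();
    for (String wal : walFiles) {
      oldestWalByGroup.merge(walGroupOf(wal), wal,
          (a, b) -> timestampOf(a) <= timestampOf(b) ? a : b);
    }
    List<String> added = new ArrayList<>();
    for (String peerId : peerIds) {
      for (Map.Entry<String, String> e : oldestWalByGroup.entrySet()) {
        String key = queueKey(peerId, deadServer, e.getKey());
        if (!storage.containsKey(key)) {
          storage.put(key, new Offset(e.getValue(), 0L));
          added.add(key);
        }
      }
    }
    return added;
  }
}
```

After this runs, every (peer, WAL group) pair of the dead region server has a 
queue entry, so the normal claim logic has something to claim.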

Of course this will introduce a dependency on the replication queue table for 
SCP, but FWIW, it is already there... Theoretically, when we get to this step 
in SCP, all the regions on the dead region server are already online, so there 
will be no cyclic dependency. Practically, if the replication queue table is 
not online and we have a bunch of SCPs hanging at the last claim queue step, 
the system may hang for a long time.

So we'd better add a state check method to the ReplicationQueueStorage 
interface: if the replication queue table is not online, we suspend the SCP 
for a while, to give other SCPs the chance to bring the replication queue 
table online. We should also use a small rpc and operation timeout when 
accessing ReplicationQueueStorage in ClaimReplicationQueuesProcedure, so we 
will not block a PEWorker for a long time.
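
To make the suspend-and-retry idea concrete, here is a minimal sketch in plain 
Java. The QueueStorage interface, its isAvailable() state check, the timeout 
parameter on claimQueues(), the 1 second timeout, and the backoff helper are 
all hypothetical; the real change would go through ReplicationQueueStorage and 
the procedure framework.

```java
import java.io.IOException;

// Hypothetical sketch; none of these names exist in HBase.
final class ClaimStepSketch {

  enum Outcome { CLAIMED, SUSPENDED }

  /** Assumed view of the queue storage, including the proposed state check. */
  interface QueueStorage {
    /** Proposed state check: is the replication queue table online? */
    boolean isAvailable();

    /** Assumed to throw IOException when the call exceeds timeoutMs. */
    void claimQueues(String deadServer, long timeoutMs) throws IOException;
  }

  /** Assumed small timeout so a worker thread is never blocked for long. */
  static final long SHORT_TIMEOUT_MS = 1_000L;

  /**
   * One attempt at the claim step. Returns SUSPENDED instead of blocking when
   * the table is offline or the short timeout is hit, so the caller can release
   * the worker and retry later.
   */
  static Outcome tryClaim(QueueStorage storage, String deadServer) {
    if (!storage.isAvailable()) {
      return Outcome.SUSPENDED;
    }
    try {
      storage.claimQueues(deadServer, SHORT_TIMEOUT_MS);
      return Outcome.CLAIMED;
    } catch (IOException e) {
      return Outcome.SUSPENDED;
    }
  }

  /** Exponential backoff for the suspend time, capped at maxMs. */
  static long backoffMs(int attempt, long baseMs, long maxMs) {
    int shift = Math.min(Math.max(attempt, 0), 20);
    return Math.min(maxMs, baseMs << shift);
  }
}
```

The point of returning SUSPENDED rather than waiting is that the worker is 
freed for other SCPs, one of which may be the one that brings the replication 
queue table online.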

> Move HBase replication tracking from ZooKeeper to HBase
> -------------------------------------------------------
>
>                 Key: HBASE-15867
>                 URL: https://issues.apache.org/jira/browse/HBASE-15867
>             Project: HBase
>          Issue Type: New Feature
>          Components: Replication
>    Affects Versions: 2.1.0
>            Reporter: Joseph
>            Assignee: Zheng Hu
>            Priority: Major
>
> Move the WAL file and offset tracking out of ZooKeeper and into an HBase 
> table called hbase:replication. 
> The three largest changes will be the new classes ReplicationTableBase, 
> TableBasedReplicationQueues, and TableBasedReplicationQueuesClient. As of now 
> ReplicationPeers and HFileRef's tracking will not be implemented. Subtasks 
> have been filed for these two jobs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
