[ https://issues.apache.org/jira/browse/HBASE-15867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512711#comment-17512711 ]

Duo Zhang commented on HBASE-15867:
-----------------------------------

I've been thinking of this for a long time.

Basically there are two parts of replication storage: one is for storing 
replication peers, the other is for storing replication queues. There are several 
problems here. The first is that we need to load all replication peers in 
HRegionServer.setupWALAndReplication; the second is that when creating a new WAL 
writer, we need to record the new WAL file before actually writing to it.

The first one introduces a cyclic dependency between RS startup and assigning a 
region. If we want to store peer information in a region, then we need to have a 
RS to which we can assign that region. But anyway, since replication peer 
information is not very large and also has a low QPS, I think we can store it in 
the master local region and let the RS request it from master through rpc, i.e., 
introduce a MasterReplicationPeerStorage. We already need to communicate with 
master when starting a RS, so this does not add a new dependency to region server 
startup. And when we want to touch the replication peer storage, it usually 
means we want to add/remove/modify a peer or enable/disable a peer; all these 
operations need master to be up first, as we need to send the request to master 
anyway, so it is not a big deal to let the RS rely on master when doing these 
operations. And maybe we could just store the replication peers as a file on the 
DFS, since we will not have too many replication peers. Anyway, this problem is 
easier to fix compared to the second one.
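
To make the idea concrete, below is a minimal sketch of what a master-backed peer 
storage could look like. The class, interface and RPC method names here are my 
own illustration and assumptions, not existing HBase APIs.

{code:java}
import java.io.IOException;
import java.util.List;

// Hypothetical sketch only: the RS asks the active master for peer information
// over RPC instead of reading ZooKeeper directly. All names are illustrative.
public class MasterReplicationPeerStorage {

  /** Minimal stand-in for an RPC stub to the active master. */
  public interface MasterPeerRpc {
    List<String> listReplicationPeerIds() throws IOException;
    byte[] getReplicationPeerConfig(String peerId) throws IOException;
  }

  private final MasterPeerRpc master;

  public MasterReplicationPeerStorage(MasterPeerRpc master) {
    this.master = master;
  }

  /** Called on RS startup: fetch the peers from master instead of from ZK. */
  public List<String> listPeerIds() throws IOException {
    return master.listReplicationPeerIds();
  }

  /** The master reads its local store (master local region or a file on DFS). */
  public byte[] getPeerConfig(String peerId) throws IOException {
    return master.getReplicationPeerConfig(peerId);
  }
}
{code}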

The second one introduces a cyclic dependency between assigning a region and 
creating the WAL for that region. This could be solved by storing the replication 
queue information in a region which will never be replicated. For example, in 
HBASE-22938 we proposed to fold all system tables into hbase:meta. Or we could 
introduce a separate WAL instance for system tables, just like what we have 
done for hbase:meta.

But I'm still not satisfied with the above approach, which is why I have not 
actually started to work on this issue. ZooKeeper is designed to be HA and its 
failover is pretty fast. But for HBase, if the region server which holds the 
region storing the replication queue information crashes, WAL rolling for the 
whole cluster will hang for a 'long' time (usually tens of seconds or even 
several minutes). I think it will hurt the availability of the HBase cluster.

So I spent a lot of time thinking about whether it is possible to not rely on the 
replication queue storage when rolling WAL. Recently I've gotten a rough idea of 
how to remove the dependency, so let me put it here first. I think writing a 
solution out helps polish the idea and also shows whether it really works. And I 
also want others in the community to consider whether it works.

Besides WAL rolling, we use the replication queue storage in two places. One is 
in replication itself, where we get the files which need to be replicated, and 
also record the replication progress, i.e., the offset in a file before which all 
entries have been replicated. The other is in ReplicationLogCleaner, where we 
check whether a WAL file is still recorded in the replication queue storage; if 
it is, then we should not delete it.
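
For reference, the current cleaner check described above boils down to a 
set-membership test; a minimal illustrative sketch follows (how the set of 
recorded WALs gets loaded is glossed over, and the helper is not the actual 
HBase code):

{code:java}
import java.util.Set;

// Illustrative only: today a WAL can be deleted by the log cleaner as long as
// no replication queue still records it by name.
public final class CurrentLogCleanerCheck {
  private CurrentLogCleanerCheck() {
  }

  /** walsRecordedInQueues would be loaded from the replication queue storage. */
  public static boolean isDeletable(String walName, Set<String> walsRecordedInQueues) {
    return !walsRecordedInQueues.contains(walName);
  }
}
{code}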

The basic idea for avoiding recording every file here is that we could sort the 
WAL files of a regionserver by their name, and the order is exactly the order in 
which the files are written. If multi-WAL is enabled, we will have several 
groups, but in each group we could still sort the files. So for deleting, we only 
need to know which file we are currently replicating; then we know that all the 
files before this file can be deleted, and all files after it (including the file 
itself) can not be deleted. And for replicating, we also know which file should 
be replicated next after finishing replicating a file.
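
A small sketch of this ordering argument, assuming WAL names within a group sort 
lexicographically in creation order (the helper names are illustrative only):

{code:java}
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Illustrative sketch: within one WAL group the file names sort in creation
// order, so the single "currently replicating" file splits the group into a
// deletable part and a part that must be kept.
public final class WalOrdering {
  private WalOrdering() {
  }

  /** Files strictly before the one being replicated can be deleted. */
  public static List<String> deletable(List<String> walsInGroup, String currentWal) {
    return walsInGroup.stream()
      .sorted()
      .filter(wal -> wal.compareTo(currentWal) < 0)
      .collect(Collectors.toList());
  }

  /** The file to replicate next, once the current one is fully replicated. */
  public static Optional<String> nextToReplicate(List<String> walsInGroup, String currentWal) {
    return walsInGroup.stream()
      .sorted()
      .filter(wal -> wal.compareTo(currentWal) > 0)
      .findFirst();
  }
}
{code}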

So the solution here is that we will track the replication offset per queue, 
not per file. The replication offset will be a (peer_id, regionserver, 
group (can be empty if no multi-WAL), file, offset_in_file) tuple. In 
ReplicationQueueStorage, we will only record the replication offset, without 
recording the actual files which need to be replicated. In this way, when rolling 
WAL we do not need to use ReplicationQueueStorage any more. For replicating a 
normal replication queue, the files to be replicated are always maintained in 
memory, so there is no problem with just recording the replication offset while 
replicating. For a recovered replication queue, the ReplicationQueueStorage still 
needs to provide a claimQueue method, and for getting the actual files to be 
replicated, we need to list all WAL files of the given regionserver and filter 
out the specific group if needed, i.e., if multi-WAL is enabled. The WAL files 
should all be placed in the oldWALs directory as the region server is already 
dead, but for safety, there is no harm in also listing the WALs directory.
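
To make the shape of the data concrete, here is a hedged sketch of the offset 
record and the slimmed-down queue storage; the names below are my own 
illustration, not a committed API:

{code:java}
import java.util.List;

// Illustrative sketch of the proposed per-queue replication offset and the
// reduced queue storage interface.
public interface ReplicationQueueOffsetStorage {

  /** The (peer_id, regionserver, group, file, offset_in_file) tuple. */
  final class ReplicationOffset {
    final String peerId;
    final String regionServer;
    final String group;       // empty if multi-WAL is not enabled
    final String file;        // the WAL file currently being replicated
    final long offsetInFile;  // everything before this offset is replicated

    ReplicationOffset(String peerId, String regionServer, String group, String file,
        long offsetInFile) {
      this.peerId = peerId;
      this.regionServer = regionServer;
      this.group = group;
      this.file = file;
      this.offsetInFile = offsetInFile;
    }
  }

  /** Record progress; the only write needed while replicating, none on WAL roll. */
  void setOffset(ReplicationOffset offset);

  /** Used by the log cleaner in preClean and when replicating a recovered queue. */
  List<ReplicationOffset> listAllOffsets();

  /** Transfer the queues of a dead regionserver to the claiming regionserver. */
  List<ReplicationOffset> claimQueue(String deadRegionServer, String claimingRegionServer);
}
{code}
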
For ReplicationLogCleaner, the new algorithm will be that, for a cleaner round, 
in preClean we will load all the replication offsets. When testing whether a WAL 
file can be deleted, we check whether it is before the replication offset for its 
group; if so, it can be deleted, otherwise not. If there is no replication offset 
for its group, I do not think we should delete the file, as in general all files 
should be replicated (although we may just skip all the entries in them).
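
A small sketch of that deletability rule under the same name-ordering assumption 
(group handling is simplified to a single offset per group; with multiple peers 
the check would have to pass for every peer's offset in that group):

{code:java}
import java.util.Map;

// Illustrative sketch of the proposed cleaner rule: a WAL is deletable only if
// its group has a recorded offset and the WAL sorts strictly before the file
// named in that offset; no offset for the group means keep the file.
public final class ProposedLogCleanerCheck {
  private ProposedLogCleanerCheck() {
  }

  /**
   * @param offsetsByGroup the file currently being replicated per WAL group,
   *          loaded once in preClean (illustrative representation)
   */
  public static boolean isDeletable(String walName, String group,
      Map<String, String> offsetsByGroup) {
    String currentFile = offsetsByGroup.get(group);
    if (currentFile == null) {
      // No offset recorded for this group: be conservative and keep the file.
      return false;
    }
    // WAL names within a group sort in creation order (assumption stated above).
    return walName.compareTo(currentFile) < 0;
  }
}
{code}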

Above is my understanding for now. If there are no big concerns, I will start to 
implement a POC to see if it really works.

Thanks.

> Move HBase replication tracking from ZooKeeper to HBase
> -------------------------------------------------------
>
>                 Key: HBASE-15867
>                 URL: https://issues.apache.org/jira/browse/HBASE-15867
>             Project: HBase
>          Issue Type: New Feature
>          Components: Replication
>    Affects Versions: 2.1.0
>            Reporter: Joseph
>            Assignee: Zheng Hu
>            Priority: Major
>
> Move the WAL file and offset tracking out of ZooKeeper and into an HBase 
> table called hbase:replication. 
> The three largest new changes will be three classes: ReplicationTableBase, 
> TableBasedReplicationQueues, and TableBasedReplicationQueuesClient. As of now, 
> ReplicationPeers and HFileRefs tracking will not be implemented. Subtasks 
> have been filed for these two jobs.



