[ https://issues.apache.org/jira/browse/HBASE-15867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512711#comment-17512711 ]
Duo Zhang commented on HBASE-15867:
-----------------------------------

I've been thinking about this for a long time. Basically there are two parts to replication storage: one is for storing replication peers, the other is for storing replication queues. There are several problems here. The first is that we need to load all replication peers in HRegionServer.setupWALAndReplication; the second is that when creating a new WAL writer, we need to record the new WAL file before actually writing to it.

The first one introduces a cyclic dependency between RS start up and assigning a region: if we want to store peer information in a region, then we need a RS available to assign that region to. But anyway, since replication peer information is not very large and also has a low qps, I think we can store it in the master local region and let the RS request it from master through rpc, i.e., introduce a MasterReplicationPeerStorage. We already need to communicate with master when starting a RS, so this does not add a new dependency to region server start up. And when we want to touch the replication peer storage, it usually means we want to add/remove/modify a peer or enable/disable a peer; all these operations need master to be up first, as we need to send the request to master, so it is not a big deal to let the RS rely on master for them. Maybe we could even just store the replication peers as a file on the DFS, since we will not have too many replication peers. Anyway, this one is easier to fix compared to the second problem.

The second one introduces a cyclic dependency between assigning a region and creating the WAL for the region. This could be solved by storing the replication queue information in a region which will never be replicated. For example, in HBASE-22938 we proposed to fold all system tables into hbase:meta. Or we could introduce a separate WAL instance for system tables, just like what we have done for hbase:meta.

But I'm still not satisfied with the above approach, which is why I have not actually started to work on this issue. ZooKeeper is designed to be HA and its failover is pretty fast. But for HBase, if the region server which holds the region that stores the replication queue information crashes, we will hang WAL rolling for the whole cluster for a 'long' time (usually tens of seconds or even several minutes). I think this will hurt the availability of the HBase cluster. So I spent a lot of time thinking about whether it is possible to not rely on the replication queue storage when rolling a WAL.

Recently I've gotten a rough idea on how to remove the dependency, so let me put it here first. Writing a solution out helps you polish the idea and lets you see whether it really works, and I also want others in the community to consider whether it works.

Besides WAL rolling, we use the replication queue storage in two places. One is in replication itself: we get the files which need to be replicated, and also record the replication progress, i.e., the offset in a file before which all entries have been replicated. The other is in ReplicationHFileCleaner, where we check whether an HFile is recorded in the replication queue storage; if it is, then we should not delete it.

The basic idea for avoiding recording every file is that we can sort the WAL files of a regionserver by their name, and the order is exactly the order in which the files were written. If multi-WAL is enabled we will have several groups, but within each group we can still sort the files.
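To make the ordering idea concrete, here is a minimal sketch in Java. It assumes WAL file names end with a monotonically increasing timestamp after the last dot and that everything before that timestamp identifies the multi-WAL group; the class and helper names (WalOrderingSketch, walGroup, walTimestamp) are illustrations only, not the actual HBase code.

{code:java}
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WalOrderingSketch {

  /** Extract the trailing write timestamp from a WAL file name. */
  static long walTimestamp(String walName) {
    return Long.parseLong(walName.substring(walName.lastIndexOf('.') + 1));
  }

  /** Hypothetical group extractor: everything before the trailing timestamp. */
  static String walGroup(String walName) {
    return walName.substring(0, walName.lastIndexOf('.'));
  }

  /**
   * Group the WAL files of one regionserver and sort each group by the
   * trailing timestamp, so "before"/"after" a given file is well defined.
   */
  static Map<String, List<String>> sortByGroup(List<String> walNames) {
    return walNames.stream().collect(Collectors.groupingBy(
        WalOrderingSketch::walGroup,
        TreeMap::new,
        Collectors.collectingAndThen(
            Collectors.toList(),
            files -> files.stream()
                .sorted(Comparator.comparingLong(WalOrderingSketch::walTimestamp))
                .collect(Collectors.toList()))));
  }
}
{code}

With the files of a group sorted like this, "all files before the one we are currently replicating" is just a prefix of the sorted list, which is all the cleaner and the replication source need to know.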
So for deleting, we only need to know which file we are currently replicating; then all the files before this file can be deleted, and all files after this file (including this file) can not be deleted. And for replicating, we also know which file should be replicated next after finishing a file.

So the solution here is that we track a replication offset per queue, not per file. The replication offset will be a (peer_id, regionserver, group (can be empty if no multi-WAL), file, offset_in_file) tuple. In ReplicationQueueStorage we will only record the replication offset, without recording the actual files which need to be replicated. In this way, when rolling the WAL we do not need to touch ReplicationQueueStorage any more.

For replicating a normal replication queue, the files to be replicated are always maintained in memory, so there is no problem with just recording the replication offset while replicating. For a recovered replication queue, ReplicationQueueStorage still needs to provide a claimQueue method, and to get the actual files to be replicated we need to list all WAL files of the given regionserver and filter out the special group if needed, i.e., if multi-WAL is enabled. The WAL files should all be in the oldWALs directory since the region server is already dead, but for safety there is no harm in also listing the WALs directory.

For ReplicationLogCleaner, the new algorithm will be that, for a cleaner round, in preClean we load all the replication offsets; when testing whether a WAL file can be deleted, we check whether it is before the replication offset for its group. If so, it can be deleted; otherwise not. If there is no replication offset for its group, I do not think we should delete the file, as in general all files should be replicated (although we may just skip all the entries in them). A rough sketch of this check is attached at the end of this message.

That is my understanding for now. If there are no big concerns, I will start to implement a POC to see if it really works. Thanks.

> Move HBase replication tracking from ZooKeeper to HBase
> -------------------------------------------------------
>
>                 Key: HBASE-15867
>                 URL: https://issues.apache.org/jira/browse/HBASE-15867
>             Project: HBase
>          Issue Type: New Feature
>          Components: Replication
>    Affects Versions: 2.1.0
>            Reporter: Joseph
>            Assignee: Zheng Hu
>            Priority: Major
>
> Move the WAL file and offset tracking out of ZooKeeper and into an HBase table called hbase:replication.
> The three largest new changes will be the classes ReplicationTableBase, TableBasedReplicationQueues, and TableBasedReplicationQueuesClient. As of now ReplicationPeers and HFileRefs tracking will not be implemented. Subtasks have been filed for these two jobs.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
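As a supplement to the comment above, here is a rough sketch of the per-queue offset record and the ReplicationLogCleaner-style check it enables. All names here (ReplicationGroupOffset, canDelete) are hypothetical, and the multi-peer handling (a file is only deletable when it is older than every peer's offset for its group) is one reading of the proposal rather than a confirmed detail.

{code:java}
import java.util.List;

/**
 * Hypothetical sketch of the (peer_id, regionserver, group, file, offset_in_file)
 * replication offset and the cleaner check described in the comment above.
 */
public class OffsetBasedCleanerSketch {

  /** One replication offset per (peer, regionserver, group) queue. */
  record ReplicationGroupOffset(String peerId, String regionServer, String group,
      String walFile, long offsetInFile) {}

  /** Trailing timestamp of a WAL file name, used to order files within a group. */
  static long walTimestamp(String walName) {
    return Long.parseLong(walName.substring(walName.lastIndexOf('.') + 1));
  }

  /**
   * A WAL file can be deleted only if, for every peer replicating its group,
   * it is strictly older than the file recorded in that peer's offset.
   * If no offset is recorded for the group at all, keep the file: in general
   * every file should be replicated, even if we end up skipping its entries.
   */
  static boolean canDelete(String walFile, List<ReplicationGroupOffset> offsetsForGroup) {
    if (offsetsForGroup.isEmpty()) {
      return false; // be conservative: no offset recorded, so do not delete
    }
    return offsetsForGroup.stream()
        .allMatch(o -> walTimestamp(walFile) < walTimestamp(o.walFile()));
  }
}
{code}

In preClean the cleaner would load all such offsets once per round and then evaluate each candidate WAL file against the offsets recorded for its group.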