[ 
https://issues.apache.org/jira/browse/HBASE-12865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281689#comment-14281689
 ] 

Lars Hofhansl commented on HBASE-12865:
---------------------------------------

Actually #3 can work. We can be lame and take a lock in 
CleanerChore.checkAndDeleteFiles and in 
ReplicationSourceManager.NodeFailoverWorker.run(). That way we have a 
transactionally safe check-and-delete w.r.t. the queues the NodeFailoverWorker 
can add.
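
A rough sketch of the idea (a toy model only, not the actual HBase classes; 
the name WalQueueTracker and the exact lock placement are just illustrative):

{code}
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

// One lock shared by the failover queue transfer and the cleaner's
// check-and-delete, so the cleaner can never observe a half-moved queue.
public class WalQueueTracker {
  private final ReentrantLock queueLock = new ReentrantLock();
  private final Set<String> walsInQueues = new HashSet<>();

  // Stands in for ReplicationSourceManager.NodeFailoverWorker.run():
  // claim a dead server's WALs under the lock.
  public void transferQueue(Set<String> deadServerWals) {
    queueLock.lock();
    try {
      walsInQueues.addAll(deadServerWals);
    } finally {
      queueLock.unlock();
    }
  }

  // Stands in for the check done in CleanerChore.checkAndDeleteFiles():
  // under the same lock, only WALs referenced by no queue are deletable.
  public Set<String> deletableWals(Set<String> candidates) {
    queueLock.lock();
    try {
      Set<String> deletable = new HashSet<>(candidates);
      deletable.removeAll(walsInQueues);
      return deletable;
    } finally {
      queueLock.unlock();
    }
  }
}
{code}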

The downside is that the same cleaner chore also handles HFiles, etc., which 
do not need the lock. We can work around that by adding hooks into 
CleanerChore that are called before and after checkAndDeleteFiles, or by 
making checkAndDeleteFiles in CleanerChore protected and then overriding and 
wrapping it in the ReplicationLogCleaner.
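
For the second option, the wrapping could look roughly like this (again just 
a sketch with simplified stand-in classes, not the real CleanerChore / 
ReplicationLogCleaner code):

{code}
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Simplified stand-in for CleanerChore, with checkAndDeleteFiles made
// protected so a subclass can wrap it.
class BaseCleanerChore {
  protected void checkAndDeleteFiles(List<String> files) {
    // base behaviour: consult the delegates and delete unreferenced files
  }
}

// Simplified stand-in for a WAL-cleaning chore that wraps the check in the
// same lock the NodeFailoverWorker takes; the HFile chore keeps using the
// unwrapped base implementation and pays no locking cost.
class ReplicationAwareLogCleanerChore extends BaseCleanerChore {
  private final ReentrantLock queueLock;

  ReplicationAwareLogCleanerChore(ReentrantLock sharedQueueLock) {
    this.queueLock = sharedQueueLock;
  }

  @Override
  protected void checkAndDeleteFiles(List<String> files) {
    queueLock.lock();
    try {
      super.checkAndDeleteFiles(files);
    } finally {
      queueLock.unlock();
    }
  }
}
{code}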



> WALs may be deleted before they are replicated to peers
> -------------------------------------------------------
>
>                 Key: HBASE-12865
>                 URL: https://issues.apache.org/jira/browse/HBASE-12865
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Liu Shaohui
>
> By design, the ReplicationLogCleaner guarantees that WALs still in a 
> replication queue cannot be deleted by the HMaster. The 
> ReplicationLogCleaner gets the WAL set from ZooKeeper by scanning the 
> replication znodes, but it may get an incomplete WAL set during replication 
> failover because the scan is not atomic.
> For example: there are three region servers, rs1, rs2, and rs3, and a peer 
> with id 10. The layout of the replication ZooKeeper nodes is:
> {code}
> /hbase/replication/rs/rs1/10/wals
>                      /rs2/10/wals
>                      /rs3/10/wals
> {code}
> - t1: the ReplicationLogCleaner finishes scanning the replication queue of 
> rs1 and starts to scan the queue of rs2.
> - t2: region server rs3 goes down, and rs1 takes over rs3's replication 
> queue. The new layout is:
> {code}
> /hbase/replication/rs/rs1/10/wals
>                      /rs1/10-rs3/wals
>                      /rs2/10/wals
>                      /rs3
> {code}
> - t3: the ReplicationLogCleaner finishes scanning the queue of rs2 and 
> starts to scan the node of rs3, but the queue has already been moved to 
> /hbase/replication/rs/rs1/10-rs3/wals.
> So the ReplicationLogCleaner misses the WALs of rs3 for peer 10, and the 
> HMaster may delete these WALs before they are replicated to the peer 
> clusters.
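> The window shows up if you write the scan out; roughly (just a sketch 
> against the plain ZooKeeper client, assuming an open 
> org.apache.zookeeper.ZooKeeper handle and the usual Set/HashSet imports, 
> not the actual ReplicationLogCleaner code):
> {code}
> static Set<String> walsInQueues(ZooKeeper zk)
>     throws KeeperException, InterruptedException {
>   Set<String> wals = new HashSet<>();
>   for (String rs : zk.getChildren("/hbase/replication/rs", false)) {
>     // if rs3 dies at this point, after rs1 was scanned but before rs3 is
>     // reached, the queue moved under rs1 is never seen and rs3's WALs are
>     // missing from the set
>     for (String queue : zk.getChildren("/hbase/replication/rs/" + rs, false)) {
>       wals.addAll(zk.getChildren("/hbase/replication/rs/" + rs + "/" + queue,
>           false));
>     }
>   }
>   return wals; // WALs not in this set are considered deletable
> }
> {code}
> Nothing ties the listing of the rs znodes to the later reads of their 
> queues, which is the non-atomicity described above.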
> We encountered this problem in our cluster and I think it's a serious bug for 
> replication.
> Suggestions to fix this bug are welcome. thx~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
