[ 
https://issues.apache.org/jira/browse/HBASE-12865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281686#comment-14281686
 ] 

Lars Hofhansl edited comment on HBASE-12865 at 1/18/15 7:24 AM:
----------------------------------------------------------------

Hmm... Couple of thoughts:
# Can we do simple optimistic concurrency control here? At the beginning of the 
method we check the parent node's cversion (that is, the number of changes to 
the children of this znode); at the end we check it again. If it changed we 
start over inside the method, or simply say that no files can be deleted and 
try again during the next call. (A rough sketch of this check follows the 
list.)
# Maybe when an RS takes over a queue it should touch all involved logs first, 
so they all get new timestamps? In that case they would not be eligible for 
deletion until they expire again. That would need to be done *before* the 
queues are moved in ZK.
# Or, since this is only an issue when the *same* region server enumerates the 
queues and adds a queue from another RS, we only need coordination between the 
threads doing this. That is: block the NodeFailoverWorker from claiming any new 
queues while a cleanup or check is in progress.
# There might also be a more complex problem. Queues could be moved *after* we 
checked, but before we get to the delete code. So we would need to make sure 
queues are not moved until after we finished a delete cycle.

#1 seems simple enough. #2 should work, but there's no guarantee and it means 
NN actions. #3 should also work fine.
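
For #2, the "touch" would be one NameNode action per log, along these lines 
(sketch using the Hadoop FileSystem API; walsInQueue is whatever list of WAL 
paths the failover code has at hand):
{code}
// Before claiming a dead RS's queue in ZK, refresh the mtime of every WAL the
// queue references so the TTL-based log cleaner treats them as new again.
void touchLogs(FileSystem fs, List<Path> walsInQueue) throws IOException {
  long now = System.currentTimeMillis();
  for (Path wal : walsInQueue) {
    // setTimes(path, mtime, atime); -1 leaves the access time unchanged.
    fs.setTimes(wal, now, -1);
  }
}
{code}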

#4 is a concern with all these approaches; we need to avoid changes to the 
queues from the point where we start checking them to the point where we 
finish the current delete cycle. And so that cannot be handled 100% in a 
LogCleaner alone (we might need to add begin() and end() hooks to the 
cleaners... Ugh.)
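
For #3/#4, something along these lines; purely a sketch, the names and the 
shared lock are made up, not part of the current cleaner API:
{code}
import java.util.concurrent.locks.ReentrantLock;

// A lock shared (within the same process) between the log cleaner and the
// NodeFailoverWorker: the cleaner holds it for a whole check + delete cycle,
// and queues can only be claimed in between cycles.
public class ReplicationCleanerGuard {
  private final ReentrantLock lock = new ReentrantLock();

  // Would be called by the cleaner before it starts checking the queues.
  public void beginDeleteCycle() {
    lock.lock();
  }

  // Would be called by the cleaner once the deletes for the cycle are done.
  public void endDeleteCycle() {
    lock.unlock();
  }

  // NodeFailoverWorker side: move a dead RS's queues only while no
  // check/delete cycle is running.
  public void claimQueues(Runnable moveQueuesInZk) {
    lock.lock();
    try {
      moveQueuesInZk.run();
    } finally {
      lock.unlock();
    }
  }
}
{code}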


> WALs may be deleted before they are replicated to peers
> -------------------------------------------------------
>
>                 Key: HBASE-12865
>                 URL: https://issues.apache.org/jira/browse/HBASE-12865
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Liu Shaohui
>
> By design, the ReplicationLogCleaner guarantees that WALs still in a 
> replication queue cannot be deleted by the HMaster. The 
> ReplicationLogCleaner gets the WAL set from ZooKeeper by scanning the 
> replication zk node. But it may get an incomplete WAL set during replication 
> failover, because the scan operation is not atomic.
> For example: There are three region servers: rs1, rs2, rs3, and peer id 10.  
> The layout of replication zookeeper nodes is:
> {code}
> /hbase/replication/rs/rs1/10/wals
>                      /rs2/10/wals
>                      /rs3/10/wals
> {code}
> - t1: the ReplicationLogCleaner has finished scanning the replication queue 
> of rs1 and starts to scan the queue of rs2.
> - t2: region server rs3 goes down, and rs1 takes over rs3's replication queue. 
> The new layout is
> {code}
> /hbase/replication/rs/rs1/10/wals
>                      /rs1/10-rs3/wals
>                      /rs2/10/wals
>                      /rs3
> {code}
> - t3: the ReplicationLogCleaner has finished scanning the queue of rs2 and 
> starts to scan the node of rs3. But the queue has been moved to 
> "replication/rs1/10-rs3/WALS".
> So the ReplicationLogCleaner will miss the WALs of rs3 in peer 10, and the 
> HMaster may delete these WALs before they are replicated to the peer clusters.
> We encountered this problem in our cluster and I think it's a serious bug for 
> replication.
> Suggestions to fix this bug are welcome. thx~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
