[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281470#comment-13281470
 ] 

Rakesh R commented on BOOKKEEPER-237:
-------------------------------------

Thanks Flavio for the comments. I hope the following gives a clearer picture.

bq. Just to confirm, elements in the myId list have to be deleted manually, 
yes? If a node is decommissioned, then I suppose we will want to delete from 
the list.

Yes, since these are persistent znodes, they need to be deleted manually; 
alternatively we could think of automatic deletion after the full 
re-replication of that failed/decommissioned bookie. Here again there are race 
conditions to avoid. For example, the garbage collector (GC, which deletes 
inactive ids) decides to delete the node, but meanwhile the failed bookie 
rejoins and re-acquires its 'MyId'. If the GC goes ahead with the 'MyId' 
deletion anyway, it will cause inconsistencies.
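
For illustration, one way to guard against that GC-vs-rejoin race is a 
version-checked delete: the GC reads the MyId znode's Stat and deletes only 
with the version it observed, so if the rejoining bookie re-acquires the id in 
between (e.g. via a setData, which bumps the version), the delete fails 
instead of removing a live id. This is just a minimal sketch with the plain 
ZooKeeper client; the '/ledgers/monitor' path layout is my assumption, not 
part of the design doc:

{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class MyIdGc {

    /**
     * Delete a decommissioned bookie's persistent MyId znode, but only if
     * nobody re-acquired it in the meantime. Returns true if deleted.
     * Hypothetical layout: /ledgers/monitor/<myId>
     */
    public static boolean deleteIfStillInactive(ZooKeeper zk, String myIdPath)
            throws InterruptedException, KeeperException {
        Stat stat = zk.exists(myIdPath, false);
        if (stat == null) {
            return true; // already gone
        }
        // ... confirm re-replication of this bookie is fully over here ...
        try {
            // Conditional delete: fails with BadVersion if the rejoining
            // bookie re-acquired the id (bumping the version) in between.
            // It also fails with NotEmpty if failed ids are still parked
            // under this MyId, which is exactly what we want.
            zk.delete(myIdPath, stat.getVersion());
            return true;
        } catch (KeeperException.BadVersionException
                | KeeperException.NoNodeException
                | KeeperException.NotEmptyException e) {
            return false; // bookie rejoined or work is pending; leave it
        }
    }
}
{code}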


bq. In step 2 of the monitor (managing the chain), it says that the auditor 
notifies some other bookie that it needs to handle re-replication. How exactly 
does this notification happen? Bookies currently don't talk to each other 
directly. We would need to do this communication through zookeeper if we want 
to keep bookies decoupled.

Oh, it seems this is not explained clearly in the doc. Yes, I'm trying to use 
ZK-based communication.
As mentioned in the doc, a bookie will first create/acquire its MyId and then 
add a child watcher on it. So, when the Auditor identifies a failed bookie and 
adds its id under the observer's MyId, ZooKeeper in turn notifies the observer 
bookie.
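
In ZooKeeper terms that could look roughly like the sketch below: the 
"notification" is just the auditor creating a child znode, named after the 
failed bookie, under the observer's MyId, on which the observer holds a child 
watch. The '/ledgers/monitor' layout is again my assumption for illustration:

{code:java}
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class MonitorChainZk {

    // Auditor side: "notify" the observer by creating a child under its
    // MyId; ZooKeeper fires the observer's child watch.
    public static void assignFailedBookie(ZooKeeper zk, String observerId,
                                          String failedId) throws Exception {
        zk.create("/ledgers/monitor/" + observerId + "/" + failedId,
                  new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.PERSISTENT);
    }

    // Observer side: read the failed ids under MyId and leave a child
    // watch. ZooKeeper watches are one-shot, so re-register on each event.
    public static void watchMyId(ZooKeeper zk, String myId) throws Exception {
        List<String> failedIds = zk.getChildren(
                "/ledgers/monitor/" + myId,
                event -> {
                    if (event.getType()
                            == Watcher.Event.EventType.NodeChildrenChanged) {
                        try {
                            watchMyId(zk, myId); // re-read and re-watch
                        } catch (Exception e) {
                            // retry with backoff in a real implementation
                        }
                    }
                });
        for (String failedId : failedIds) {
            // hand failedId to the re-replication logic (not shown)
        }
    }
}
{code}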

Please see the following example:
Monitor chain is 01 <- 02 <- 03 <- 04 <- 05 <- 06 <- 01 (X <- Y means Y is the 
re-replication observer of X).
Say 3, 5 and 6 have gone down and 1, 2, 4 are alive.
The Auditor will add each failed bookie's id under its observer's MyId:
4/3, 6/5, 1/6.

On child notification, 4 will start re-replication of 3.
On child notification, 1 will start re-replication of 6; 1 will also need to 
check whether any children are present under 6's MyId and, if so, take care of 
those nodes too. There is a chance of duplicate work here: while 1 is starting 
to act on 5, 6 may rejoin at that very moment and start acting on 5 as well, 
since no central coordination exists.

I'm thinking that a bookie will start parsing and re-replicating on:
# every child notification, and
# bookie startup, where it will check whether any node exists under its MyId 
for re-replication (see the sketch below).
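
A rough sketch of that startup scan, under the same hypothetical 
'/ledgers/monitor' layout; the recursive walk also covers taking over children 
parked under a failed observer's MyId (the 5-under-6 case above). 
startReReplication is a placeholder:

{code:java}
import org.apache.zookeeper.ZooKeeper;

public class StartupScan {

    // Run once at bookie startup, and again from every child notification.
    public static void scanMyId(ZooKeeper zk, String myId) throws Exception {
        processFailedIds(zk, myId);
    }

    // Walk the failed ids parked under 'id'. If a failed bookie was itself
    // an observer, also take over whatever is parked under its MyId.
    private static void processFailedIds(ZooKeeper zk, String id)
            throws Exception {
        for (String failedId : zk.getChildren("/ledgers/monitor/" + id, false)) {
            startReReplication(failedId);
            processFailedIds(zk, failedId);
        }
    }

    private static void startReReplication(String failedId) {
        // placeholder: acquire per-ledger locks and copy entries
    }
}
{code}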


bq. The description says L00001_ip:port, but it is not clear if ip:port 
corresponds to the lock holder, in which case the lock znode wouldn't be unique

_How does re-replication work?_
The re-replication cycle is shown below:
# All bookies watch the children of '/ledgers/underreplicas'.
# On a child watch notification, they read the children. The child znode name 
format is LedgerName_ip:port, where ip:port is the failed bookie that holds 
entries of 'LedgerName'.
# All the live bookies try to create an ephemeral znode 'lock' (ZK distributed 
locking) under the znode 'LedgerName_ip:port'.
# Whoever succeeds, say BK_X, starts re-replication.
# All the others watch '/ledgers/underreplicas/LedgerName_ip:port/lock'.
# When BK_X finishes re-replicating LedgerName_ip:port, it updates the ledger 
metadata, then deletes the LedgerName_ip:port znode if re-replication is fully 
over. Otherwise (if it was not able to fully re-replicate; please refer to the 
example in the doc), it removes the 'lock' under 
'/ledgers/underreplicas/LedgerName_ip:port' and the others get the 
notification.
# On the lock release/delete notification, the others compete with each other 
again, and this cycle continues until re-replication is complete.


Assume 3's ip:port is 10.18.40.13:2167.
Taking the above example, consider that 3 has failed and 4 identifies that the 
ledger 'L00001' has entries on 3.
4 will create a znode like '/ledgers/underreplicas/L00001_10.18.40.13:2167'.

Since all the live bookies are watching the children of 
'/ledgers/underreplicas', every one gets the notification and tries to acquire 
the lock for doing the re-replication. Only one bookie (say BK5) will be able 
to create the ephemeral znode 'lock' under 
'/ledgers/underreplicas/L00001_10.18.40.13:2167', and it attempts 
re-replication. All the other bookies add a watch on 
'/ledgers/underreplicas/L00001_10.18.40.13:2167/lock' to see the status of the 
re-replication. If, after the first round, some entries still need to be 
re-replicated, the first replicator (BK5) updates the ledger metadata (if any) 
and releases the lock. This cycle continues until re-replication is fully 
over.
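
The lock step of that cycle could look roughly like this with the plain 
ZooKeeper client. The znode names follow the LedgerName_ip:port convention 
above; everything else (method names, the reReplicate placeholder) is 
illustrative, not the final implementation:

{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class UnderReplicaWorker {

    // Compete for one underreplicated ledger, e.g.
    // child = "L00001_10.18.40.13:2167".
    public static void tryReReplicate(ZooKeeper zk, String child)
            throws Exception {
        String parent = "/ledgers/underreplicas/" + child;
        String lockPath = parent + "/lock";
        try {
            // Ephemeral: the lock vanishes if this bookie dies mid-way,
            // so the others get a NodeDeleted event and compete again.
            zk.create(lockPath, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.EPHEMERAL);
        } catch (KeeperException.NodeExistsException e) {
            // Lost the race: watch the lock and compete again on release.
            if (zk.exists(lockPath, event -> {
                        if (event.getType()
                                == Watcher.Event.EventType.NodeDeleted) {
                            try {
                                tryReReplicate(zk, child);
                            } catch (Exception ignored) {
                                // retry with backoff in a real implementation
                            }
                        }
                    }) != null) {
                return;
            }
            tryReReplicate(zk, child); // lock vanished already; retry now
            return;
        }
        boolean fullyDone = reReplicate(child); // copy entries, update metadata
        zk.delete(lockPath, -1); // release the lock either way
        if (fullyDone) {
            zk.delete(parent, -1); // nothing left for this ledger
        }
        // If not fully done, the lock deletion notifies the watchers and
        // the next round of the cycle starts.
    }

    private static boolean reReplicate(String child) {
        return true; // placeholder for the actual entry re-replication
    }
}
{code}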
                
> Automatic recovery of under-replicated ledgers and its entries
> --------------------------------------------------------------
>
>                 Key: BOOKKEEPER-237
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-237
>             Project: Bookkeeper
>          Issue Type: New Feature
>          Components: bookkeeper-client, bookkeeper-server
>    Affects Versions: 4.0.0
>            Reporter: Rakesh R
>            Assignee: Rakesh R
>         Attachments: Auto Recovery Detection - distributed chain 
> approach.doc, Auto Recovery and Bookie sync-ups.pdf
>
>
> As per the current design of BookKeeper, if one of the BookKeeper server 
> dies, there is no automatic mechanism to identify and recover the under 
> replicated ledgers and its corresponding entries. This would lead to losing 
> the successfully written entries, which will be a critical problem in 
> sensitive systems. This document tries to describe a few proposals to 
> overcome these limitations.
