[
https://issues.apache.org/jira/browse/BOOKKEEPER-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281470#comment-13281470
]
Rakesh R commented on BOOKKEEPER-237:
-------------------------------------
Thanks Flavio for the comments. I hope the following gives a clearer idea.
bq. Just to confirm, elements in the myId list have to be deleted manually,
yes? If a node is decommissioned, then I suppose we will want to delete from
the list.
Yes, since these are persistent znodes they need to be deleted manually, or we
could think about automatic deletion after the full re-replication of the
failed/decommissioned bookie. I feel here again there are race conditions to
avoid: for example, the garbage collector (GC, which deletes inactive ids)
decides to delete the znode, but meanwhile the failed bookie rejoins and
re-acquires its 'MyId'. If the GC then goes ahead with the 'MyId' deletion, it
will cause inconsistencies.
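One possible guard against that race, just as a sketch of one option and not a
settled design: the GC does a conditional delete against the znode version it
observed, so a concurrent re-acquire makes the delete fail safely. The path
layout and the assumption that re-acquiring 'MyId' updates the znode data
(bumping its version) are both hypothetical here.
{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class MyIdGc {
    private final ZooKeeper zk;

    public MyIdGc(ZooKeeper zk) {
        this.zk = zk;
    }

    /**
     * Delete an inactive bookie's MyId only if nothing touched it since the
     * GC inspected it. Assumes a rejoining bookie re-acquires MyId by
     * updating the znode's data, which bumps its version.
     */
    public boolean deleteIfStillInactive(String myIdPath)
            throws KeeperException, InterruptedException {
        Stat stat = zk.exists(myIdPath, false);
        if (stat == null) {
            return true; // already gone
        }
        try {
            zk.delete(myIdPath, stat.getVersion()); // conditional delete
            return true;
        } catch (KeeperException.BadVersionException e) {
            return false; // bookie re-acquired MyId in the meantime; back off
        } catch (KeeperException.NotEmptyException e) {
            return false; // pending re-replication children exist; back off
        }
    }
}
{code}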
bq. In step 2 of the monitor (managing the chain), it says that the auditor
notifies some other bookie that it needs to handle re-replication. How exactly
does this notification happen? Bookies currently don't talk to each other
directly. We would need to do this communication through zookeeper if we want
to keep bookies decoupled.
Oh, it seems this was not explained clearly in the doc. Yes, I'm trying to use
ZK-based communication.
As mentioned in the doc, a bookie will first create/acquire its MyId and then
add a child watcher on that MyId. So, when the auditor identifies a failed
bookie and adds its id under the observer's MyId, ZK in turn notifies the
observer bookie.
Please see the following example:
The monitor chain is 01 <- 02 <- 03 <- 04 <- 05 <- 06 <- 01 (the <- symbol
denotes the re-replication observer).
Say 3, 5 and 6 have gone down, and 1, 2 and 4 are alive.
The auditor will add each failed bookie's id under its observer bookie's MyId:
4/3, 6/5, 1/6.
On the child notification, 4 will start re-replication of 3.
On the child notification, 1 will start re-replication of 6, and 1 will also
need to check whether any children are present under 6 (here, 5) and, if they
exist, take care of those nodes as well. Here there is a chance of duplicates:
for example, while 1 is starting the action for 5, 6 rejoins and immediately
starts acting on 5 too, since no central coordination exists.
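Concretely, the auditor's "notification" above is nothing more than a znode
create: adding '3' under 4's MyId fires 4's child watch. A minimal
auditor-side sketch, assuming a hypothetical layout '/ledgers/myids/<bookieId>'
(the real paths are still open):
{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class AuditorNotifier {
    // hypothetical root of the MyId chain
    static final String MYIDS = "/ledgers/myids";

    private final ZooKeeper zk;

    public AuditorNotifier(ZooKeeper zk) {
        this.zk = zk;
    }

    /** Hand a failed bookie to its observer: a persistent create under the
     *  observer's MyId is enough to fire the observer's child watch. */
    public void assignToObserver(String observerId, String failedId)
            throws KeeperException, InterruptedException {
        zk.create(MYIDS + "/" + observerId + "/" + failedId,
                  new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
}
{code}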
I'm thinking that a bookie will start parsing and re-replication (a sketch
follows this list):
# on every child notification, and
# on bookie startup, when it will check whether any node exists under its
MyId awaiting re-replication.
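A rough bookie-side sketch of those two triggers, using the same hypothetical
'/ledgers/myids' layout as above. Note that getChildren both reads the pending
ids and re-arms the watch, so the startup scan and the notification path share
one code path:
{code:java}
import java.util.List;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class MyIdObserver implements Watcher {
    private final ZooKeeper zk;
    private final String myIdPath; // e.g. "/ledgers/myids/4"

    public MyIdObserver(ZooKeeper zk, String myIdPath) {
        this.zk = zk;
        this.myIdPath = myIdPath;
    }

    /** Called once at bookie startup and again from the watcher: reads the
     *  pending failed-bookie ids and re-registers the child watch. */
    public void checkPending() throws KeeperException, InterruptedException {
        List<String> failed = zk.getChildren(myIdPath, this);
        for (String failedId : failed) {
            // a real implementation would also walk the failed bookie's own
            // MyId children (e.g. 1 finding 5 under 6 in the example above)
            startReReplication(failedId);
        }
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            try {
                checkPending();
            } catch (KeeperException | InterruptedException e) {
                // retry with backoff in a real implementation
            }
        }
    }

    private void startReReplication(String failedBookieId) {
        // re-replicate the failed bookie's ledgers (out of scope here)
    }
}
{code}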
bq. The description says L00001_ip:port, but it is not clear if ip:port
corresponds to the lock holder, in which case the lock znode wouldn't be unique
_How does re-replication work?_
The re-replication cycle is shown below:
# All bookies will watch the children of '/ledgers/underreplicas'.
# On a child watch notification, read the children; the child znode name
format is LedgerName_ip:port, where ip:port identifies the failed bookie that
holds the 'LedgerName' entries.
# All live bookies will try to create an ephemeral znode 'lock' (ZK
distributed locking) under the znode 'LedgerName_ip:port'.
# Whoever succeeds will start re-replication, say BK_X.
# All the others will set a watch on
'/ledgers/underreplicas/LedgerName_ip:port/lock'.
# When BK_X finishes re-replication of LedgerName_ip:port, it will update the
ledger metadata, and then delete LedgerName_ip:port if re-replication is
fully over. Otherwise (assume it was not able to fully re-replicate; please
refer to the example in the doc), it will remove the 'lock' under
'/ledgers/underreplicas/LedgerName_ip:port' and the others will get the
notification.
# On the lock release/delete notification, the others will again compete with
each other, and this cycle continues until re-replication is complete.
Assume 3's IP:PORT is 10.18.40.13:2167.
Taking the example above, consider that 3 has failed and 4 identifies that the
ledger 'L00001' has entries on 3.
4 will create a znode like '/ledgers/underreplicas/L00001_10.18.40.13:2167'.
Since all live bookies are watching the children of '/ledgers/underreplicas',
every one gets the notification and tries to acquire the lock for doing the
re-replication. Only one bookie (say BK5) will be able to create the ephemeral
znode 'lock' under '/ledgers/underreplicas/L00001_10.18.40.13:2167', and it
starts re-replication. All the other bookies will watch
'/ledgers/underreplicas/L00001_10.18.40.13:2167/lock' to see the status of the
re-replication. Say after the first round some entries are still left to
re-replicate; then the first replicator (BK5) will update the ledger metadata
(if any) and release the lock. This cycle continues again until
re-replication is fully over.
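Steps 3-6 are basically the standard ZK ephemeral-lock recipe. A minimal
sketch of the acquire/watch side (error handling trimmed; the retry-on-delete
wiring is left as a comment):
{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class ReReplicationLock {
    private final ZooKeeper zk;

    public ReReplicationLock(ZooKeeper zk) {
        this.zk = zk;
    }

    /**
     * Try to become the re-replicator for one under-replicated ledger,
     * e.g. path = "/ledgers/underreplicas/L00001_10.18.40.13:2167".
     * Returns true if this bookie won the lock; a loser sets a watch on
     * the 'lock' child and competes again when it is deleted.
     */
    public boolean tryLock(String underReplicaPath)
            throws KeeperException, InterruptedException {
        try {
            // ephemeral: the lock vanishes if the re-replicator itself dies
            zk.create(underReplicaPath + "/lock", new byte[0],
                      Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;
        } catch (KeeperException.NodeExistsException e) {
            // someone else is re-replicating; watch the lock znode and
            // re-enter the competition on the NodeDeleted event
            zk.exists(underReplicaPath + "/lock", new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    // on NodeDeleted: call tryLock again
                }
            });
            return false;
        }
    }
}
{code}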
> Automatic recovery of under-replicated ledgers and its entries
> --------------------------------------------------------------
>
> Key: BOOKKEEPER-237
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-237
> Project: Bookkeeper
> Issue Type: New Feature
> Components: bookkeeper-client, bookkeeper-server
> Affects Versions: 4.0.0
> Reporter: Rakesh R
> Assignee: Rakesh R
> Attachments: Auto Recovery Detection - distributed chain
> approach.doc, Auto Recovery and Bookie sync-ups.pdf
>
>
> As per the current design of BookKeeper, if one of the BookKeeper servers
> dies, there is no automatic mechanism to identify and recover the
> under-replicated ledgers and their corresponding entries. This can lead to
> losing successfully written entries, which would be a critical problem in
> sensitive systems. This document describes a few proposals to overcome
> these limitations.