[
https://issues.apache.org/jira/browse/BOOKKEEPER-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290273#comment-13290273
]
Ivan Kelly commented on BOOKKEEPER-237:
---------------------------------------
I think the chaining mechanism over-complicates things. In fact I don't think
we should be bookie focused at all. Rather we should focus on the ledgers and
keeping them fully replicated. If we detect underreplication for a ledger, we
have detected the loss of the bookie anyhow.
I propose an alternative approach.
Each bookie has a Recovery worker running.
Bookies elect an Auditor among themselves.
Auditor
- Scans the full list of ledgers periodically.
- Builds an in-memory bookie -> ledger index
- Watches /ledgers/available
On bookie failure:
- Get the ledgers for the failed bookie from the index.
- Scan each of these ledgers.
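
As a rough illustration of the index and the failure lookup described above, here is a minimal Java sketch. The class name BookieLedgerIndex, its methods, and the use of plain strings for bookie addresses are all hypothetical; nothing here is existing BookKeeper API. It assumes the Auditor learns each ledger's bookies from the periodic scan and learns the live bookies from the children of /ledgers/available.

```java
import java.util.*;

// Hypothetical helper: maps each bookie address to the ledgers stored on it.
class BookieLedgerIndex {
    private final Map<String, Set<Long>> ledgersByBookie = new HashMap<>();

    // Called during the periodic scan: the ledger has data on each listed bookie.
    void addLedger(long ledgerId, Collection<String> bookies) {
        for (String bookie : bookies) {
            ledgersByBookie.computeIfAbsent(bookie, b -> new TreeSet<>()).add(ledgerId);
        }
    }

    // Ledgers that need to be scanned when the given bookie fails.
    Set<Long> ledgersFor(String bookie) {
        return ledgersByBookie.getOrDefault(bookie, Collections.emptySet());
    }

    // Bookies known to the index but absent from the available set
    // (assumed to come from the children of /ledgers/available).
    Set<String> missingBookies(Set<String> available) {
        Set<String> missing = new TreeSet<>(ledgersByBookie.keySet());
        missing.removeAll(available);
        return missing;
    }
}
```

On a watch event, the Auditor would call missingBookies with the new child list and then scan ledgersFor each missing bookie.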
Scanning a ledger will return a number of LedgerFragmentReplicas, each
corresponding to a missing ledger fragment replica.
These are stored in
/ledgers/underreplicated/L<ledgerid>-E<startentry>-R<replicaindex>
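
To make the scan result and the znode naming concrete, here is a small Java sketch. It rests on several assumptions that are not stated above: ids are written as plain decimal numbers, the replica index is the bookie's position within the fragment's ensemble, and a ledger's fragments are available as a map from start entry to ensemble. The class UnderreplicatedPath and its methods are hypothetical names.

```java
import java.util.*;
import java.util.regex.*;

// Hypothetical value type for one missing replica of a ledger fragment.
final class UnderreplicatedPath {
    static final String ROOT = "/ledgers/underreplicated";
    // Assumption: ids are plain decimal numbers with no padding.
    private static final Pattern NAME = Pattern.compile("L(\\d+)-E(\\d+)-R(\\d+)");

    final long ledgerId;
    final long startEntry;
    final int replicaIndex;

    UnderreplicatedPath(long ledgerId, long startEntry, int replicaIndex) {
        this.ledgerId = ledgerId;
        this.startEntry = startEntry;
        this.replicaIndex = replicaIndex;
    }

    // Build the full znode path for this missing replica.
    String toPath() {
        return ROOT + "/L" + ledgerId + "-E" + startEntry + "-R" + replicaIndex;
    }

    // Parse a znode path (or bare node name) back into its three fields.
    static UnderreplicatedPath parse(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        Matcher m = NAME.matcher(name);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an underreplicated entry: " + path);
        }
        return new UnderreplicatedPath(Long.parseLong(m.group(1)),
                Long.parseLong(m.group(2)), Integer.parseInt(m.group(3)));
    }

    // Scan one ledger: 'fragments' maps each fragment's start entry to its
    // ensemble; every ensemble slot whose bookie is unavailable is reported.
    static List<UnderreplicatedPath> scan(long ledgerId,
            SortedMap<Long, List<String>> fragments, Set<String> available) {
        List<UnderreplicatedPath> missing = new ArrayList<>();
        for (Map.Entry<Long, List<String>> f : fragments.entrySet()) {
            List<String> ensemble = f.getValue();
            for (int i = 0; i < ensemble.size(); i++) {
                if (!available.contains(ensemble.get(i))) {
                    missing.add(new UnderreplicatedPath(ledgerId, f.getKey(), i));
                }
            }
        }
        return missing;
    }
}
```

Encoding all three fields in the node name means a recovery worker can tell what to rereplicate from the child list alone, without reading node data.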
The recovery worker on each bookie reads the list from /ledgers/underreplicated/,
picks an entry, locks it and rereplicates it.
If a recovery worker crashes halfway, its lock will evaporate, and another
recovery worker will be able to do the replication.
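
The claim-and-release behaviour can be sketched as below. This is only an in-memory stand-in: in a real deployment the lock would presumably be an ephemeral ZooKeeper znode that disappears when the worker's session ends, whereas here a ConcurrentHashMap plays that role so the example runs on its own. InMemoryLockTable, RecoveryWorker, and expireSession are hypothetical names.

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for ephemeral lock znodes: entry name -> id of the owning worker.
class InMemoryLockTable {
    private final Map<String, String> owners = new ConcurrentHashMap<>();

    // Atomically claim an entry; returns false if another worker holds it.
    boolean tryLock(String entry, String workerId) {
        return owners.putIfAbsent(entry, workerId) == null;
    }

    // Release a lock after successful rereplication (only if we own it).
    void unlock(String entry, String workerId) {
        owners.remove(entry, workerId);
    }

    // Simulates a crash: every lock held by the worker disappears at once.
    void expireSession(String workerId) {
        owners.values().removeIf(workerId::equals);
    }
}

class RecoveryWorker {
    private final String id;
    private final InMemoryLockTable locks;

    RecoveryWorker(String id, InMemoryLockTable locks) {
        this.id = id;
        this.locks = locks;
    }

    // Returns the first entry this worker managed to lock, or null if none is free.
    String claimOne(List<String> underreplicated) {
        for (String entry : underreplicated) {
            if (locks.tryLock(entry, id)) {
                return entry;
            }
        }
        return null;
    }
}
```

Because the lock is tied to the worker rather than to the work item, no explicit cleanup step is needed after a crash; the entry simply becomes claimable again.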
> Automatic recovery of under-replicated ledgers and its entries
> --------------------------------------------------------------
>
> Key: BOOKKEEPER-237
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-237
> Project: Bookkeeper
> Issue Type: New Feature
> Components: bookkeeper-client, bookkeeper-server
> Affects Versions: 4.0.0
> Reporter: Rakesh R
> Assignee: Rakesh R
> Attachments: Auto Recovery Detection - distributed chain
> approach.doc, Auto Recovery and Bookie sync-ups.pdf
>
>
> As per the current design of BookKeeper, if one of the BookKeeper servers
> dies, there is no automatic mechanism to identify and recover the under
> replicated ledgers and their corresponding entries. This would lead to losing
> the successfully written entries, which will be a critical problem in
> sensitive systems. This document is trying to describe a few proposals to
> overcome these limitations.