[
https://issues.apache.org/jira/browse/BOOKKEEPER-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290885#comment-13290885
]
Ivan Kelly commented on BOOKKEEPER-237:
---------------------------------------
@Flavio,
A ledger is made of fragments; a fragment has a start id and an ensemble of
bookies. A bookie participates in a fragment if it is in that fragment's
ensemble. Say we have bookies bA, bB, bC, bD, bE and ledgers 1-5, each with one
fragment. The ledger fragments are:
F1: Ledger 1 - Entry1 - bD, bE, bC
F2: Ledger 2 - Entry1 - bE, bA, bC
F3: Ledger 3 - Entry1 - bD, bB, bC
F4: Ledger 4 - Entry1 - bA, bB, bE
F5: Ledger 5 - Entry1 - bE, bC, bD
bA gets the list of fragments it participates in, F2 & F4. From this it builds
the fragment index, which maps each other bookie to the fragments it shares
with bA:
bB -> F4
bC -> F2
bE -> F2, F4
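As a rough illustration of the index construction above, here is a minimal
in-memory sketch. The class and method names (FragmentIndexSketch, Fragment,
buildIndex, exampleFragments) are made up for this example and are not taken
from the BookKeeper code base or from the patch; fragments are plain strings
rather than real ledger metadata.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FragmentIndexSketch {
    /** Hypothetical fragment record: a label plus the ensemble of bookies storing it. */
    static final class Fragment {
        final String name;
        final List<String> ensemble;

        Fragment(String name, String... ensemble) {
            this.name = name;
            this.ensemble = Arrays.asList(ensemble);
        }
    }

    /** Maps each peer bookie to the fragments it shares with {@code self}. */
    static Map<String, List<String>> buildIndex(String self, List<Fragment> all) {
        Map<String, List<String>> index = new TreeMap<>();
        for (Fragment f : all) {
            if (!f.ensemble.contains(self)) {
                continue; // self does not participate in this fragment
            }
            for (String peer : f.ensemble) {
                if (!peer.equals(self)) {
                    index.computeIfAbsent(peer, k -> new ArrayList<>()).add(f.name);
                }
            }
        }
        return index;
    }

    /** The five fragments F1..F5 from the example in this comment. */
    static List<Fragment> exampleFragments() {
        return Arrays.asList(
            new Fragment("F1", "bD", "bE", "bC"),
            new Fragment("F2", "bE", "bA", "bC"),
            new Fragment("F3", "bD", "bB", "bC"),
            new Fragment("F4", "bA", "bB", "bE"),
            new Fragment("F5", "bE", "bC", "bD"));
    }

    public static void main(String[] args) {
        // prints {bB=[F4], bC=[F2], bE=[F2, F4]}
        System.out.println(buildIndex("bA", exampleFragments()));
    }
}
```

A TreeMap is used only so the printed order is stable; any map would do.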
bA watches /ledgers/available for bookies disappearing.
bE disappears.
bA sees that bE has disappeared, and runs a check on F2 and F4. It finds that
the bE replica is missing for each fragment, so it adds an underreplicated
znode for each of them.
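The detection step could be sketched roughly as follows. This is an in-memory
stand-in only: BookieWatcherSketch and its methods are hypothetical names, the
set of available bookies is passed in by hand instead of coming from a
ZooKeeper watch, a plain set stands in for the underreplicated znodes, and the
per-fragment check that the replica is really missing is skipped.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class BookieWatcherSketch {
    private final Map<String, List<String>> fragmentIndex; // peer bookie -> shared fragments
    private final Set<String> underreplicated = new TreeSet<>(); // stand-in for znodes
    private Set<String> lastSeen;

    BookieWatcherSketch(Map<String, List<String>> fragmentIndex, Set<String> available) {
        this.fragmentIndex = new HashMap<>(fragmentIndex);
        this.lastSeen = new TreeSet<>(available);
    }

    /** Called with the current set of available bookies each time the watch fires. */
    void onAvailableBookiesChanged(Set<String> available) {
        Set<String> gone = new TreeSet<>(lastSeen);
        gone.removeAll(available); // bookies seen last time but missing now
        for (String bookie : gone) {
            List<String> shared =
                fragmentIndex.getOrDefault(bookie, Collections.<String>emptyList());
            // A real implementation would verify each fragment before marking it.
            underreplicated.addAll(shared);
        }
        lastSeen = new TreeSet<>(available);
    }

    Set<String> underreplicated() {
        return Collections.unmodifiableSet(underreplicated);
    }
}
```

With bA's index from the example, removing bE from the available set leaves
F2 and F4 marked as underreplicated, which matches the walk-through above.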
Re: rebuilding, the loop of the recovery worker on each bookie can look like:
{code}
while (true) {
    pickUnderreplicatedFragmentFromList();
    rereplicate();
}
{code}
A single bookie will only be rereplicating a single fragment at a time. As all
bookies will be running the recovery worker, this automatically load balances.
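To make the "one fragment at a time per bookie" idea concrete, here is a small
simulation, under heavy assumptions: a shared in-memory queue stands in for
the list of underreplicated fragments, one thread stands in for each bookie's
recovery worker, and recording the bookie name stands in for rereplicate().
RecoveryWorkerSketch and runWorkers are hypothetical names. Unlike the
while (true) loop, the workers here stop once the queue is empty so that the
example terminates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

class RecoveryWorkerSketch {
    /** Runs one worker thread per bookie; returns fragment -> bookie that handled it. */
    static Map<String, String> runWorkers(List<String> bookies, List<String> fragments) {
        Queue<String> underreplicated = new ConcurrentLinkedQueue<>(fragments);
        Map<String, String> repairedBy = new ConcurrentHashMap<>();
        List<Thread> workers = new ArrayList<>();
        for (String bookie : bookies) {
            Thread t = new Thread(() -> {
                String fragment;
                // poll() removes the head atomically, so a fragment is claimed by
                // exactly one worker, and each worker holds one fragment at a time.
                while ((fragment = underreplicated.poll()) != null) {
                    repairedBy.put(fragment, bookie); // placeholder for rereplicate()
                }
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return repairedBy;
    }
}
```

In a real deployment the claim step would need some form of distributed
coordination instead of a local queue, so that two bookies do not pick the
same fragment; how that is done is left open here.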
@Rakesh
I was actually going through your patch when I came up with this. Will go back
to looking at it now. I think there's a good bit of crossover.
> Automatic recovery of under-replicated ledgers and its entries
> --------------------------------------------------------------
>
> Key: BOOKKEEPER-237
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-237
> Project: Bookkeeper
> Issue Type: New Feature
> Components: bookkeeper-client, bookkeeper-server
> Affects Versions: 4.0.0
> Reporter: Rakesh R
> Assignee: Rakesh R
> Attachments: Auto Recovery Detection - distributed chain
> approach.doc, Auto Recovery and Bookie sync-ups.pdf
>
>
> As per the current design of BookKeeper, if one of the BookKeeper servers
> dies, there is no automatic mechanism to identify and recover the
> under-replicated ledgers and their corresponding entries. This can lead to
> losing successfully written entries, which is a critical problem in
> sensitive systems. This document describes a few proposals to overcome these
> limitations.