[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290885#comment-13290885
 ] 

Ivan Kelly commented on BOOKKEEPER-237:
---------------------------------------

@Flavio,
A ledger is made of fragments; each fragment has a start entry id and an 
ensemble of bookies. A bookie participates in a fragment if it is in that 
fragment's ensemble. Say we have bookies bA, bB, bC, bD, bE and ledgers 1-5, 
each with one fragment. The ledger fragments are:
F1: Ledger 1 - Entry1 - bD, bE, bC
F2: Ledger 2 - Entry1 - bE, bA, bC
F3: Ledger 3 - Entry1 - bD, bB, bC
F4: Ledger 4 - Entry1 - bA, bB, bE
F5: Ledger 5 - Entry1 - bE, bC, bD

bA gets the list of fragments it participates in, F2 and F4. From this it 
builds the fragment index, which maps each other bookie in those ensembles to 
the fragments it shares with bA:
bB -> F4
bC -> F2
bE -> F2, F4
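
A minimal Java sketch of building that index from the fragment list above. The 
class and method names (FragmentIndexSketch, Fragment, buildIndex) are 
hypothetical and chosen for illustration; they are not BookKeeper APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; none of these names come from BookKeeper itself.
class FragmentIndexSketch {
    // A fragment: its ledger, its first entry id, and its ensemble of bookies.
    static final class Fragment {
        final String name;
        final long ledgerId;
        final long startEntryId;
        final List<String> ensemble;

        Fragment(String name, long ledgerId, long startEntryId, String... ensemble) {
            this.name = name;
            this.ledgerId = ledgerId;
            this.startEntryId = startEntryId;
            this.ensemble = Arrays.asList(ensemble);
        }
    }

    // Maps every other bookie to the fragments it shares with `self`.
    static Map<String, List<String>> buildIndex(String self, List<Fragment> fragments) {
        Map<String, List<String>> index = new TreeMap<>();
        for (Fragment f : fragments) {
            if (!f.ensemble.contains(self)) {
                continue; // self does not participate in this fragment
            }
            for (String bookie : f.ensemble) {
                if (!bookie.equals(self)) {
                    index.computeIfAbsent(bookie, k -> new ArrayList<>()).add(f.name);
                }
            }
        }
        return index;
    }

    // The five fragments from the example above.
    static List<Fragment> exampleFragments() {
        return Arrays.asList(
            new Fragment("F1", 1, 1, "bD", "bE", "bC"),
            new Fragment("F2", 2, 1, "bE", "bA", "bC"),
            new Fragment("F3", 3, 1, "bD", "bB", "bC"),
            new Fragment("F4", 4, 1, "bA", "bB", "bE"),
            new Fragment("F5", 5, 1, "bE", "bC", "bD"));
    }

    public static void main(String[] args) {
        // prints {bB=[F4], bC=[F2], bE=[F2, F4]}
        System.out.println(buildIndex("bA", exampleFragments()));
    }
}
```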

bA watches /ledgers/available for bookies disappearing. 
bE disappears.
bA sees that bE has disappeared and runs a check on F2 and F4, the fragments 
indexed under bE. It finds that the bE replica is missing for each, so it adds 
an underreplicated znode for each fragment.
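
A rough Java sketch of that reaction, assuming the watch delivers the previous 
and current sets of available bookies. It is an in-memory stand-in: the names 
(BookieWatchSketch, fragmentsToFlag) are hypothetical, the real replica check 
is skipped (a vanished bookie is simply treated as a missing replica), and the 
returned set stands in for the underreplicated znodes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; a real version would use ZooKeeper watches and znodes.
class BookieWatchSketch {
    // Returns the fragments to flag as under-replicated after the set of
    // available bookies changes from `previous` to `current`.
    static Set<String> fragmentsToFlag(Map<String, List<String>> index,
                                       Set<String> previous,
                                       Set<String> current) {
        Set<String> flagged = new TreeSet<>();
        for (String bookie : previous) {
            if (current.contains(bookie)) {
                continue; // bookie is still registered
            }
            // Assumption: every fragment shared with a vanished bookie has
            // lost that bookie's replica.
            flagged.addAll(index.getOrDefault(bookie, Collections.<String>emptyList()));
        }
        return flagged;
    }

    public static void main(String[] args) {
        // bA's fragment index from the example above.
        Map<String, List<String>> index = new HashMap<>();
        index.put("bB", Arrays.asList("F4"));
        index.put("bC", Arrays.asList("F2"));
        index.put("bE", Arrays.asList("F2", "F4"));

        Set<String> before = new HashSet<>(Arrays.asList("bA", "bB", "bC", "bD", "bE"));
        Set<String> after = new HashSet<>(Arrays.asList("bA", "bB", "bC", "bD"));

        // bE disappeared; prints [F2, F4]
        System.out.println(fragmentsToFlag(index, before, after));
    }
}
```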

Re: rebuilding, the recovery worker loop on each bookie can look like this:
{code}
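while (true) {
   // Take one under-replicated fragment from the shared list
   Fragment f = pickUnderreplicatedFragmentFromList();
   // Restore its missing replica before picking the next one
   rereplicate(f);
}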
{code}
A single bookie will only rereplicate one fragment at a time. Since all bookies 
run the recovery worker, the rereplication work is automatically load balanced 
across them.
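
To illustrate the load balancing, here is a small Java sketch with one worker 
thread per bookie pulling from a shared queue. It rests on several 
assumptions: the in-memory queue stands in for the shared list of 
underreplicated fragments, recording the bookie name stands in for 
rereplicate(), and the loop stops when the queue is empty so that the example 
terminates. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; the shared queue stands in for the underreplicated list.
class RecoveryWorkerSketch {
    // Runs one worker per bookie and returns which bookie handled each fragment.
    static Map<String, String> run(List<String> bookies, List<String> underReplicated) {
        Queue<String> shared = new ConcurrentLinkedQueue<>(underReplicated);
        Map<String, String> handledBy = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(bookies.size());
        for (String bookie : bookies) {
            pool.submit(() -> {
                String fragment;
                // poll() hands each fragment to exactly one worker, and each
                // worker holds at most one fragment at a time.
                while ((fragment = shared.poll()) != null) {
                    handledBy.put(fragment, bookie); // stand-in for rereplicate()
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return handledBy;
    }

    public static void main(String[] args) {
        Map<String, String> result =
            run(Arrays.asList("bA", "bB", "bC", "bD"), Arrays.asList("F2", "F4"));
        // Which bookie handles which fragment varies from run to run.
        System.out.println(new TreeSet<>(result.keySet()));
    }
}
```

No coordinator assigns work here; whichever worker is free takes the next 
fragment, which is what spreads the load.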

@Rakesh
I was actually going through your patch when I came up with this. Will go back 
to looking at it now. I think there's a good bit of crossover.
                
> Automatic recovery of under-replicated ledgers and its entries
> --------------------------------------------------------------
>
>                 Key: BOOKKEEPER-237
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-237
>             Project: Bookkeeper
>          Issue Type: New Feature
>          Components: bookkeeper-client, bookkeeper-server
>    Affects Versions: 4.0.0
>            Reporter: Rakesh R
>            Assignee: Rakesh R
>         Attachments: Auto Recovery Detection - distributed chain 
> approach.doc, Auto Recovery and Bookie sync-ups.pdf
>
>
> As per the current design of BookKeeper, if one of the BookKeeper servers 
> dies, there is no automatic mechanism to identify and recover the 
> under-replicated ledgers and their corresponding entries. This could lead to 
> losing successfully written entries, which would be a critical problem in 
> sensitive systems. This document describes a few proposals to overcome these 
> limitations. 


        
