[
https://issues.apache.org/jira/browse/BOOKKEEPER-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271166#comment-13271166
]
Rakesh R commented on BOOKKEEPER-237:
-------------------------------------
Thanks again, Flavio, for the detailed info :)
{quote}
I wouldn't expect many changes to the ensemble of a ledger, so my feeling is
that the example on page 2 is a corner case, so I'm not sure we should optimize
for such cases
{quote}
I feel we should consider all the corner cases, since WALs are too costly. It
would also let us showcase BK as an efficient WAL tool.
{quote}
Second, the same example points out that entries can become underreplicated
with so many consecutive replacements, but at the same time the same bookies
pop up later in future ensembles. Are you considering that the memory of a
bookie is gone once it is removed from an ensemble? If not, then there is no
need to re-establish the degree of replication
.....
Why doesn't it work if we operate at the bookie level?
{quote}
Yes, that's correct: when the bookie comes back (either rejoins or is
restarted), the data will still be in the bookie's memory. The only exceptional
case is when a bookie has a few ledgers that were written successfully, but
unfortunately the write to the current ledger times out. The client would then
reform the ensemble and continue writing.
In that case, shouldn't only this ledger be considered under-replicated, since
it may end up with partial entries, rather than handling it at the bookie level?
{quote}
Here is one proposal. Using ZK, we can create a chain of bookies, where each
bookie watches the previous bookie in the sequence of sequential znodes. Let's
call the watcher bookie the buddy of the watched bookie. If a bookie crashes,
its buddy receives a notification and the buddy is responsible for replicating
the content of the crashed bookie. After a crash, we of course need to restore
the chain by finding other buddies. Also, there are some corner cases related
to multiple failures that we would need to think about more carefully.
{quote}
It's good to see new ideas. I have a few concerns:
# As you pointed out, we need to consider multiple crashes.
Assume the bookie chain BK1->BK2->BK3->BK4->BK5, and say BK2 & BK3 die. BK4
doesn't know about BK2. It would be even more painful with many consecutive
failures.
# Say the current ledger write is timing out, as mentioned above.
Consider, for example, a case of intermittent n/w fluctuations.
# The watcher bookie might be a replica holder of that ledger.
Assume the bookie chain BK1->BK2->BK3->BK4->BK5, and say BK2 fails. BK3 would
not be able to replicate the content, as it may already be an existing replica
holder.
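A minimal in-memory sketch makes the first concern concrete. This simulates the
proposed buddy chain without a real ZooKeeper (the watch mechanics are replaced
by a membership set; the bookie names come from the example above): when BK2 and
BK3 die together, only BK3's disappearance has a live watcher, so BK2's data is
orphaned unless BK4 walks back along the chain.

```java
import java.util.*;

// Sketch only: simulates the proposed buddy chain without real ZooKeeper.
// Each live bookie "watches" its nearest live predecessor in the sequence
// of sequential znodes (BK1 -> BK2 -> ... -> BK5 from the example above).
public class BuddyChainSketch {
    static final List<String> CHAIN = List.of("BK1", "BK2", "BK3", "BK4", "BK5");

    // watcher -> watched: each live bookie is the buddy of the nearest
    // live predecessor in the chain
    static Map<String, String> watches(Set<String> alive) {
        Map<String, String> w = new LinkedHashMap<>();
        String prev = null;
        for (String bk : CHAIN) {
            if (!alive.contains(bk)) continue;
            if (prev != null) w.put(bk, prev);
            prev = bk;
        }
        return w;
    }

    public static void main(String[] args) {
        Set<String> alive = new LinkedHashSet<>(CHAIN);
        Map<String, String> before = watches(alive);
        System.out.println("before failure: " + before);

        // BK2 and BK3 crash together: BK2's watcher (BK3) is itself dead,
        // so a naive per-watch handler would only re-replicate BK3
        alive.remove("BK2");
        alive.remove("BK3");
        String notified = before.entrySet().stream()
                .filter(e -> e.getValue().equals("BK3") && alive.contains(e.getKey()))
                .map(Map.Entry::getKey).findFirst().orElse(null);
        System.out.println(notified + " is notified only about BK3; BK2 is orphaned");
        System.out.println("repaired watches: " + watches(alive)); // BK4 now watches BK1
    }
}
```

The repair step (BK4 re-watching BK1) only happens if the handler explicitly
scans backwards past every dead predecessor, which is exactly the corner case
described above.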
{quote}
The bottom line is that a distributed solution might be more robust than a
centralized one, and it does not require a new independent entity or a
specialized bookie.
{quote}
I feel under-replication detection should be centralized. The detector would
listen for under-replicated ledgers and raise an alarm. Then, any bookie that
doesn't already hold the ledger entry would take it from the queue and
re-replicate it. This would also help avoid many concurrency issues caused by
multiple crashes.
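A toy sketch of that scheme (the ledger ids, replica placements, and
single-worker loop are all invented for illustration; in practice the queue
would live in ZK): the detector enqueues under-replicated ledgers, and a bookie
polling the queue claims only ledgers it does not already hold a replica of,
re-queuing the rest for other bookies.

```java
import java.util.*;

// Sketch only: a centralized detector publishes under-replicated ledgers
// to a shared queue; a bookie claims only ledgers it does not already
// hold a replica of. Ledger ids and replica maps are invented.
public class ReplicationQueueSketch {
    // one pass over the queue: claim ledgers this worker can host,
    // re-queue the ones it already replicates
    static List<Long> claim(String worker, Map<Long, Set<String>> replicas,
                            Deque<Long> queue) {
        List<Long> claimed = new ArrayList<>();
        for (int i = queue.size(); i > 0; i--) {
            long ledger = queue.poll();
            if (replicas.get(ledger).contains(worker))
                queue.addLast(ledger);   // leave it for another bookie
            else
                claimed.add(ledger);     // re-replicate to this bookie
        }
        return claimed;
    }

    public static void main(String[] args) {
        Map<Long, Set<String>> replicas = new LinkedHashMap<>();
        replicas.put(1L, Set.of("BK1", "BK2"));  // lost its third replica
        replicas.put(2L, Set.of("BK2", "BK3"));
        Deque<Long> queue = new ArrayDeque<>(replicas.keySet());

        System.out.println("BK3 claims: " + claim("BK3", replicas, queue)); // [1]
        System.out.println("left for others: " + queue);                    // [2]
    }
}
```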
{quote}
I like the idea in general of having different schedules, especially the one
that errors an operation to the ledger upon a crash instead of changing the
ensemble automatically.
{quote}
I would like to know more about this. IMHO, we should avoid reformation within
a ledger and instead throw a specific exception back to the client, so that the
client would close the ledger and create a new one. The client would still get
ensemble reformation/dynamic bookies at the ledger level. My idea is to
simplify ledger parsing for detecting under-replicated ledger entries and
identifying the target replica bookies.
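A self-contained sketch of that client-side schedule (the Ledger class and
EnsembleChangeDisabledException are invented stand-ins, not the real BookKeeper
client API, which reforms the ensemble instead of failing the write): on a
bookie failure the write fails fast, and the client closes the partial ledger
and retries the entry on a fresh one, so no single ledger ever mixes ensembles.

```java
import java.util.*;

// Sketch only: Ledger and EnsembleChangeDisabledException are invented
// stand-ins for illustration; the real BookKeeper client reforms the
// ensemble instead of failing the write back to the caller.
public class FailFastLedgerSketch {
    static class EnsembleChangeDisabledException extends RuntimeException {}

    static class Ledger {
        final int id;
        final List<String> entries = new ArrayList<>();
        boolean bookieFailed;                 // simulated ensemble member crash
        Ledger(int id) { this.id = id; }
        void addEntry(String e) {
            if (bookieFailed) throw new EnsembleChangeDisabledException();
            entries.add(e);                   // write succeeds on the full ensemble
        }
    }

    public static void main(String[] args) {
        int nextId = 0;
        Ledger lh = new Ledger(nextId++);
        for (String e : List.of("e0", "e1", "e2")) {
            if (e.equals("e1")) lh.bookieFailed = true;  // crash mid-stream
            try {
                lh.addEntry(e);
            } catch (EnsembleChangeDisabledException ex) {
                // close the partial ledger and retry the entry on a new one
                lh = new Ledger(nextId++);
                lh.addEntry(e);
            }
        }
        System.out.println("ledgers used: " + nextId); // 2: one partial, one fresh
    }
}
```

With this schedule, an under-replicated ledger can only ever have one fixed
ensemble, which is what simplifies the parsing described above.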
> Automatic recovery of under-replicated ledgers and its entries
> --------------------------------------------------------------
>
> Key: BOOKKEEPER-237
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-237
> Project: Bookkeeper
> Issue Type: New Feature
> Components: bookkeeper-client, bookkeeper-server
> Affects Versions: 4.0.0
> Reporter: Rakesh R
> Assignee: Rakesh R
> Attachments: Auto Recovery and Bookie sync-ups.pdf
>
>
> As per the current design of BookKeeper, if one of the BookKeeper servers
> dies, there is no automatic mechanism to identify and recover the
> under-replicated ledgers and their corresponding entries. This could lead to
> losing successfully written entries, which would be a critical problem in
> sensitive systems. This document describes a few proposals to overcome these
> limitations.