[
https://issues.apache.org/jira/browse/BOOKKEEPER-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270492#comment-13270492
]
Ivan Kelly commented on BOOKKEEPER-237:
---------------------------------------
{quote}
I have gone through BOOKKEEPER-208; is the 'write quorum' using the RRDS
algorithm for interleaving?
{quote}
Yes, but if the write quorum size is the same as the ensemble size, each
message will be sent to all bookies in the ensemble.
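The round-robin striping can be sketched as follows (a minimal illustration, not the actual client code; the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of round-robin striping for the write quorum:
// entry e goes to writeQuorumSize bookies starting at ensemble index (e mod E).
public class WriteQuorumSketch {
    // Returns the ensemble indices that receive entry `entryId`.
    public static List<Integer> writeSet(long entryId, int ensembleSize, int writeQuorumSize) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < writeQuorumSize; i++) {
            indices.add((int) ((entryId + i) % ensembleSize));
        }
        return indices;
    }
}
```

When writeQuorumSize equals ensembleSize, the write set covers every bookie, which matches the "sent to all" behaviour described above.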
{quote}
Also, will the ensemble be reformed if any bookie in the ensemble is slow/dead
and times out?
{quote}
In the case of a slow/dead bookie, the write can continue without issue. When
the connection to the slow/dead bookie times out, the client will try to
replace that bookie.
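The replacement step could look roughly like this (a hypothetical sketch; in practice the new ensemble is recorded in the ledger metadata via ZooKeeper):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of ensemble reformation: swap the timed-out bookie's
// address for a freshly chosen one, leaving the other members in place.
public class EnsembleReplacement {
    public static List<String> replaceBookie(List<String> ensemble,
                                             String failedBookie,
                                             String newBookie) {
        List<String> reformed = new ArrayList<>(ensemble);
        int idx = reformed.indexOf(failedBookie);
        if (idx >= 0) {
            reformed.set(idx, newBookie); // new ensemble applies from the failure point onward
        }
        return reformed;
    }
}
```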
{quote}
But I was thinking of avoiding the interleaving of entries and ensemble
reformation on write-entry failures, to make recovery simple. The idea is that,
in the best case, all bookies in the ensemble hold replicas. The ack quorum is
also set to the ensemble size. Here, assured replicas = quorum size.
Please see the doc section '1.4.4 Example: How it works'.
For any ledger:
in the best case, total replicas = ensemble size;
in the worst case, total replicas = quorum size, which is the assured/majority
replica count.
{quote}
This will behave the same if the write quorum is the same as the ensemble size
and the ack quorum is the same as the "assured replicas".
{quote}
Also, I feel this approach is able to handle intermittent network problems, as
by default it will tolerate (ensemble - quorum) failures.
{quote}
For intermittent network problems, this can be handled by setting the socket
timeout to a large value.
{quote}
How will the client read?
For an in-progress ledger, the bkclient looks to the entire ensemble for
reading.
{quote}
For a non-closed ledger, a read is sent once to all bookies in the ensemble to
get the lastEntryConfirmed. Once we have this, we read as if we were reading
from a closed ledger in which lastEntryConfirmed is the last entry.
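Recovering lastEntryConfirmed from the per-bookie responses amounts to taking the maximum; a minimal sketch (the class name and the -1 sentinel are illustrative assumptions, not the real client API):

```java
// Hypothetical sketch: to read a non-closed ledger, the client asks every
// bookie in the ensemble for its lastEntryConfirmed and keeps the maximum.
public class LastConfirmedSketch {
    public static long recoverLastConfirmed(long[] perBookieResponses) {
        long last = -1; // -1 means no entry confirmed yet (illustrative sentinel)
        for (long lc : perBookieResponses) {
            last = Math.max(last, lc);
        }
        return last;
    }
}
```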
{quote}
For a closed ledger, bkclient looks to the CLOSED ensemble for reading.
{quote}
For a closed ledger, it follows a round-robin reading pattern, as it does
today. For each entry, it tries to read from a single bookie; if the read
fails, it tries the next bookie in the round-robin order.
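The failover order can be sketched like this (a simulation of the selection logic only; `bookieUp` stands in for whether a read from that bookie succeeds, and the names are hypothetical):

```java
// Hypothetical sketch of the closed-ledger read path: start at the bookie
// selected by round-robin and fall back to the next one if the read fails.
public class RoundRobinRead {
    // bookieUp[i] simulates whether a read from ensemble member i succeeds.
    public static int pickBookie(long entryId, boolean[] bookieUp) {
        int ensembleSize = bookieUp.length;
        int start = (int) (entryId % ensembleSize);
        for (int i = 0; i < ensembleSize; i++) {
            int idx = (start + i) % ensembleSize;
            if (bookieUp[idx]) {
                return idx; // first reachable bookie serves the entry
            }
        }
        return -1; // entry unavailable: every bookie in the ensemble failed
    }
}
```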
{quote}
I was trying to avoid multiple ZK calls to read ledger metadata for the
detection. Only for the failed ledgers would the Accountant read the ledger
metadata and initiate the rereplication.
{quote}
ZK is a read-optimised system, so I think doing multiple reads on it is more
efficient than maintaining another list.
{quote}
Also, the map contains the _inprogressreplica information, so a new
Accountant would know about the already initiated rereplications.
...
Yeah, it's good. If I understand correctly, instead of assigning work to a
bookie, the Accountant would keep the list of under-replicated ledgers. A
bookie (one that does not already hold a replica of that ledger) would take an
under-replicated ledger and, after finishing, send an ack to the Accountant. I
feel we should define proper locking here to avoid concurrency problems.
{quote}
Each rereplicating ledger can have a lock on it. When a bookie goes to
rereplicate a ledger, it writes an ephemeral sequential znode as a child of the
rereplicated ledger's znode. It then checks whether it has the lowest sequence
number. If not, it deletes its znode, as this indicates that someone else is
rereplicating that particular ledger. If it does have the lowest sequence
number, it has just acquired the lock. It then checks what rereplication needs
to happen on the ledger, performs the rereplication, and finally deletes its
lock and then the ledger's rereplication znode. I think this also removes the
need for _inprogress znodes.
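The lock recipe above can be sketched with a simulation (a TreeMap stands in for the children of the ledger's znode; real code would use ZooKeeper ephemeral sequential znodes, and all names here are hypothetical):

```java
import java.util.TreeMap;

// Simulated sketch of the rereplication lock recipe: lowest sequence number
// among the znode's children holds the lock; losers delete their node.
public class RereplicationLock {
    // In-memory stand-in for ZK sequential znodes under the ledger znode.
    private final TreeMap<Long, String> children = new TreeMap<>();
    private long nextSequence = 0;

    // "Create an ephemeral sequential znode" -> returns its sequence number.
    public long createLockNode(String bookieId) {
        long seq = nextSequence++;
        children.put(seq, bookieId);
        return seq;
    }

    // The lock holder is whoever owns the lowest sequence number.
    public boolean holdsLock(long mySequence) {
        return !children.isEmpty() && children.firstKey() == mySequence;
    }

    // A loser deletes its znode immediately; the winner deletes it (and then
    // the ledger's rereplication znode) once rereplication is complete.
    public void delete(long sequence) {
        children.remove(sequence);
    }
}
```

Because the znodes would be ephemeral, a bookie that crashes mid-rereplication loses its lock automatically, letting another bookie take over.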
> Automatic recovery of under-replicated ledgers and its entries
> --------------------------------------------------------------
>
> Key: BOOKKEEPER-237
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-237
> Project: Bookkeeper
> Issue Type: New Feature
> Components: bookkeeper-client, bookkeeper-server
> Affects Versions: 4.0.0
> Reporter: Rakesh R
> Assignee: Rakesh R
> Attachments: Auto Recovery and Bookie sync-ups.pdf
>
>
> As per the current design of BookKeeper, if one of the BookKeeper servers
> dies, there is no automatic mechanism to identify and recover the
> under-replicated ledgers and their corresponding entries. This could lead to
> the loss of successfully written entries, which would be a critical problem
> in sensitive systems. This document describes a few proposals to overcome
> these limitations.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira