[
https://issues.apache.org/jira/browse/BOOKKEEPER-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434177#comment-13434177
]
Sijie Guo commented on BOOKKEEPER-365:
--------------------------------------
I will try to summarize the recovery-read problem caused by the current read
strategy.
Suppose A, B, and C form the quorum that a read for an entry is sent to.
1) If A and B return NoSuchEntry/NoSuchLedger and C cannot be reached, the read
response is CouldNotConnectException.
2) If A cannot be reached and B and C return NoSuchEntry/NoSuchLedger, the read
response is NoSuchEntry/NoSuchLedger.
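The two cases above can be reproduced with a small model. This is only a
sketch: it assumes the client tries the quorum in order and reports the last
error it saw, which is consistent with 1) and 2) but is not taken from the
actual client code. The class, enum, and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class ReadStrategySketch {
    // Hypothetical per-bookie replies; not BookKeeper's real return codes.
    enum Response { OK, NO_SUCH_ENTRY, COULD_NOT_CONNECT }

    // Assumed model: try each bookie in order, stop on the first success,
    // otherwise report whichever error was seen last.
    static Response sequentialRead(List<Response> replies) {
        Response last = null;
        for (Response r : replies) {
            if (r == Response.OK) {
                return Response.OK;
            }
            last = r;
        }
        return last;
    }

    public static void main(String[] args) {
        // Case 1): A and B have no entry, C is unreachable.
        System.out.println(sequentialRead(Arrays.asList(
                Response.NO_SUCH_ENTRY, Response.NO_SUCH_ENTRY, Response.COULD_NOT_CONNECT)));
        // Case 2): A is unreachable, B and C have no entry.
        System.out.println(sequentialRead(Arrays.asList(
                Response.COULD_NOT_CONNECT, Response.NO_SUCH_ENTRY, Response.NO_SUCH_ENTRY)));
    }
}
```

In this model the result depends only on which bookie happens to be asked
last, not on how many bookies actually answered.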
LedgerRecovery treats 1) as a failure and does not close the ledger, while it
treats 2) as the normal case and closes the ledger.
But neither 1) nor 2) is handled correctly.
For 1), if the recovery read targets a nonexistent entry, none of A, B, C has
the entry. If C goes down forever, BookKeeperAdmin runs BookieRecovery to
replace C, but the ledger still cannot be closed to let recovery proceed, so
the ledger never becomes available for reads.
For 2), closing the ledger can lose entries if A was only experiencing a
transient failure such as a network partition, since A may still hold the
entry.
One possible idea for the recovery read of the last entry: only close the
ledger when NoSuchLedger/NoSuchEntry has been received from every bookie in the
quorum. That resolves 2), but problem 1) remains if a machine is gone forever:
the ledger status is undetermined. Many ledgers would become unreadable (they
could not be opened) if we try to replace a bookie (using BookKeeperAdmin) that
happens to be the last bookie in the last ensemble.
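To make the proposed rule concrete, here is a minimal sketch of the decision.
All names are hypothetical and do not correspond to the real LedgerRecovery
code. It also assumes that a single successful read means the entry exists,
and it treats an empty reply list as undetermined.

```java
import java.util.Arrays;
import java.util.List;

final class RecoveryDecisionSketch {
    // Hypothetical reply and decision types, for illustration only.
    enum Reply { FOUND, NO_SUCH_ENTRY, NO_SUCH_LEDGER, COULD_NOT_CONNECT }
    enum Decision { ENTRY_EXISTS, CLOSE_LEDGER, UNDETERMINED }

    // Close only when every bookie in the quorum says the entry is missing.
    static Decision decide(List<Reply> quorumReplies) {
        if (quorumReplies.isEmpty()) {
            return Decision.UNDETERMINED; // no evidence at all
        }
        boolean allMissing = true;
        for (Reply r : quorumReplies) {
            if (r == Reply.FOUND) {
                return Decision.ENTRY_EXISTS; // assumption: one copy is enough
            }
            if (r == Reply.COULD_NOT_CONNECT) {
                allMissing = false; // this bookie might still hold the entry
            }
        }
        return allMissing ? Decision.CLOSE_LEDGER : Decision.UNDETERMINED;
    }

    public static void main(String[] args) {
        // Case 1): stays undetermined for as long as C is unreachable.
        System.out.println(decide(Arrays.asList(
                Reply.NO_SUCH_ENTRY, Reply.NO_SUCH_ENTRY, Reply.COULD_NOT_CONNECT)));
        // Case 2): no longer closes the ledger while A is unreachable.
        System.out.println(decide(Arrays.asList(
                Reply.COULD_NOT_CONNECT, Reply.NO_SUCH_ENTRY, Reply.NO_SUCH_LEDGER)));
        // Every bookie agrees that the entry is missing.
        System.out.println(decide(Arrays.asList(
                Reply.NO_SUCH_ENTRY, Reply.NO_SUCH_ENTRY, Reply.NO_SUCH_LEDGER)));
    }
}
```

Under this rule 1) and 2) both end up undetermined, so the entry loss in 2)
is avoided but the permanently lost bookie in 1) still blocks the close.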
I have just written down my earlier ideas. Comments are welcome.
> Ledger will never recover if one of the quorum bookie is down forever and
> others dont have entry
> ------------------------------------------------------------------------------------------------
>
> Key: BOOKKEEPER-365
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-365
> Project: Bookkeeper
> Issue Type: Bug
> Affects Versions: 4.0.0, 4.1.0
> Reporter: Sijie Guo
> Fix For: 4.2.0
>
>
> As discussed in BOOKKEEPER-355, the current fix for the issue below is not
> correct. We need to find a new solution.
> If some bookies of a quorum are gone forever and the remaining bookies of that
> quorum are alive but don't have the entry (NoSuchEntry or NoSuchLedger), then
> there is no evidence on which to recover/close the ledger.