[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434177#comment-13434177
 ] 

Sijie Guo commented on BOOKKEEPER-365:
--------------------------------------

Let me try to summarize the problems with recovery reads caused by the current 
read strategy.

Suppose A, B, and C form the quorum that a read of an entry is sent to.

1) If A and B return NoSuchEntry/NoSuchLedger and C cannot be reached, the read 
response returns CouldNotConnectException.
2) If A cannot be reached and B and C return NoSuchEntry/NoSuchLedger, the read 
response returns NoSuchEntry/NoSuchLedger.

LedgerRecovery treats 1) as a failure and does not close the ledger, while it 
treats 2) as the normal case and closes the ledger.
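
A minimal sketch of the two cases above, in Java. It is not BookKeeper's 
actual code: the class, enum, and method names are hypothetical, NoSuchLedger 
is folded into NO_SUCH_ENTRY for brevity, and it assumes the read tries the 
bookies in order and reports the last error it saw, which is one reading that 
is consistent with cases 1) and 2).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the behaviour described above, not BookKeeper code.
final class CurrentReadSketch {
    // NO_SUCH_ENTRY also stands in for NoSuchLedger here.
    enum Reply { OK, NO_SUCH_ENTRY, COULD_NOT_CONNECT }

    // Assumption: bookies are tried in order, the first OK wins,
    // and otherwise the last error seen becomes the read result.
    static Reply read(List<Reply> repliesInOrder) {
        Reply last = Reply.COULD_NOT_CONNECT;
        for (Reply r : repliesInOrder) {
            if (r == Reply.OK) {
                return Reply.OK;
            }
            last = r;
        }
        return last;
    }

    // Recovery closes the ledger only when the read reports a missing entry.
    static boolean recoveryClosesLedger(List<Reply> repliesInOrder) {
        return read(repliesInOrder) == Reply.NO_SUCH_ENTRY;
    }

    public static void main(String[] args) {
        List<Reply> case1 = Arrays.asList(
            Reply.NO_SUCH_ENTRY, Reply.NO_SUCH_ENTRY, Reply.COULD_NOT_CONNECT);
        List<Reply> case2 = Arrays.asList(
            Reply.COULD_NOT_CONNECT, Reply.NO_SUCH_ENTRY, Reply.NO_SUCH_ENTRY);
        System.out.println("case 1 closes ledger: " + recoveryClosesLedger(case1));
        System.out.println("case 2 closes ledger: " + recoveryClosesLedger(case2));
    }
}
```

Under these assumptions the same set of replies leads to a different outcome 
depending only on which bookie is the unreachable one.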

But neither 1) nor 2) behaves correctly.

For 1), suppose the recovery read tries to read a non-existent entry, so none 
of A, B, and C has the entry. If C goes down forever, BookKeeperAdmin runs 
BookieRecovery to replace C. But recovery still can't close the ledger and 
proceed, so the ledger would not be available for reads.

For 2), closing the ledger would cause entry loss if A is only experiencing a 
transient failure such as a network partition, because A may still hold the 
entry.

One possible idea for the recovery read of the last entry: we only close the 
ledger when we have received NoSuchLedger/NoSuchEntry from all bookies in the 
quorum, which resolves 2). But it still has problem 1) if a machine is gone 
forever. In case 1), the ledger status is undetermined. This would leave lots 
of ledgers unreadable (they could not be opened) if we try to replace a bookie 
(using BookKeeperAdmin) which happens to be the last bookie in the last 
ensemble.
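
The proposed rule can be sketched as a three-way decision. Again this is only 
an illustration in Java with hypothetical names (ProposedCloseRule, Reply, 
Decision), NoSuchLedger is folded into NO_SUCH_ENTRY, and the ENTRY_EXISTS 
branch is an assumption added so that the function covers every combination of 
replies.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposed rule, not BookKeeper code.
final class ProposedCloseRule {
    // NO_SUCH_ENTRY also stands in for NoSuchLedger here.
    enum Reply { OK, NO_SUCH_ENTRY, COULD_NOT_CONNECT }

    enum Decision { ENTRY_EXISTS, CLOSE_LEDGER, UNDETERMINED }

    // Close only when every bookie in the quorum reports the entry missing.
    // Any unreachable bookie leaves the ledger status undetermined.
    static Decision decide(List<Reply> quorumReplies) {
        boolean allMissing = !quorumReplies.isEmpty();
        for (Reply r : quorumReplies) {
            if (r == Reply.OK) {
                return Decision.ENTRY_EXISTS; // assumption: the entry was found
            }
            if (r != Reply.NO_SUCH_ENTRY) {
                allMissing = false;
            }
        }
        return allMissing ? Decision.CLOSE_LEDGER : Decision.UNDETERMINED;
    }

    public static void main(String[] args) {
        System.out.println(decide(Arrays.asList(
            Reply.NO_SUCH_ENTRY, Reply.NO_SUCH_ENTRY, Reply.NO_SUCH_ENTRY)));
        System.out.println(decide(Arrays.asList(
            Reply.COULD_NOT_CONNECT, Reply.NO_SUCH_ENTRY, Reply.NO_SUCH_ENTRY)));
    }
}
```

With this rule, cases 1) and 2) both come out as UNDETERMINED: 2) no longer 
closes the ledger, but 1) stays stuck while the missing bookie never answers.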

Just wrote down my previous ideas. Comments are welcome.
                
> Ledger will never recover if one of the quorum bookie is down forever and 
> others dont have entry
> ------------------------------------------------------------------------------------------------
>
>                 Key: BOOKKEEPER-365
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-365
>             Project: Bookkeeper
>          Issue Type: Bug
>    Affects Versions: 4.0.0, 4.1.0
>            Reporter: Sijie Guo
>             Fix For: 4.2.0
>
>
> As discussed in BOOKKEEPER-355, current fix to handle the below issue is not 
> correct. Need to find out new solution
> If some bookies of a quorum gone forever, some bookies of this quorum are 
> still alive but doesn't have that entry (NoSuchEntry or NoSuchLedger), then 
> the ledger doesn't have any evidence to recovery/close it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
