[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550139#comment-15550139
 ] 

Rithin Shetty commented on BOOKKEEPER-946:
------------------------------------------

Yes, that's what is happening. I didn't mean to do this; there is a bug in the 
code. Thanks for catching it. What we want is that, as long as we are doing a 
rolling upgrade, i.e. only one bookie is down at a time, the audit is delayed 
by the configured time period. I'll send out updated code in which an audit is 
scheduled when the first bookie goes down. If that bookie is subsequently 
brought up and a different bookie goes down, no new audit is started, because 
still only one bookie is down at that time. The audit that was scheduled when 
the first bookie went down will run once the configured delay has elapsed. So 
the way to use this feature would be to set the delay to something like 1 hour 
and then finish the rolling upgrade of the cluster within that hour. An audit 
will run at the end of the hour, which will make sure that no ledgers are 
missing.
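
To make the intended behavior concrete, here is a rough sketch of the 
scheduling rule described above. It is not the actual patch or the real 
Auditor code: the class DelayedAuditSketch and all of its method names are 
made up for illustration, time is passed in explicitly instead of using a real 
timer, and the "more than one bookie down means audit immediately" rule is 
taken from the issue description.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the delayed-audit rule; not BookKeeper's Auditor. */
public class DelayedAuditSketch {
    private final long delayMs;                        // configured delay; 0 = audit at once
    private final Set<String> downBookies = new HashSet<>();
    private long auditDueAtMs = -1;                    // -1 means no audit is pending
    private int auditsRun = 0;

    public DelayedAuditSketch(long delayMs) {
        this.delayMs = delayMs;
    }

    /** Called when a bookie is detected as lost. */
    public void bookieDown(String bookie, long nowMs) {
        downBookies.add(bookie);
        if (downBookies.size() > 1 || delayMs == 0) {
            // More than one bookie down (or no delay configured): do not wait.
            runAudit();
        } else if (auditDueAtMs < 0) {
            // First lost bookie: schedule a single delayed audit.
            auditDueAtMs = nowMs + delayMs;
        }
        // Otherwise an audit is already pending and its deadline is kept.
    }

    /** Called when a bookie comes back; the pending audit is deliberately kept. */
    public void bookieUp(String bookie) {
        downBookies.remove(bookie);
    }

    /** Stand-in for the timer firing: runs the pending audit once it is due. */
    public void tick(long nowMs) {
        if (auditDueAtMs >= 0 && nowMs >= auditDueAtMs) {
            runAudit();
        }
    }

    private void runAudit() {
        auditDueAtMs = -1;
        auditsRun++;   // a real auditor would check ledgers for under-replication here
    }

    public int auditsRun() {
        return auditsRun;
    }

    public boolean auditPending() {
        return auditDueAtMs >= 0;
    }
}
```

With a one-hour delay, taking bookies down and up one at a time leaves the 
single pending audit untouched until the hour is over, while a second bookie 
going down before the first is back triggers the audit right away.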

> Provide an option to delay auto recovery of lost bookies
> --------------------------------------------------------
>
>                 Key: BOOKKEEPER-946
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-946
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.5.0
>            Reporter: Rithin Shetty
>            Assignee: Rithin Shetty
>            Priority: Minor
>             Fix For: 4.5.0
>
>         Attachments: 
> org.apache.bookkeeper.replication.AuditorLedgerCheckerTest-output.txt, 
> org.apache.bookkeeper.replication.AuditorLedgerCheckerTest-output.txt
>
>
> If auto recovery is enabled and a bookie goes down for an upgrade, or even 
> if it loses its zk connection intermittently, the auditor detects it as a 
> lost bookie and starts under-replication detection, and the replication 
> workers on other bookie nodes start replicating the under-replicated 
> ledgers. All of this stops once the bookie comes up, but by then a few 
> ledgers will already have been replicated. Given the fact that we have 
> multiple copies of data, it is 
> probably not necessary to start the recovery as soon as a bookie goes down. 
> We can probably wait for an hour or so and then start recovery. This should 
> cover cases like planned upgrade, intermittent network connectivity loss, 
> etc. The amount of time to wait can be an option, and the default would be 
> to not wait at all (i.e. retain current behavior).
> Of course, if more than one bookie goes down within a short interval, we 
> could decide to start auto recovery without waiting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
