[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15650885#comment-15650885
 ] 

Hudson commented on BOOKKEEPER-946:
-----------------------------------

SUCCESS: Integrated in Jenkins build bookkeeper-master #1560 (See 
[https://builds.apache.org/job/bookkeeper-master/1560/])
BOOKKEEPER-946: Provide an option to delay auto recovery of lost bookies 
(sijie: rev 0abf37c64ced0fe49a6470bc0e2be632e47902d6)
* (edit) bookkeeper-server/src/test/java/org/apache/bookkeeper/replication/AuditorLedgerCheckerTest.java
* (edit) bookkeeper-server/src/main/java/org/apache/bookkeeper/conf/ServerConfiguration.java
* (edit) bookkeeper-server/conf/bk_server.conf
* (edit) bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationStats.java
* (edit) bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/Auditor.java


> Provide an option to delay auto recovery of lost bookies
> --------------------------------------------------------
>
>                 Key: BOOKKEEPER-946
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-946
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.5.0
>            Reporter: Rithin Shetty
>            Assignee: Rithin Shetty
>            Priority: Minor
>             Fix For: 4.5.0
>
>         Attachments: 
> org.apache.bookkeeper.replication.AuditorLedgerCheckerTest-output.txt, 
> org.apache.bookkeeper.replication.AuditorLedgerCheckerTest-output.txt
>
>
> If auto recovery is enabled and a bookie goes down for an upgrade, or even 
> loses its ZooKeeper connection intermittently, the auditor detects it as a 
> lost bookie and starts under-replication detection, and the replication 
> workers on other bookie nodes start replicating the under-replicated 
> ledgers. All of this stops once the bookie comes back up, but by then a few 
> ledgers will already have been replicated. Given that we have multiple 
> copies of the data, it is probably not necessary to start recovery as soon 
> as a bookie goes down. We could wait for an hour or so and then start 
> recovery. This should cover cases like planned upgrades, intermittent 
> network connectivity loss, etc. The amount of time to wait can be a 
> configurable option, and the default would be to not wait at all (i.e. 
> retain the current behavior).
> Of course, if more than one bookie goes down within a short interval, we 
> could decide to start auto recovery without waiting.
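
For illustration only, the decision rule proposed in the description could be
sketched roughly as below. This is not the code from the commit: the class and
method names are hypothetical, and "more than one bookie goes down within a
short interval" is simplified here to "more than one bookie is currently
missing".

```java
/**
 * Hypothetical sketch of the proposed delay rule. None of these names
 * are taken from the BookKeeper code base.
 */
final class LostBookieDelayPolicy {
    // Seconds to wait after a single bookie is lost; 0 means "do not wait",
    // which retains the behavior from before this option existed.
    private final long delaySeconds;

    LostBookieDelayPolicy(long delaySeconds) {
        if (delaySeconds < 0) {
            throw new IllegalArgumentException("delaySeconds must be >= 0");
        }
        this.delaySeconds = delaySeconds;
    }

    /**
     * Decides whether recovery should start now.
     *
     * @param lostBookieCount  number of bookies currently considered lost
     * @param secondsSinceLoss seconds elapsed since the first loss was seen
     */
    boolean shouldStartRecovery(int lostBookieCount, long secondsSinceLoss) {
        if (lostBookieCount <= 0) {
            return false; // the bookie came back, nothing to recover
        }
        if (lostBookieCount > 1) {
            return true;  // several bookies lost: do not wait
        }
        return secondsSinceLoss >= delaySeconds;
    }
}
```

With a one-hour delay, a single bookie that has been missing for a minute
would not trigger recovery, while a second lost bookie would trigger it
immediately. A delay of 0 makes every loss trigger recovery at once.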



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
