[ https://issues.apache.org/jira/browse/BOOKKEEPER-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691492#comment-15691492 ]
ASF GitHub Bot commented on BOOKKEEPER-946:
-------------------------------------------

GitHub user rithin-shetty opened a pull request:

    https://github.com/apache/bookkeeper/pull/82

    BOOKKEEPER-946 Provide an option to delay auto recovery of lost bookies

    Fixing a bug in the test AuditorLedgerCheckerTest.testDelayedAuditOfLostBookies which fails sometimes with:
    AuditorLedgerCheckerTest.testDelayedAuditOfLostBookies:367->_testDelayedAuditOfLostBookies:345 audit of lost bookie isn't delayed

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rithin-shetty/bookkeeper audit_delay_fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/bookkeeper/pull/82.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #82

----
commit db3599645dc40fc0478094a862f14c4a23274f48
Author: Rithin <rithin.she...@salesforce.com>
Date:   2016-11-23T18:57:58Z

    BOOKKEEPER-946 Provide an option to delay auto recovery of lost bookies

    Fixing a bug in the test AuditorLedgerCheckerTest.testDelayedAuditOfLostBookies which fails sometimes with:
    AuditorLedgerCheckerTest.testDelayedAuditOfLostBookies:367->_testDelayedAuditOfLostBookies:345 audit of lost bookie isn't delayed

----

> Provide an option to delay auto recovery of lost bookies
> --------------------------------------------------------
>
>                 Key: BOOKKEEPER-946
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-946
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-server
>    Affects Versions: 4.5.0
>            Reporter: Rithin Shetty
>            Assignee: Rithin Shetty
>             Fix For: 4.5.0
>
>         Attachments: org.apache.bookkeeper.replication.AuditorLedgerCheckerTest-output.txt,
>                      org.apache.bookkeeper.replication.AuditorLedgerCheckerTest-output.txt
>
>
> If auto recovery is enabled and a bookie goes down for an upgrade, or even
> if it loses its zk connection intermittently, the auditor detects it as a
> lost bookie and starts under-replication detection, and the replication
> workers on other bookie nodes start replicating the under-replicated
> ledgers. All of this stops once the bookie comes back up, but by then a few
> ledgers would already have been replicated. Given that we have multiple
> copies of the data, it is probably not necessary to start recovery as soon
> as a bookie goes down. We can probably wait for an hour or so and then start
> recovery. This should cover cases like planned upgrades, intermittent
> network connectivity loss, etc. The amount of time to wait can be an option,
> and the default would be to not wait at all (i.e. retain the current
> behavior).
> Of course, if more than one bookie goes down within a short interval, we
> could decide to start auto recovery without waiting.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
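
The behavior proposed in the issue description can be sketched as a small decision rule. This is only an illustration under stated assumptions, not the code in the pull request or in the BookKeeper auditor: the class DelayedAuditSketch, its methods, and the choice to treat "more than one bookie currently missing" as the trigger for skipping the delay are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the delay rule described in BOOKKEEPER-946.
 * Not the BookKeeper Auditor implementation; all names are illustrative.
 */
final class DelayedAuditSketch {
    // Configured wait before auditing a single lost bookie.
    // 0 means "do not wait at all", i.e. the pre-existing behavior.
    private final long delayMillis;

    // Bookie id -> time (ms) at which it was first seen missing.
    private final Map<String, Long> lostSince = new HashMap<>();

    DelayedAuditSketch(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /** Record that a bookie disappeared; keeps the earliest timestamp. */
    void bookieLost(String bookieId, long nowMillis) {
        lostSince.putIfAbsent(bookieId, nowMillis);
    }

    /** A bookie that comes back before the delay expires cancels its pending audit. */
    void bookieReturned(String bookieId) {
        lostSince.remove(bookieId);
    }

    /** Decide whether under-replication detection should start now. */
    boolean shouldAudit(long nowMillis) {
        if (lostSince.isEmpty()) {
            return false; // nothing is missing
        }
        if (delayMillis == 0 || lostSince.size() > 1) {
            return true; // no delay configured, or several bookies are down at once
        }
        long since = lostSince.values().iterator().next();
        return nowMillis - since >= delayMillis; // single bookie: wait out the delay
    }
}
```

With a delay of 1000 ms, a single bookie lost at t=0 does not trigger an audit at t=500 but does at t=1000; if it returns in between, no audit is triggered. A real auditor would drive this from a scheduler and ZooKeeper watches rather than from explicit timestamps.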