[
https://issues.apache.org/jira/browse/BOOKKEEPER-946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sijie Guo resolved BOOKKEEPER-946.
----------------------------------
Resolution: Fixed
Issue resolved by merging pull request 82
[https://github.com/apache/bookkeeper/pull/82]
{noformat}
commit 669ab4ac32bcbf6b3d883a07ed942d36d25b8a6e
Author: Rithin <[email protected]>
AuthorDate: Fri Dec 16 17:44:24 2016 -0800
Commit: Sijie Guo <[email protected]>
CommitDate: Fri Dec 16 17:44:24 2016 -0800
BOOKKEEPER-946: Provide an option to delay auto recovery of lost bookies
Fixing a bug in the test
AuditorLedgerCheckerTest.testDelayedAuditOfLostBookies which
fails sometimes with:
AuditorLedgerCheckerTest.testDelayedAuditOfLostBookies:367->_testDelayedAuditOfLostBookies:345
audit of lost bookie isn't delayed
Author: Rithin <[email protected]>
Reviewers: Enrico Olivelli <[email protected]>, Sijie Guo
<[email protected]>
Closes #82 from rithin-shetty/audit_delay_fix
{noformat}
> Provide an option to delay auto recovery of lost bookies
> --------------------------------------------------------
>
> Key: BOOKKEEPER-946
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-946
> Project: Bookkeeper
> Issue Type: Improvement
> Components: bookkeeper-server
> Affects Versions: 4.5.0
> Reporter: Rithin Shetty
> Assignee: Rithin Shetty
> Fix For: 4.5.0
>
> Attachments:
> org.apache.bookkeeper.replication.AuditorLedgerCheckerTest-output.txt,
> org.apache.bookkeeper.replication.AuditorLedgerCheckerTest-output.txt
>
>
> If auto recovery is enabled and a bookie goes down for an upgrade, or even
> just loses its zk connection intermittently, the auditor detects it as a lost
> bookie and starts under-replication detection, and the replication workers on
> other bookie nodes start replicating the under-replicated ledgers. All of this
> stops once the bookie comes back up, but by then a few ledgers will already
> have been replicated. Since we have multiple copies of the data, it is
> probably not necessary to start recovery as soon as a bookie goes down.
> We can wait for an hour or so and then start recovery. This should
> cover cases like planned upgrades, intermittent network connectivity loss,
> etc. The amount of time to wait can be an option, and the default would be to
> not wait at all (i.e., retain the current behavior).
> Of course, if more than one bookie goes down within a short interval, we
> could decide to start auto recovery without waiting.
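
The decision rule described above could be sketched roughly as follows. This is only an illustration of the proposal, not the code merged in pull request 82: the class name LostBookieAuditPolicy, its methods, and the choice of seconds as the unit are all assumptions made for the example.

```java
/**
 * Hypothetical helper for illustration only; not a BookKeeper class.
 * It captures the rule from the description: wait a configurable time
 * after a single bookie is lost, but do not wait when more than one
 * bookie is lost, and do not wait at all when the delay is 0 (default).
 */
final class LostBookieAuditPolicy {
    // Assumed unit: seconds. A value of 0 retains the current behavior.
    private final long delaySeconds;

    LostBookieAuditPolicy(long delaySeconds) {
        if (delaySeconds < 0) {
            throw new IllegalArgumentException("delaySeconds must be >= 0");
        }
        this.delaySeconds = delaySeconds;
    }

    /** Seconds to wait before auditing, given how many bookies are currently lost. */
    long delayBeforeAuditSeconds(int lostBookieCount) {
        if (lostBookieCount <= 0) {
            throw new IllegalArgumentException("lostBookieCount must be > 0");
        }
        // More than one lost bookie: start recovery without waiting.
        if (lostBookieCount > 1) {
            return 0L;
        }
        return delaySeconds;
    }

    /** True if an audit should start now; false if no bookie is lost or the delay has not elapsed. */
    boolean shouldStartAudit(int lostBookieCount, long secondsSinceLoss) {
        if (lostBookieCount <= 0) {
            return false; // the bookie came back, so nothing to recover
        }
        return secondsSinceLoss >= delayBeforeAuditSeconds(lostBookieCount);
    }
}
```

With a one-hour delay, a single bookie that returns within the hour never triggers replication, while losing a second bookie in the meantime starts the audit immediately.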
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)