Rithin Shetty created BOOKKEEPER-946:
----------------------------------------

             Summary: Provide an option to delay auto recovery of lost bookies
                 Key: BOOKKEEPER-946
                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-946
             Project: Bookkeeper
          Issue Type: Improvement
          Components: bookkeeper-server
    Affects Versions: 4.5.0
            Reporter: Rithin Shetty
            Assignee: Rithin Shetty
            Priority: Minor
             Fix For: 4.5.0


If auto recovery is enabled and a bookie goes down for an upgrade, or even just 
loses its ZooKeeper connection intermittently, the auditor treats it as a lost 
bookie and starts under-replication detection, and the replication workers on 
the other bookie nodes start re-replicating the under-replicated ledgers. All 
of this stops once the bookie comes back up, but by then a few ledgers have 
already been replicated. Given that we keep multiple copies of the data, it is 
probably not necessary to start recovery as soon as a bookie goes down. We 
could wait for an hour or so and then start recovery. This should cover cases 
like planned upgrades, intermittent network connectivity loss, etc. The amount 
of time to wait can be a configuration option, and the default would be to not 
wait at all (i.e. retain the current behavior).

Of course, if more than one bookie goes down within a short interval, we could 
decide to start auto recovery without waiting.
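
A rough sketch of the decision logic described above, to make the proposal 
concrete. This is only an illustration, not the auditor's actual code: the 
class name DelayedRecoveryPolicy and all of its methods are hypothetical, and 
"more than one bookie goes down within a short interval" is simplified here to 
"more than one bookie is currently marked lost".

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a delayed auto-recovery decision.
 * None of these names come from BookKeeper itself.
 */
final class DelayedRecoveryPolicy {
    // How long to wait after a single bookie is lost before recovering.
    // 0 retains the current behavior: start recovery immediately.
    private final long delayMillis;

    // Bookie id -> time (ms) at which it was first seen as lost.
    private final Map<String, Long> lostSince = new HashMap<>();

    DelayedRecoveryPolicy(long delayMillis) {
        if (delayMillis < 0) {
            throw new IllegalArgumentException("delayMillis must be >= 0");
        }
        this.delayMillis = delayMillis;
    }

    /** Record that a bookie was detected as lost; keeps the earliest time. */
    void bookieLost(String bookieId, long nowMillis) {
        lostSince.putIfAbsent(bookieId, nowMillis);
    }

    /** Record that a bookie came back (e.g. the upgrade finished). */
    void bookieReturned(String bookieId) {
        lostSince.remove(bookieId);
    }

    /** Decide whether recovery should start at the given time. */
    boolean shouldStartRecovery(long nowMillis) {
        if (lostSince.isEmpty()) {
            return false; // nothing is lost, nothing to recover
        }
        if (lostSince.size() > 1) {
            return true; // assumption: several bookies lost, do not wait
        }
        long since = lostSince.values().iterator().next();
        return nowMillis - since >= delayMillis;
    }
}
```

With a one-hour delay, a bookie that returns within the hour never triggers 
recovery, while a second lost bookie triggers it right away. A delay of 0 
behaves as auto recovery does today.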




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)