[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013989#comment-14013989
 ] 

Rakesh R commented on BOOKKEEPER-745:
-------------------------------------

{quote}
The reason I put it after the generateBookie2LedgersIndex() is that this method 
can run for a long time. So it could be running when a rolling restart begins, 
and then the ledgers would be marked while autoreplication is disabled. Putting 
the wait after, and having the bk2ledger map a little stale is ok though, 
because we are only looking for the ledgers which are on the bookie that failed
{quote}
I observed this flow while using the IP-to-hostname renaming tool from 
BOOKKEEPER-639. In that case, after autoreplication is enabled, the auditor 
compares the old ledger bookie ids (IP as bookieId) with the new set of 
available bookies (hostname as bookieId) and treats every old id as a lost 
bookie. The auditor then publishes these bookies and their ledgers, and the 
RWs compete with each other for the urLock and will just do 
markLedgerUnderreplicated. In the worst case, if there are many ledgers in the 
system, an unnecessary re-replication cycle will run for a long time across 
all the ledgers. To avoid this, I think the simple approach is to reverse 
these statements, or could we find some other way?

{code}
// Auditor.java
        List<String> availableBookies = getAvailableBookies();
        // find lost bookies
        Set<String> knownBookies = ledgerDetails.keySet();
        Collection<String> lostBookies = CollectionUtils.subtract(knownBookies,
                availableBookies);
{code}
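
To make the id mismatch concrete, here is a minimal, self-contained sketch of the same set difference. It uses plain java.util instead of CollectionUtils, and the class name, method name, and bookie addresses are all hypothetical illustrations rather than anything taken from Auditor.java.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-alone sketch; not the actual Auditor implementation.
class LostBookieSketch {

    // Same idea as the subtract call above: known bookies minus available bookies.
    static Set<String> findLostBookies(Set<String> knownBookies,
                                       Collection<String> availableBookies) {
        Set<String> lost = new TreeSet<>(knownBookies);
        lost.removeAll(availableBookies);
        return lost;
    }

    public static void main(String[] args) {
        // Ledger metadata still refers to bookies by IP (made-up addresses).
        Set<String> knownByIp = new HashSet<>(
                Arrays.asList("10.0.0.1:3181", "10.0.0.2:3181"));

        // After the rename, the same bookies register under hostnames.
        List<String> availableByHostname = Arrays.asList(
                "bk1.example.com:3181", "bk2.example.com:3181");

        // No id matches, so every known bookie is reported as lost
        // even though both bookies are actually up.
        System.out.println("renamed cluster, lost = "
                + findLostBookies(knownByIp, availableByHostname));

        // With consistent ids, only the bookie that is really down is reported.
        Set<String> known = new HashSet<>(Arrays.asList("bk1", "bk2", "bk3"));
        List<String> available = Arrays.asList("bk1", "bk2");
        System.out.println("consistent ids, lost = "
                + findLostBookies(known, available));
    }
}
```

In the first case the difference contains every known bookie, which is the situation that would trigger markLedgerUnderreplicated for all of their ledgers.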

> Fix for false reports of ledger unreplication during rolling restarts.
> ----------------------------------------------------------------------
>
>                 Key: BOOKKEEPER-745
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-745
>             Project: Bookkeeper
>          Issue Type: Bug
>          Components: bookkeeper-auto-recovery
>            Reporter: Ivan Kelly
>            Assignee: Ivan Kelly
>             Fix For: 4.3.0, 4.2.3
>
>         Attachments: 
> 0001-Fix-for-false-reports-of-ledger-unreplication-.trunk.patch, 
> 0001-Fix-for-false-reports-of-ledger-unreplication-.trunk.patch, 
> 0002-Fix-for-false-reports-of-ledger-unreplication-.trunk.patch, 
> 0004-Fix-for-false-reports-of-ledger-unreplication-.trunk.patch, 
> 0006-Fix-for-false-reports-of-ledger-unreplicat.branch4.2.patch
>
>
> The bug occurred because there was no check of whether rereplication was 
> enabled when the auditor came online. When the auditor comes online, it 
> checks which bookies are up and marks the ledgers on missing bookies as 
> underreplicated. In the false-report case, the auditor was running after each 
> bookie was bounced, due to the way leader election for the auditor works. 
> Since one bookie is down while its server is being bounced, all ledgers on 
> that bookie get marked as underreplicated.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
