hangc0276 opened a new pull request, #4070: URL: https://github.com/apache/bookkeeper/pull/4070
### Motivation

When a bookie decommission is triggered, the interval between checks can reach the maximum of 10 minutes.

```
2023-08-10T13:56:08,911-0400 [main] INFO org.apache.bookkeeper.client.BookKeeperAdmin - Resetting LostBookieRecoveryDelay value: 0, to kickstart audit task
2023-08-10T13:56:50,793-0400 [main] INFO org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 23140
2023-08-10T14:08:47,350-0400 [main] INFO org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 2984
2023-08-10T14:19:02,330-0400 [main] INFO org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 2984
2023-08-10T14:29:17,332-0400 [main] INFO org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 2984
2023-08-10T14:39:32,395-0400 [main] INFO org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 2984
```

This has the following issues:

- Each check waits 10 minutes once the number of ledgers waiting to be replicated exceeds 60, which is too long when decommissioning a small bookie, for example one with only 70 ledgers to replicate.
- The estimated replication time per ledger (`sleepTimePerLedger`) is set to 10 seconds, but a ledger holding little data, such as 100 KB, takes only 2 or 3 seconds to replicate.
- The count of ledgers waiting to be replicated in the first round is inaccurate because those ledgers have not yet been validated by `validateBookieIsNotPartOfEnsemble`.
- In the log above, the first count of ledgers needing replication is `23140`, but after the 10-minute wait the count is `2984`. The first check interval is nevertheless calculated from `23140`.

### Changes

- Reduce the max check interval from 10 minutes to 5 minutes.
- Reduce `sleepTimePerLedger` from 10 seconds to 3 seconds.
- Trigger the `validateBookieIsNotPartOfEnsemble` check in the first round before going to sleep, so that the count of ledgers waiting for replication is accurate.
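
To make the effect of the two constants concrete, here is a minimal sketch of the wait computation. It assumes the interval is the per-ledger sleep time multiplied by the ledger count and capped at the maximum, which matches the "more than 60 ledgers means 10 minutes" behavior described above. The class and method names are hypothetical and are not taken from `BookKeeperAdmin`.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: names and the exact formula are assumptions, not BookKeeperAdmin code.
final class CheckIntervalSketch {

    // Wait before the next check: per-ledger estimate times ledger count, capped at a maximum.
    static long sleepMillis(long ledgersToReplicate, long sleepPerLedgerMillis, long maxSleepMillis) {
        return Math.min(ledgersToReplicate * sleepPerLedgerMillis, maxSleepMillis);
    }

    public static void main(String[] args) {
        // Values before this change: 10 s per ledger, 10 min cap.
        long oldPerLedger = TimeUnit.SECONDS.toMillis(10);
        long oldMax = TimeUnit.MINUTES.toMillis(10);
        // Values after this change: 3 s per ledger, 5 min cap.
        long newPerLedger = TimeUnit.SECONDS.toMillis(3);
        long newMax = TimeUnit.MINUTES.toMillis(5);

        // Ledger counts taken from the motivation text and the log above.
        for (long count : new long[] {70, 2984, 23140}) {
            long before = TimeUnit.MILLISECONDS.toSeconds(sleepMillis(count, oldPerLedger, oldMax));
            long after = TimeUnit.MILLISECONDS.toSeconds(sleepMillis(count, newPerLedger, newMax));
            System.out.println(count + " ledgers: " + before + "s before, " + after + "s after");
        }
    }
}
```

Under these assumptions, a bookie with 70 ledgers would wait 210 seconds instead of 600, and the cap would be reached at 100 ledgers rather than 60.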
