hangc0276 opened a new pull request, #4070:
URL: https://github.com/apache/bookkeeper/pull/4070

   ### Motivation
   When a bookie decommission is triggered, the maximum interval between 
checks of the remaining ledger count is 10 minutes, as the following log shows.
   ```
   2023-08-10T13:56:08,911-0400 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Resetting LostBookieRecoveryDelay value: 0, to kickstart audit task
   2023-08-10T13:56:50,793-0400 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 23140
   2023-08-10T14:08:47,350-0400 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 2984
   2023-08-10T14:19:02,330-0400 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 2984
   2023-08-10T14:29:17,332-0400 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 2984
   2023-08-10T14:39:32,395-0400 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Count of Ledgers which need to be rereplicated: 2984
   ```
   
   The current behavior has the following issues:
   - Each check waits 10 minutes whenever more than 60 ledgers are waiting to 
be replicated, which is too long for a small bookie decommission, for example 
a bookie with only 70 ledgers that need to be replicated.
   - The estimated replication time per ledger (`sleepTimePerLedger`) is set to 
10 seconds, but a ledger holding little data, such as 100 KB, takes only 2 or 
3 seconds to replicate.
   - The count of ledgers waiting to be replicated in the first round is 
inaccurate because those ledgers have not yet been validated by 
`validateBookieIsNotPartOfEnsemble`.
   - The first count of ledgers that need to be replicated is `23140`, but 
after 10 minutes the count is `2984`. The first check interval is nevertheless 
calculated from `23140`.
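
The first two points can be checked with a little arithmetic. The sketch below assumes that the wait before each check is the number of pending ledgers multiplied by `sleepTimePerLedger`, capped at the maximum check interval, which is what the figures above imply (60 ledgers at 10 seconds each is 10 minutes). The class and method names are hypothetical and are not taken from `BookKeeperAdmin`.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not BookKeeper code: models the wait between two
// decommission checks as described in this pull request.
final class CheckIntervalSketch {
    // Values before this change, as stated in the description.
    static final long SLEEP_TIME_PER_LEDGER_MS = TimeUnit.SECONDS.toMillis(10);
    static final long MAX_CHECK_INTERVAL_MS = TimeUnit.MINUTES.toMillis(10);

    // Assumed formula: pending ledgers * per-ledger time, capped at the maximum.
    static long sleepMillis(int pendingLedgers) {
        return Math.min((long) pendingLedgers * SLEEP_TIME_PER_LEDGER_MS,
                MAX_CHECK_INTERVAL_MS);
    }

    public static void main(String[] args) {
        // Ledger counts chosen to sit below, at, and above the 60-ledger cap.
        for (int ledgers : new int[] {10, 60, 70, 23140}) {
            System.out.printf("%d ledgers -> wait %d s%n",
                    ledgers, sleepMillis(ledgers) / 1000);
        }
    }
}
```

Under these assumptions, a decommission with 70 pending ledgers waits the same 600 seconds per check as one with 23140.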
   
   
   ### Changes
   - Reduce the max check interval from 10 minutes to 5 minutes.
   - Reduce `sleepTimePerLedger` from 10 seconds to 3 seconds.
   - Run the `validateBookieIsNotPartOfEnsemble` check in the first round, 
before going to sleep, so that the count of ledgers waiting for replication 
is accurate.
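
To illustrate the effect of the first two changes, the following sketch compares the wait per check before and after. It assumes the wait is the pending ledger count multiplied by `sleepTimePerLedger`, capped at the max check interval; the class and method names are hypothetical and do not come from the BookKeeper code base.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical comparison, not BookKeeper code.
final class CheckIntervalComparison {
    // Assumed formula: pending ledgers * per-ledger time, capped at the maximum.
    static long sleepMillis(int pendingLedgers, long sleepTimePerLedgerMs, long maxIntervalMs) {
        return Math.min((long) pendingLedgers * sleepTimePerLedgerMs, maxIntervalMs);
    }

    // Old settings: 10 seconds per ledger, capped at 10 minutes.
    static long before(int pendingLedgers) {
        return sleepMillis(pendingLedgers,
                TimeUnit.SECONDS.toMillis(10), TimeUnit.MINUTES.toMillis(10));
    }

    // New settings: 3 seconds per ledger, capped at 5 minutes.
    static long after(int pendingLedgers) {
        return sleepMillis(pendingLedgers,
                TimeUnit.SECONDS.toMillis(3), TimeUnit.MINUTES.toMillis(5));
    }

    public static void main(String[] args) {
        // 70 is the small-decommission example; 2984 is a count from the log.
        for (int ledgers : new int[] {70, 2984}) {
            System.out.printf("%d ledgers: before %d s, after %d s%n",
                    ledgers, before(ledgers) / 1000, after(ledgers) / 1000);
        }
    }
}
```

Under these assumptions, the cap is reached at 100 pending ledgers rather than 60, and a 70-ledger decommission waits 210 seconds per check instead of 600.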
   
   


