Siddharth Sunil Boobna created BOOKKEEPER-889:
-------------------------------------------------

             Summary: BookKeeper client should try not to use bookies with 
errors/timeouts when forming a new ensemble
                 Key: BOOKKEEPER-889
                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-889
             Project: Bookkeeper
          Issue Type: Improvement
          Components: bookkeeper-client
    Affects Versions: 4.3.2
            Reporter: Siddharth Sunil Boobna


Due to various issues (slow disks, network issues, bugs, etc), the bookkeeper 
can be slow or unresponsive for extended period of times. During this time, r/w 
operations will fail/timeout and ledgers will create a new segment and form a 
new ensemble replacing this bookie. For new ledgers, it might still pick up 
this bookie or we can replace this bookie with another faulty bookie if we have 
multiple faulty bookies. 
The BK client should keep stats about these failure rates for all the bookies 
and it should "quarantine" failing bookies for a certain amount of time. Once a 
bookie is quarantined, it will not be picked up in forming a new ensemble, 
unless no other "healthy" bookies are available.

Solution:
Keep a counter of errors in the bookie client pool and periodically check for 
number of errors in a given time span and mark these bookies as "quarantined" 
in the BookieWatcher.
In the BookieWatcher, try to create an ensemble list excluding the quarantined 
bookies and if that fails, fall back to an empty exclusion list.
We will also remove the bookies from the quarantined list after a configurable 
period of time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to