frankjkelly opened a new issue #13971:
URL: https://github.com/apache/pulsar/issues/13971


   **Is your feature request related to a problem? Please describe.**
   Recently two of our nine bookies were quarantined, and we did not find out
until we noticed the following warnings in the broker logs:
   
   ```
   platform-pulsar-broker-1 platform-pulsar-broker 16:53:05.284 [BookKeeperClientScheduler-OrderedScheduler-0-0] WARN  org.apache.bookkeeper.client.BookieWatcherImpl - Bookie platform-pulsar-bookkeeper-1.platform-pulsar-bookkeeper.cogito-load.svc.cluster.local:3181 has been quarantined because of read/write errors.
   platform-pulsar-broker-1 platform-pulsar-broker 16:53:05.284 [BookKeeperClientScheduler-OrderedScheduler-0-0] WARN  org.apache.bookkeeper.client.BookieWatcherImpl - Bookie platform-pulsar-bookkeeper-8.platform-pulsar-bookkeeper.cogito-load.svc.cluster.local:3181 has been quarantined because of read/write errors.
   ```
   When these bookies are quarantined, the overall throughput scalability of our
system is reduced.
   We would like a way to monitor for quarantined bookies and to be alerted
whenever a bookie is quarantined.
   
   **Describe the solution you'd like**
   We'd like to see a Prometheus metric added that indicates the number of
bookies that are currently quarantined.
   Once that metric is available, we can use it in Prometheus alerting to notify
us of the problem.
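
To make the request concrete, here is a rough sketch of the kind of gauge we have in mind. It is not existing Pulsar or BookKeeper code: the class `QuarantinedBookieGauge`, its methods, and the metric name `pulsar_bookie_quarantined_count` are all hypothetical, and it assumes the broker can learn when a bookie is quarantined and for how long.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker for illustration; not part of Pulsar or BookKeeper. */
class QuarantinedBookieGauge {
    // Bookie address -> epoch millis at which its quarantine ends.
    private final Map<String, Long> quarantinedUntil = new ConcurrentHashMap<>();

    /** Record that a bookie was quarantined at nowMillis for durationMillis. */
    void quarantine(String bookie, long nowMillis, long durationMillis) {
        quarantinedUntil.put(bookie, nowMillis + durationMillis);
    }

    /** Count bookies still quarantined at nowMillis, dropping expired entries. */
    int count(long nowMillis) {
        quarantinedUntil.values().removeIf(until -> until <= nowMillis);
        return quarantinedUntil.size();
    }

    /** Render the gauge in the Prometheus text exposition format. */
    String render(long nowMillis) {
        String name = "pulsar_bookie_quarantined_count"; // assumed metric name
        return "# TYPE " + name + " gauge\n" + name + " " + count(nowMillis) + "\n";
    }
}
```

With a gauge like this exposed, an alert could fire whenever the value stays above zero, for example on an expression such as `pulsar_bookie_quarantined_count > 0` (again using the assumed metric name).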
   
   **Describe alternatives you've considered**
   None that we can think of.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

