Thanks for the KIP, Dong. The general idea is good. In fact, two of the three metrics had been listed under future work for KIP-143:
"KAFKA-5028 introduced a queue for Controller events. It would be useful to have a gauge for the queue size and a histogram for how long an event waits in the queue before being processed. However, we are in the process of making additional changes to improve the handling of soft failures and there's a possibility that the controller queue could be replaced by a broker queue for all ZK communication. We will see how that develops before deciding which metrics should be exposed. In the meantime, the ControllerState and other metrics should provide enough information to issue an alert if the Controller is not healthy." It seems like the conclusion is that we won't have a global broker queue, but it would be good for Jun and Onur to confirm this. One minor comment: 1. For metric 1 and 2 in the KIP, do we want the type to be ControllerEventManager or should it be ControllerStats like many other Controller metrics? Ismael On Thu, Dec 7, 2017 at 4:21 AM, Dong Lin <lindon...@gmail.com> wrote: > Hi all, > > I have created KIP-237: More Controller Health Metrics > https://cwiki.apache.org/confluence/display/KAFKA/KIP- > 237%3A+More+Controller+Health+Metrics > . > > The KIP proposes to add a few more metrics to help monitor Kafka Controller > health. Feedback and suggestions are welcome! > > Thanks, > Dong >