Tim Carey-Smith created KAFKA-5120:
--------------------------------------
Summary: Several controller metrics block if controller lock is
held by another thread
Key: KAFKA-5120
URL: https://issues.apache.org/jira/browse/KAFKA-5120
Project: Kafka
Issue Type: Bug
Components: controller, metrics
Affects Versions: 0.10.2.0
Reporter: Tim Carey-Smith
Priority: Minor
We have been tracking latency issues surrounding queries to Controller MBeans.
Upon digging into the root causes, we discovered that several metrics acquire
the controller lock within the gauge.
The affected metrics are:
* {{ActiveControllerCount}}
* {{OfflinePartitionsCount}}
* {{PreferredReplicaImbalanceCount}}
If the controller is currently holding the lock when an MBean request is
received, the thread executing the request blocks until the controller
releases the lock.
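
The blocking behaviour can be reproduced in isolation. The following is only a minimal sketch and not Kafka's actual code: the class and the names {{controllerLock}}, {{offlinePartitions}}, {{readGauge}} and {{measureBlockedRead}} are hypothetical, and a plain {{ReentrantLock}} stands in for the controller lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class BlockingGaugeSketch {
    // Hypothetical stand-in for the controller lock.
    static final ReentrantLock controllerLock = new ReentrantLock();
    // Hypothetical controller state guarded by the lock.
    static int offlinePartitions = 3;

    // A gauge read that acquires the lock before reading the value.
    static int readGauge() {
        controllerLock.lock();
        try {
            return offlinePartitions;
        } finally {
            controllerLock.unlock();
        }
    }

    // Returns how long (in ms) a gauge read waited while another
    // thread held the lock for holdMillis.
    static long measureBlockedRead(long holdMillis) {
        CountDownLatch lockHeld = new CountDownLatch(1);
        Thread controller = new Thread(() -> {
            controllerLock.lock();
            try {
                lockHeld.countDown();       // signal that the lock is now held
                Thread.sleep(holdMillis);   // simulate a long controller operation
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                controllerLock.unlock();
            }
        });
        controller.start();
        try {
            lockHeld.await();               // wait until the "controller" owns the lock
            long start = System.nanoTime();
            readGauge();                    // blocks until the lock is released
            long waited = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            controller.join();
            return waited;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("gauge read waited ~" + measureBlockedRead(200) + " ms");
    }
}
```

In this sketch the time spent in {{readGauge}} tracks how long the other thread holds the lock, which matches the latency pattern we see on the MBean queries.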
We discovered this in a cluster where the controller was holding the lock for
extended periods of time during normal operations. The long lock hold times
are documented separately in KAFKA-5116.
Several possible solutions exist:
* Remove the lock from inside these {{Gauge}}s.
* Store and update the metric values in {{AtomicLong}}s.
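
As a rough illustration of the {{AtomicLong}} option, here is a minimal sketch under stated assumptions. It is not a proposed patch: the class and the names {{controllerLock}}, {{offlinePartitionsCount}}, {{onPartitionStateChange}}, {{readGauge}} and {{readFromOtherThreadWhileLocked}} are hypothetical. The controller publishes the value while it already holds the lock, and the gauge only reads the atomic.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

public class LockFreeGaugeSketch {
    // Hypothetical stand-in for the controller lock.
    static final ReentrantLock controllerLock = new ReentrantLock();
    // Metric value published by the controller, readable without the lock.
    static final AtomicLong offlinePartitionsCount = new AtomicLong(0);

    // Hypothetical update site: runs on the controller thread under the lock.
    static void onPartitionStateChange(long newOfflineCount) {
        controllerLock.lock();
        try {
            // ... state changes would happen here ...
            offlinePartitionsCount.set(newOfflineCount);
        } finally {
            controllerLock.unlock();
        }
    }

    // Gauge read: no lock, so it never waits on the controller.
    static long readGauge() {
        return offlinePartitionsCount.get();
    }

    // Reads the gauge from a different thread while this thread holds
    // the lock; a lock-based gauge would never return here.
    static long readFromOtherThreadWhileLocked() {
        controllerLock.lock();
        try {
            return CompletableFuture.supplyAsync(LockFreeGaugeSketch::readGauge).join();
        } finally {
            controllerLock.unlock();
        }
    }

    public static void main(String[] args) {
        onPartitionStateChange(5);
        System.out.println("read while lock held: " + readFromOtherThreadWhileLocked());
    }
}
```

The tradeoff in this approach is that a reader may see a value that is slightly behind the controller's in-progress state, and every code path that changes the underlying state has to remember to update the atomic.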
Modifying the {{ActiveControllerCount}} metric seems to be straightforward,
while the other two metrics seem to be more involved.
We're happy to contribute a patch, but wanted to discuss potential solutions
and their tradeoffs before proceeding.