Lukasz Mierzwa created KAFKA-6436:
-------------------------------------
Summary: Provide a metric indicating broker cluster membership state
Key: KAFKA-6436
URL: https://issues.apache.org/jira/browse/KAFKA-6436
Project: Kafka
Issue Type: Wish
Components: metrics
Reporter: Lukasz Mierzwa
Priority: Minor
When deploying Kafka config changes, each instance needs to be restarted (since
there's no graceful reload), and that requires coordination to keep all
partitions online. Part of my automation waits after restarting each instance
until the restarted broker is back in sync on all partitions. To do that I
query for:
{noformat}
kafka.server:name=BrokerState,type=KafkaServer to be 3 (broker is up & running)
kafka.server:clientId=Replica,name=MaxLag,type=ReplicaFetcherManager = 0
(there's no lag)
{noformat}
I've noticed that there's a race for the MaxLag metric: when replica fetcher
threads are starting, this metric is initialized with a value of 0, then (I
assume) once all threads connect to the leaders it's populated with the
"correct" MaxLag value computed from all those threads. This means that there's
a window where I can query for those metrics and get the expected BrokerState=3
and MaxLag=0, which I would interpret as "done restarting this instance", but a
few seconds later MaxLag might jump to a huge value.
Right now my workaround is to require multiple queries to return expected
metric values, which seems to protect me from hitting that window.
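For illustration, the workaround could be sketched roughly as below. This is only a minimal sketch, not my actual automation: `read_metrics` is a hypothetical callable that returns the current (BrokerState, MaxLag) pair from whatever JMX client is in use, and the streak length, polling interval and timeout are assumed values.

```python
import time
from typing import Callable, Tuple

RUNNING_AS_BROKER = 3  # BrokerState value meaning "broker is up & running"


def wait_until_in_sync(
    read_metrics: Callable[[], Tuple[int, int]],  # hypothetical: returns (BrokerState, MaxLag)
    required_consecutive: int = 5,   # assumed streak length
    interval_s: float = 2.0,         # assumed delay between queries
    timeout_s: float = 600.0,        # assumed overall deadline
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once the expected values were seen on
    `required_consecutive` queries in a row, False on timeout."""
    deadline = clock() + timeout_s
    streak = 0
    while clock() < deadline:
        broker_state, max_lag = read_metrics()
        if broker_state == RUNNING_AS_BROKER and max_lag == 0:
            streak += 1
            if streak >= required_consecutive:
                return True
        else:
            # Any unexpected reading (e.g. MaxLag jumping up) resets the streak.
            streak = 0
        sleep(interval_s)
    return False
```

A single MaxLag=0 reading taken inside the startup window is not enough to pass; the streak has to outlast the window, so the interval and streak length need to be tuned to how long the fetchers take to connect.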
It would be nice if there was a metric like "ClusterState", initialized as 0,
that would be set to 1 only once all replica fetcher threads have started and
finished reconnecting to the leaders, and the proper MaxLag is set (or there
are no replicas on the given broker).
Alternatively, MaxLag could just be initialized with -1 and set to 0 later if
that's the actual max lag computed after getting replication offsets from the
leaders (if that would work).
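To show how a client could tell the states apart if MaxLag started at -1, here is a small sketch. The -1 sentinel is only the behaviour proposed in this ticket, not something Kafka does today, and the function name and status labels are made up for the example.

```python
UNKNOWN_LAG = -1        # hypothetical sentinel proposed above; not current Kafka behaviour
RUNNING_AS_BROKER = 3   # BrokerState value meaning "broker is up & running"


def lag_status(broker_state: int, max_lag: int) -> str:
    """Classify a (BrokerState, MaxLag) reading under the proposed semantics."""
    if broker_state != RUNNING_AS_BROKER:
        return "starting"
    if max_lag == UNKNOWN_LAG:
        # Fetchers have not reported a real lag yet, so 0 cannot be trusted.
        return "lag-unknown"
    if max_lag == 0:
        return "in-sync"
    return "catching-up"
```

With this distinction a single query would be enough, since "lag-unknown" and "in-sync" could no longer be confused during startup.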
If there was a "ClusterState" metric, it could also be used to signal when a
broker loses connectivity with the rest of the cluster. I don't think there is
such a metric right now (is there?).
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)