TessaIO opened a new issue, #17932:
URL: https://github.com/apache/druid/issues/17932

   ### Description
   
   We've recently been experiencing 
[issues](https://github.com/apache/druid/issues/17781) with Coordinator and 
Overlord leadership, which caused most tasks to fail.
   To react to this as quickly as possible, we'd like to add an alert that 
fires when there are two leaders at the same time.
   We've been relying on the `service/heartbeat` metric and its `leader` label. 
However, this setup produces a false positive in the following case:
   1. We have one leader coordinator called A, and B and C are the followers
   2. A has `service/heartbeat{leader="1"} 1`, and B and C have 
`service/heartbeat{leader="0"} 1`
   3. A goes down for some reason, and B becomes the leader
   4. A would then have both `service/heartbeat{leader="1"} 1` and 
`service/heartbeat{leader="0"} 1`, and so would B
   5. At this point, it looks as if we have two leaders, but in reality 
this is just a false positive.
   
   This happens because the metric is a heartbeat metric: when leadership 
changes, it does not update the existing metric/time series but creates a 
new one.
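
The scenario above can be reproduced with a toy model. This is only a sketch, not Druid code: `HeartbeatExporter` and `apparent_leaders` are hypothetical names, and the model assumes the exporter keeps every label set it has ever emitted, which is the behavior described in step 4.

```python
# Toy model, not Druid code: an exporter that keeps every label set it has
# ever emitted, so an old `leader` value lingers after a failover.

class HeartbeatExporter:
    def __init__(self):
        # (host, leader_label) -> last emitted value
        self.series = {}

    def emit_heartbeat(self, host, is_leader):
        # A changed label value addresses a different series, so the old
        # series is left untouched instead of being updated.
        leader_label = "1" if is_leader else "0"
        self.series[(host, leader_label)] = 1


def apparent_leaders(exporter):
    """Hosts that expose a service/heartbeat series with leader="1"."""
    return sorted(
        host
        for (host, leader), value in exporter.series.items()
        if leader == "1" and value == 1
    )


exporter = HeartbeatExporter()

# Steps 1-2: A leads, B and C follow.
for host, is_leader in [("A", True), ("B", False), ("C", False)]:
    exporter.emit_heartbeat(host, is_leader)
before = apparent_leaders(exporter)

# Steps 3-4: leadership moves from A to B; each node emits with its new label.
for host, is_leader in [("A", False), ("B", True), ("C", False)]:
    exporter.emit_heartbeat(host, is_leader)
after = apparent_leaders(exporter)

print(before)  # ['A']
print(after)   # ['A', 'B']
```

An alert that counts series with `leader="1"` would see two leaders after the failover even though only B actually leads.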
   
   To fix this, we need a dedicated metric, `is_leader`, that is updated 
(incremented and decremented) whenever leadership changes.
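
A minimal sketch of the proposed behavior, assuming one gauge value per node that is incremented when the node gains leadership and decremented when it loses it. `IsLeaderGauge`, its callbacks, and `should_alert` are hypothetical names for illustration, not existing Druid APIs.

```python
# Hypothetical sketch of an `is_leader` gauge: one value per node, updated in
# place on leadership changes rather than creating a new series.

class IsLeaderGauge:
    def __init__(self, hosts):
        # Every node starts as a follower (0).
        self.value = {host: 0 for host in hosts}

    def on_become_leader(self, host):
        self.value[host] += 1

    def on_stop_being_leader(self, host):
        self.value[host] -= 1


def leader_count(gauge):
    return sum(gauge.value.values())


def should_alert(gauge):
    # More than one node reporting is_leader=1 means two leaders at once.
    return leader_count(gauge) > 1


gauge = IsLeaderGauge(["A", "B", "C"])

# A is elected, then loses leadership and B takes over.
gauge.on_become_leader("A")
gauge.on_stop_being_leader("A")
gauge.on_become_leader("B")
after_failover = (leader_count(gauge), should_alert(gauge))

# A genuine double-leader situation: A claims leadership while B still holds it.
gauge.on_become_leader("A")
double_leader = (leader_count(gauge), should_alert(gauge))

print(after_failover)  # (1, False)
print(double_leader)   # (2, True)
```

Because the same value is updated on every change, a normal failover leaves exactly one node at 1, and the alert fires only when two nodes report leadership simultaneously.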
   
   ### Motivation
   
   Improve the metrics and monitoring of the system so that we can act faster 
in case of incidents.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

