TessaIO opened a new issue, #17932: URL: https://github.com/apache/druid/issues/17932
### Description

We have recently been experiencing [issues](https://github.com/apache/druid/issues/17781) with Coordinator and Overlord leadership, which caused most tasks to fail. To react to this as quickly as possible, we want an alert that fires when two leaders exist at the same time. We have been relying on the `service/heartbeat` metric and its `leader` label. However, the current setup has a problem in the following case:

1. We have one leader Coordinator called A, and B and C are the followers.
2. A has `service/heartbeat{leader="1"} 1`, and B and C have `service/heartbeat{leader="0"} 1`.
3. A goes down for some reason, and B becomes the leader.
4. A would then have both `service/heartbeat{leader="1"} 1` and `service/heartbeat{leader="0"} 1`, and B would have the same two series.
5. At this point it looks as if we have two leaders, but in reality this is a false positive.

This happens because `service/heartbeat` is a heartbeat metric: when leadership changes, it does not update the existing metric/time series but creates a new one with a different `leader` label value.

To fix this, we need a dedicated metric, `is_leader`, that is updated (incremented and decremented) whenever leadership changes.

### Motivation

Improve the metrics and monitoring of the system so that we can act faster in case of incidents.
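
The failover scenario described in the Description can be reproduced with a small simulation. This is only a sketch under stated assumptions, not Druid's emitter code: `Registry`, `emit_heartbeat`, `emit_is_leader`, and the `host` label are hypothetical names, and the toy store is assumed to keep every series it has ever seen, which is how the stale `leader="1"` series lingers in the case above.

```python
class Registry:
    """Toy metric store (hypothetical): one sample per (name, label set).

    Assumption: series are never expired, so a series that is no longer
    updated keeps its last value.
    """

    def __init__(self):
        self.series = {}

    def set(self, name, labels, value):
        # A different label set is a different key, hence a new series.
        key = (name, tuple(sorted(labels.items())))
        self.series[key] = value

    def select(self, name, **match):
        # Return (labels, value) for every series matching the label filter.
        out = []
        for (series_name, labels), value in self.series.items():
            label_map = dict(labels)
            if series_name == name and all(
                label_map.get(k) == v for k, v in match.items()
            ):
                out.append((label_map, value))
        return out


def emit_heartbeat(reg, host, is_leader):
    # Current approach: leadership is encoded in a label, the value is always 1.
    labels = {"host": host, "leader": "1" if is_leader else "0"}
    reg.set("service/heartbeat", labels, 1)


def emit_is_leader(reg, host, is_leader):
    # Proposed approach: leadership is the value of a single series per host.
    reg.set("is_leader", {"host": host}, 1 if is_leader else 0)


def simulate():
    reg = Registry()
    before = [("A", True), ("B", False), ("C", False)]   # A leads
    after = [("A", False), ("B", True), ("C", False)]    # B leads after failover
    for snapshot in (before, after):
        for host, is_leader in snapshot:
            emit_heartbeat(reg, host, is_leader)
            emit_is_leader(reg, host, is_leader)

    # Alert query on the heartbeat metric: count series with leader="1".
    heartbeat_leaders = len(reg.select("service/heartbeat", leader="1"))
    # Alert query on the proposed metric: sum the gauge values.
    gauge_leaders = sum(value for _, value in reg.select("is_leader"))
    return heartbeat_leaders, gauge_leaders


if __name__ == "__main__":
    print(simulate())
```

In this sketch the heartbeat-based count sees both the stale series from A and the fresh one from B, while the gauge-based sum only reflects the current state because each host updates the same series in place.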
