vineeth1995 opened a new issue, #20389:
URL: https://github.com/apache/pulsar/issues/20389
#Motivation:
Broker metrics currently include nothing that indicates the health of the
broker (that is, whether the broker is active). The broker metrics exposed to
Prometheus are already used for monitoring, so it would be useful for them to
report broker health as well. Prometheus could then scrape the broker state
automatically along with the other metrics. We therefore need a metric that
captures broker health.
#Goals:
This PIP adds the broker health status to the broker operability metrics.
Sample:
When we hit the /metrics endpoint, part of the output looks like the sample
below. Note the pulsar_health metric, which is added by this PIP. A value of 1
means the broker is active, 0 means it is inactive, and -1 means the status is
unknown.
```
# Output
# TYPE pulsar_active_connections gauge
pulsar_active_connections{cluster="standalone", broker="localhost", metric="broker_connection"} 2
# TYPE pulsar_connection_closed_total_count gauge
pulsar_connection_closed_total_count{cluster="standalone", broker="localhost", metric="broker_connection"} 0
# TYPE pulsar_connection_create_fail_count gauge
pulsar_connection_create_fail_count{cluster="standalone", broker="localhost", metric="broker_connection"} 0
# TYPE pulsar_connection_create_success_count gauge
pulsar_connection_create_success_count{cluster="standalone", broker="localhost", metric="broker_connection"} 2
# TYPE pulsar_connection_created_total_count gauge
pulsar_connection_created_total_count{cluster="standalone", broker="localhost", metric="broker_connection"} 2
# TYPE pulsar_health gauge
pulsar_health{cluster="standalone", broker="localhost", metric="broker_connection"} 1
```
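To illustrate how a consumer of the /metrics output might interpret the new gauge, here is a minimal sketch. The class `PulsarHealthParser` and its `Health` enum are hypothetical names introduced for this example and are not part of Pulsar. It assumes the value is printed as an integer, as in the sample above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Pulsar): reads pulsar_health from scraped /metrics text. */
final class PulsarHealthParser {
    enum Health { ACTIVE, INACTIVE, UNKNOWN }

    // Matches a sample line such as: pulsar_health{cluster="standalone", ...} 1
    // Assumption: the value is an integer, as in the sample output.
    private static final Pattern HEALTH_LINE =
            Pattern.compile("^pulsar_health(\\{[^}]*\\})?\\s+(-?\\d+)\\s*$", Pattern.MULTILINE);

    /** Returns empty if the metric is absent, e.g. when the health metric is disabled. */
    static Optional<Health> parse(String metricsText) {
        Matcher m = HEALTH_LINE.matcher(metricsText);
        if (!m.find()) {
            return Optional.empty();
        }
        switch (Integer.parseInt(m.group(2))) {
            case 1:
                return Optional.of(Health.ACTIVE);
            case 0:
                return Optional.of(Health.INACTIVE);
            default:
                // -1 (and any unexpected value) is treated as unknown.
                return Optional.of(Health.UNKNOWN);
        }
    }

    public static void main(String[] args) {
        String sample = "# TYPE pulsar_health gauge\n"
                + "pulsar_health{cluster=\"standalone\", broker=\"localhost\", "
                + "metric=\"broker_connection\"} 1\n";
        System.out.println(parse(sample)); // prints Optional[ACTIVE]
    }
}
```

Returning an empty `Optional` when the line is missing keeps "metric not exported" distinct from "status unknown" (-1).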
#Approach:
A new metric named "brk_health" is added to BrokerOperabilityMetrics. It
appears as pulsar_health in the /metrics sample above, in the same way as the
other brk_ connection metrics. BrokerService updates this metric at a fixed
rate.
```
// BrokerOperabilityMetrics.java
Metrics getConnectionMetrics() {
    Metrics rMetrics = Metrics.create(getDimensionMap("broker_connection"));
    rMetrics.put("brk_connection_created_total_count", connectionTotalCreatedCount.longValue());
    rMetrics.put("brk_connection_create_success_count", connectionCreateSuccessCount.longValue());
    rMetrics.put("brk_connection_create_fail_count", connectionCreateFailCount.longValue());
    rMetrics.put("brk_connection_closed_total_count", connectionTotalClosedCount.longValue());
    rMetrics.put("brk_active_connections", connectionActive.longValue());
    // Added by this PIP: broker health status (1 = active, 0 = inactive, -1 = unknown).
    rMetrics.put("brk_health", healthCheckStatus);
    return rMetrics;
}
```
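The snippet above reads a `healthCheckStatus` field, and BrokerService calls `recordHealthCheckStatusSuccess` and `recordHealthCheckStatusFail`, but the issue does not show how the status is stored. One possible shape is sketched below. The class name `HealthStatusHolder`, the use of `AtomicInteger`, and the initial value of -1 are assumptions made for illustration, chosen to match the 1 / 0 / -1 meanings described in the sample.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical stand-in for the health-status part of BrokerOperabilityMetrics.
 * The storage type and initial value are assumptions, not taken from the PIP.
 */
final class HealthStatusHolder {
    static final int UNKNOWN = -1;  // no health check has completed yet
    static final int INACTIVE = 0;  // last health check failed
    static final int ACTIVE = 1;    // last health check succeeded

    // Atomic because the updater thread and the metrics scrape may run concurrently.
    private final AtomicInteger healthCheckStatus = new AtomicInteger(UNKNOWN);

    void recordHealthCheckStatusSuccess() {
        healthCheckStatus.set(ACTIVE);
    }

    void recordHealthCheckStatusFail() {
        healthCheckStatus.set(INACTIVE);
    }

    int getHealthCheckStatus() {
        return healthCheckStatus.get();
    }

    public static void main(String[] args) {
        HealthStatusHolder holder = new HealthStatusHolder();
        System.out.println(holder.getHealthCheckStatus()); // prints -1
        holder.recordHealthCheckStatusSuccess();
        System.out.println(holder.getHealthCheckStatus()); // prints 1
        holder.recordHealthCheckStatusFail();
        System.out.println(holder.getHealthCheckStatus()); // prints 0
    }
}
```

Starting at -1 would let the metric report "unknown" between broker startup and the first completed health check.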
A periodic health check job is scheduled at a fixed rate in BrokerService.
Each run updates the broker health metric in the BrokerOperabilityMetrics
stats. The interval between runs is taken from the broker configuration.
```
// BrokerService.java
protected void initializeHealthChecker() {
    ServiceConfiguration config = pulsar().getConfiguration();
    // A non-positive interval leaves the health check metric disabled.
    if (config.getHealthCheckMetricsUpdateTimeInSeconds() > 0) {
        int interval = config.getHealthCheckMetricsUpdateTimeInSeconds();
        statsUpdater.scheduleAtFixedRate(this::checkHealth, interval, interval, TimeUnit.SECONDS);
    }
}

public CompletableFuture<Void> checkHealth() {
    return internalRunHealthCheck(TopicVersion.V2, pulsar(), null)
            .thenAccept(__ -> {
                // Health check completed normally: mark the broker as active.
                this.pulsarStats.getBrokerOperabilityMetrics().recordHealthCheckStatusSuccess();
            }).exceptionally(ex -> {
                // Health check failed: mark the broker as inactive.
                this.pulsarStats.getBrokerOperabilityMetrics().recordHealthCheckStatusFail();
                return null;
            });
}
```
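The scheduling pattern above can be tried outside the broker with plain JDK classes. The following is a minimal sketch, not the PIP's implementation: `PeriodicHealthCheck` and its `probe` supplier are hypothetical stand-ins for BrokerService and `internalRunHealthCheck`, and the status is kept in a local `AtomicInteger` instead of BrokerOperabilityMetrics.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical, self-contained model of the fixed-rate health check job. */
final class PeriodicHealthCheck {
    private final AtomicInteger status = new AtomicInteger(-1); // -1 = unknown
    private final Supplier<CompletableFuture<Void>> probe;      // stand-in for the real check

    PeriodicHealthCheck(Supplier<CompletableFuture<Void>> probe) {
        this.probe = probe;
    }

    /** Runs the probe once and records 1 on success or 0 on failure. */
    CompletableFuture<Void> checkHealth() {
        return probe.get()
                .thenAccept(ignored -> status.set(1))
                .exceptionally(ex -> {
                    status.set(0);
                    return null;
                });
    }

    /** Schedules the job; returns false and schedules nothing if the interval is not positive. */
    boolean start(ScheduledExecutorService scheduler, int intervalSeconds) {
        if (intervalSeconds <= 0) {
            return false;
        }
        scheduler.scheduleAtFixedRate(this::checkHealth,
                intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
        return true;
    }

    int status() {
        return status.get();
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        PeriodicHealthCheck check =
                new PeriodicHealthCheck(() -> CompletableFuture.completedFuture(null));
        check.start(scheduler, 1);
        Thread.sleep(1500); // give the first scheduled run time to execute
        System.out.println("health = " + check.status());
        scheduler.shutdownNow();
    }
}
```

One design point worth noting: `scheduleAtFixedRate` suppresses all later executions if a run throws. Handling failures inside the returned future with `exceptionally`, as the PIP code does, keeps asynchronous errors from cancelling the job. A probe that throws synchronously before returning a future would still need separate handling.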
No new API is needed, because the Admin module already has a "healthCheck" API
that provides the necessary functionality. However, we do not make a REST call
to this API, as that could be costly. Instead, we add a helper function,
"internalRunHealthCheck", to the Admin module that piggybacks on the existing
health check functionality there.
#Configuration Changes:
This PIP gives the option to dynamically switch the broker health check metric
on or off using the "healthCheckMetricsUpdateTimeInSeconds" config. The same
config also sets how often the metric is updated, in seconds. Setting it to -1
disables the metric; the check in initializeHealthChecker treats any
non-positive value the same way. The default value is -1, so the metric is
disabled unless it is configured.
```
# broker.conf
healthCheckMetricsUpdateTimeInSeconds=-1
```