Marton Greber has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21723 )

Change subject: Add Prometheus HTTP service discovery
......................................................................


Patch Set 10:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/21723/10/src/kudu/master/master-test.cc
File src/kudu/master/master-test.cc:

http://gerrit.cloudera.org:8080/#/c/21723/10/src/kudu/master/master-test.cc@4111
PS10, Line 4111: } // namespace master
> Since this changelist has added functionality to enable Prometheus service
Done


http://gerrit.cloudera.org:8080/#/c/21723/10/src/kudu/master/master_path_handlers.cc
File src/kudu/master/master_path_handlers.cc:

http://gerrit.cloudera.org:8080/#/c/21723/10/src/kudu/master/master_path_handlers.cc@984
PS10, Line 984: WriteEmptyPrometheusSDResponse(output);
> nit: since the server responds with HttpStatusCode::ServiceUnavailable, sen
I've checked the Prometheus source and, as expected, Prometheus short-circuits 
on any non-200 status code [1], so yes, this line can be removed.

[1] 
https://github.com/prometheus/prometheus/blob/7512d13e00c50b0287f8cd8576eb54d91977f77c/discovery/http/http.go#L172
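To illustrate why the response body is redundant alongside a 503, here is a 
minimal C++ sketch of the consumer-side check. It is not the Prometheus code 
(which is Go); HttpResponse and BodyToParse are hypothetical names used only 
for illustration.

```cpp
#include <optional>
#include <string>

// Hypothetical, simplified model of an HTTP response seen by an SD consumer.
struct HttpResponse {
  int status_code;
  std::string body;
};

// Returns the body that would be handed to the JSON parser. For any status
// other than 200 the body is never inspected, so whatever the server wrote
// there (an empty list or nothing at all) makes no difference.
std::optional<std::string> BodyToParse(const HttpResponse& resp) {
  if (resp.status_code != 200) {
    return std::nullopt;
  }
  return resp.body;
}
```

Under this model, a 503 carrying "[]" and a 503 with no body are 
indistinguishable to the consumer.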


http://gerrit.cloudera.org:8080/#/c/21723/10/src/kudu/master/master_path_handlers.cc@988
PS10, Line 988:   if (!l.leader_status().ok()) {
              :     WriteEmptyPrometheusSDResponse(output);
> If we are sending back and empty list with HTTP 200 when a particular insta
Ah, yes, nice catch, thank you!
I’ve looked into this, and it turns out the data is not necessarily wiped in 
such cases; it simply becomes stale [1], [2]:

“If a target scrape or rule evaluation no longer returns a sample for a time 
series that was previously present, that time series is marked as stale. If a 
target is removed, the previously retrieved time series will be marked stale 
soon after removal.” [1]

What isn’t clear to me is whether, once a series is marked stale, it 
automatically becomes “live” again when a different master is elected leader, 
serves the SD endpoint, and Prometheus can resume scraping.

If stale series can revert to normal as soon as new samples arrive, then we’re 
fine, but we need to test it. I’ll implement MiniPrometheus (KUDU-3685) and 
circle back once we have a proper test environment to verify scenarios like 
this with confidence.

[1] 
https://github.com/prometheus/prometheus/blob/7512d13e00c50b0287f8cd8576eb54d91977f77c/discovery/http/http.go#L172
[2] https://www.robustperception.io/staleness-and-promql/
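For reference, the scenario under discussion can be modeled roughly as below. 
This is only a sketch under an assumption that has not been verified here: 
that the consumer replaces its target list on an HTTP 200 response and keeps 
the previous list on any other status. SdRefresh and ApplyRefresh are 
hypothetical names, not Kudu or Prometheus APIs.

```cpp
#include <string>
#include <vector>

// Hypothetical model of the outcome of one service discovery refresh.
struct SdRefresh {
  int status_code;                   // HTTP status returned by the SD endpoint
  std::vector<std::string> targets;  // targets listed in the body, if any
};

// Assumed consumer behavior: a 200 response replaces the known target list
// (even with an empty one); any other status leaves the list untouched.
std::vector<std::string> ApplyRefresh(std::vector<std::string> current,
                                      const SdRefresh& refresh) {
  if (refresh.status_code == 200) {
    return refresh.targets;
  }
  return current;
}
```

In this model, an empty list sent with HTTP 200 by a non-leader master drops 
every known target, which is the case where the series would go stale, 
whereas a 503 leaves the targets in place. Whether the series recover once a 
new leader serves the endpoint is the open question the test environment is 
meant to answer.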



--
To view, visit http://gerrit.cloudera.org:8080/21723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I931aa72a7567c0dde43d7b7ed53a2dd0fa8bc1fe
Gerrit-Change-Number: 21723
Gerrit-PatchSet: 10
Gerrit-Owner: Marton Greber <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: Gabriella Lotz <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Marton Greber <[email protected]>
Gerrit-Reviewer: Wang Xixu <[email protected]>
Gerrit-Reviewer: Zoltan Chovan <[email protected]>
Gerrit-Reviewer: Zoltan Martonka <[email protected]>
Gerrit-Comment-Date: Wed, 06 Aug 2025 12:57:36 +0000
Gerrit-HasComments: Yes
