[kudu-CR] Add Prometheus HTTP service discovery

Alexey Serbin (Code Review) Fri, 18 Jul 2025 11:53:31 -0700

Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21723 )


Change subject: Add Prometheus HTTP service discovery
......................................................................


Patch Set 9:

(9 comments)

Overall looks OK for a single-master Kudu clusters, just a few nits.

As for making this work for multi-master Kudu clusters, it's up to you whether 
you want to address it in this patch right away or in a follow-up patch (maybe 
be along with adding extra hold-off period to allow for registration of Kudu 
tablet servers after masters starting up).

http://gerrit.cloudera.org:8080/#/c/21723/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21723/9//COMMIT_MSG@110
PS9, Line 110: url: 'http://127.0.0.1:8765/prometheus-sd'
What are you thoughts on multi-master Kudu clusters?  IIRC, the list of 
registered tablet servers is ephemeral and exists only in the memory of leader 
master (e.g., see  TSManager::UnregisterTServer()).  Should we provide just one 
of many (e.g., 3) master URLs, and do HTTP redirect to the current leader 
master when serving SD requests from a Prometheus instance?  I added a comment 
about that around the corresponding code as well.


http://gerrit.cloudera.org:8080/#/c/21723/9/src/kudu/master/master-test.cc
File src/kudu/master/master-test.cc:

http://gerrit.cloudera.org:8080/#/c/21723/9/src/kudu/master/master-test.cc@3709
PS9, Line 3709: TEST_F(MasterStartupTest, 
PrometheusServiceDiscoveryDuringStartup) {
Did you find this test stable enough?  I'd think that in case of fast startup 
of master and the presence of scheduler anomalies where the 'monitor_thread' is 
put of CPU for quite some time, this new scenario might become flaky.  If it's 
indeed a possibility, consider using 
--catalog_manager_inject_latency_load_ca_info_ms to inject extra latency into 
the system catalog's startup sequence, and also using 
WaitUntilCatalogManagerIsLeaderAndReadyForTests() instead of 
WaitForCatalogManagerInit() after restart, making this scenario more stable 
under such circumstances.


http://gerrit.cloudera.org:8080/#/c/21723/9/src/kudu/master/master-test.cc@3750
PS9, Line 3750:   // Cluster id might not be available immediately after 
WaitForCatalogManagerInit(), so we
              :   // wait for the prometheus-sd endpoint to return HTTP 200.
              :   ASSERT_EVENTUALLY([&] {
              :     EasyCurl c;
              :     faststring buf;
              :     ASSERT_OK(c.FetchURL(Substitute("http://$0/prometheus-sd";, 
addr), &buf));
              :   });
              : }
Alternatively, you might use WaitUntilCatalogManagerIsLeaderAndReadyForTests() 
instead of WaitForCatalogManagerInit() for single-master scenarios to avoid 
wrapping this into ASSERT_EVENTUALLY.


http://gerrit.cloudera.org:8080/#/c/21723/3/src/kudu/master/master_path_handlers.cc
File src/kudu/master/master_path_handlers.cc:

http://gerrit.cloudera.org:8080/#/c/21723/3/src/kudu/master/master_path_handlers.cc@880
PS3, Line 880:   jw.EndArray();
> Yep that makes sense, thanks for taking the time to think this through!
Thank you for the update.  Agreed: adding a grace period is  better to be done 
in a separate patch.


http://gerrit.cloudera.org:8080/#/c/21723/3/src/kudu/master/master_path_handlers.cc@936
PS3, Line 936:           Substitute
> Yep exactly.
That looks good to me, thanks.


http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.cc
File src/kudu/master/master_path_handlers.cc:

http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.cc@933
PS8, Line 933:
> This is HostPortPB. If i'm not mistaken there is no ToString().
whoops


http://gerrit.cloudera.org:8080/#/c/21723/9/src/kudu/master/master_path_handlers.cc
File src/kudu/master/master_path_handlers.cc:

http://gerrit.cloudera.org:8080/#/c/21723/9/src/kudu/master/master_path_handlers.cc@969
PS9, Line 969:   auto refresh_interval_it = 
req.request_headers.find("X-Prometheus-Refresh-Interval-Seconds");
             :   if (refresh_interval_it != req.request_headers.end()) {
             :     VLOG(1) << Substitute("Prometheus service discovery refresh 
interval: $0 seconds",
             :                           refresh_interval_it->second);
             :   }
nit: enclose this into 'if (VLOG_IS_ON(1)) { ... }' clause to avoid spending 
CPU resources when they not verbose logging is not requested.


http://gerrit.cloudera.org:8080/#/c/21723/9/src/kudu/master/master_path_handlers.cc@978
PS9, Line 978:   if (!master_->catalog_manager()->IsInitialized()) {
I think we should check for a stronger condition here, e.g. make sure the Raft 
consensus is running, this instance is a leader master.  That would be a 
guarantee at least to have valid (i.e. non-empty) cluster_id and up-to-date 
list of tablet servers.  So, there also should be HTTP redirect to current 
leader master.

When doing so, CatalogManager::ScopedLeaderSharedLock construct should be used: 
check for already existing call sites in this file.

What do you think?


http://gerrit.cloudera.org:8080/#/c/21723/9/src/kudu/master/master_path_handlers.cc@993
PS9, Line 993:     WriteEmptyPrometheusSDResponse(output);
Shouldn't the server respond with HttpStatusCode::ServiceUnavailable here as 
well?



--
To view, visit http://gerrit.cloudera.org:8080/21723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I931aa72a7567c0dde43d7b7ed53a2dd0fa8bc1fe
Gerrit-Change-Number: 21723
Gerrit-PatchSet: 9
Gerrit-Owner: Marton Greber <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: Gabriella Lotz <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Marton Greber <[email protected]>
Gerrit-Reviewer: Wang Xixu <[email protected]>
Gerrit-Reviewer: Zoltan Chovan <[email protected]>
Gerrit-Reviewer: Zoltan Martonka <[email protected]>
Gerrit-Comment-Date: Fri, 18 Jul 2025 18:53:04 +0000
Gerrit-HasComments: Yes

[kudu-CR] Add Prometheus HTTP service discovery

Reply via email to