Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/21723 )
Change subject: Add Prometheus HTTP service discovery ...................................................................... Patch Set 9: (9 comments) Overall looks OK for a single-master Kudu clusters, just a few nits. As for making this work for multi-master Kudu clusters, it's up to you whether you want to address it in this patch right away or in a follow-up patch (maybe be along with adding extra hold-off period to allow for registration of Kudu tablet servers after masters starting up). http://gerrit.cloudera.org:8080/#/c/21723/9//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21723/9//COMMIT_MSG@110 PS9, Line 110: url: 'http://127.0.0.1:8765/prometheus-sd' What are you thoughts on multi-master Kudu clusters? IIRC, the list of registered tablet servers is ephemeral and exists only in the memory of leader master (e.g., see TSManager::UnregisterTServer()). Should we provide just one of many (e.g., 3) master URLs, and do HTTP redirect to the current leader master when serving SD requests from a Prometheus instance? I added a comment about that around the corresponding code as well. http://gerrit.cloudera.org:8080/#/c/21723/9/src/kudu/master/master-test.cc File src/kudu/master/master-test.cc: http://gerrit.cloudera.org:8080/#/c/21723/9/src/kudu/master/master-test.cc@3709 PS9, Line 3709: TEST_F(MasterStartupTest, PrometheusServiceDiscoveryDuringStartup) { Did you find this test stable enough? I'd think that in case of fast startup of master and the presence of scheduler anomalies where the 'monitor_thread' is put of CPU for quite some time, this new scenario might become flaky. If it's indeed a possibility, consider using --catalog_manager_inject_latency_load_ca_info_ms to inject extra latency into the system catalog's startup sequence, and also using WaitUntilCatalogManagerIsLeaderAndReadyForTests() instead of WaitForCatalogManagerInit() after restart, making this scenario more stable under such circumstances. http://gerrit.cloudera.org:8080/#/c/21723/9/src/kudu/master/master-test.cc@3750 PS9, Line 3750: // Cluster id might not be available immediately after WaitForCatalogManagerInit(), so we : // wait for the prometheus-sd endpoint to return HTTP 200. : ASSERT_EVENTUALLY([&] { : EasyCurl c; : faststring buf; : ASSERT_OK(c.FetchURL(Substitute("http://$0/prometheus-sd", addr), &buf)); : }); : } Alternatively, you might use WaitUntilCatalogManagerIsLeaderAndReadyForTests() instead of WaitForCatalogManagerInit() for single-master scenarios to avoid wrapping this into ASSERT_EVENTUALLY. http://gerrit.cloudera.org:8080/#/c/21723/3/src/kudu/master/master_path_handlers.cc File src/kudu/master/master_path_handlers.cc: http://gerrit.cloudera.org:8080/#/c/21723/3/src/kudu/master/master_path_handlers.cc@880 PS3, Line 880: jw.EndArray(); > Yep that makes sense, thanks for taking the time to think this through! Thank you for the update. Agreed: adding a grace period is better to be done in a separate patch. http://gerrit.cloudera.org:8080/#/c/21723/3/src/kudu/master/master_path_handlers.cc@936 PS3, Line 936: Substitute > Yep exactly. That looks good to me, thanks. http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.cc File src/kudu/master/master_path_handlers.cc: http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.cc@933 PS8, Line 933: > This is HostPortPB. If i'm not mistaken there is no ToString(). whoops http://gerrit.cloudera.org:8080/#/c/21723/9/src/kudu/master/master_path_handlers.cc File src/kudu/master/master_path_handlers.cc: http://gerrit.cloudera.org:8080/#/c/21723/9/src/kudu/master/master_path_handlers.cc@969 PS9, Line 969: auto refresh_interval_it = req.request_headers.find("X-Prometheus-Refresh-Interval-Seconds"); : if (refresh_interval_it != req.request_headers.end()) { : VLOG(1) << Substitute("Prometheus service discovery refresh interval: $0 seconds", : refresh_interval_it->second); : } nit: enclose this into 'if (VLOG_IS_ON(1)) { ... }' clause to avoid spending CPU resources when they not verbose logging is not requested. http://gerrit.cloudera.org:8080/#/c/21723/9/src/kudu/master/master_path_handlers.cc@978 PS9, Line 978: if (!master_->catalog_manager()->IsInitialized()) { I think we should check for a stronger condition here, e.g. make sure the Raft consensus is running, this instance is a leader master. That would be a guarantee at least to have valid (i.e. non-empty) cluster_id and up-to-date list of tablet servers. So, there also should be HTTP redirect to current leader master. When doing so, CatalogManager::ScopedLeaderSharedLock construct should be used: check for already existing call sites in this file. What do you think? http://gerrit.cloudera.org:8080/#/c/21723/9/src/kudu/master/master_path_handlers.cc@993 PS9, Line 993: WriteEmptyPrometheusSDResponse(output); Shouldn't the server respond with HttpStatusCode::ServiceUnavailable here as well? -- To view, visit http://gerrit.cloudera.org:8080/21723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I931aa72a7567c0dde43d7b7ed53a2dd0fa8bc1fe Gerrit-Change-Number: 21723 Gerrit-PatchSet: 9 Gerrit-Owner: Marton Greber <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Attila Bukor <[email protected]> Gerrit-Reviewer: Gabriella Lotz <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Marton Greber <[email protected]> Gerrit-Reviewer: Wang Xixu <[email protected]> Gerrit-Reviewer: Zoltan Chovan <[email protected]> Gerrit-Reviewer: Zoltan Martonka <[email protected]> Gerrit-Comment-Date: Fri, 18 Jul 2025 18:53:04 +0000 Gerrit-HasComments: Yes
