Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/21723 )
Change subject: Add Prometheus HTTP service discovery ...................................................................... Patch Set 3: (11 comments) http://gerrit.cloudera.org:8080/#/c/21723/3/src/kudu/master/master-test.cc File src/kudu/master/master-test.cc: http://gerrit.cloudera.org:8080/#/c/21723/3/src/kudu/master/master-test.cc@3883 PS3, Line 3883: } // namespace master > Does it make sense to add a test to ensure the reported list of targets is Ah, and now I think we should get HTTP 503 response instead of HTTP 200 with empty array/list, but it would be great to add a test scenario to verify that condition as well. http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master-test.cc File src/kudu/master/master-test.cc: http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master-test.cc@3961 PS8, Line 3961: nit: would be great to have a small doc blurb on what this scenario tries to verify http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master-test.cc@3969 PS8, Line 3969: : : Does it make sense to test for the presence of all the expected fields here (e.g., cluster_id)? http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.h File src/kudu/master/master_path_handlers.h: http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.h@53 PS8, Line 53: Even if right now Kudu masters don't have assigned locations, I'd think of using the same approach for identifying master and tserver scraping targets from the Prometheus point of view, and reporting locations for masters the same way as non-assigned locations are reported for tservers. BTW, we might eventually add locations for masters as well if trying to achieve a client's affinity to closest master, etc. http://gerrit.cloudera.org:8080/#/c/21723/3/src/kudu/master/master_path_handlers.cc File src/kudu/master/master_path_handlers.cc: http://gerrit.cloudera.org:8080/#/c/21723/3/src/kudu/master/master_path_handlers.cc@880 PS3, Line 880: JsonError(s, output); I missed responding to this in prior review rounds, but it's better late than never. I don't think it's a good idea to respond with an empty list upon master's restart because it would add extra volatility and change in the list of targets, given that Prometheus caches target's list and naturally uses the latest cached HTTP 200 response from the SD point, as per documentation: > Prometheus caches target lists. If an error occurs while fetching an updated > targets list, Prometheus keeps using the current targets list. The targets > list is not saved across restart. The prometheus_sd_http_failures_total > counter metric tracks the number of refresh failures. I'd suggest responding with HTTP 503 (ServiceUnavailable) while Kudu masters are being bootstrapped, and maybe even for a short time interval after that: (FLAGS_heartbeat_rpc_timeout_ms + FLAGS_heartbeat_interval_ms) being an extra delta. The idea is to avoid changing the targets' list when Kudu masters restart. Imagine a situation when all the tablet servers are kept running, but masters are being restarted. If that happens, it's better to wait a bit to allow for all the tablet servers to re-register with masters after the restart than induce the change of targets, leading to blackouts in the collected tablet servers' metrics for the SD refresh interval. Even if the SD refresh interval by default is 60 seconds, avoiding extra dynamics there is a good point, IMO. http://gerrit.cloudera.org:8080/#/c/21723/3/src/kudu/master/master_path_handlers.cc@936 PS3, Line 936: jw.String("Kudu"); > Great. Now, should different clusters be scraped by different Prometheus j Indeed, it seems Kudu nodes from different clusters should logically belong to different jobs: https://prometheus.io/docs/concepts/jobs_instances/ Also, since different Kudu clusters might have different authentication requirements, etc., it makes sense to use separate Prometheus jobs for them to be able to configure scrapping independently using scrape_config: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.cc File src/kudu/master/master_path_handlers.cc: http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.cc@917 PS8, Line 917: // Create Prometheus service discovery JSON. > Some tablet servers might considered down after they have registered with m Ah, please ignore this. After some consideration, it's now clear to me that we want to reduce the amount of dynamics in the targets' list as much as possible. http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.cc@933 PS8, Line 933: g("labels"); Why not to use HostPort::ToString() instead? http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.cc@947 PS8, Line 947: : jw.EndArray(); nit: using Substitute() makes messages' format easier to comprehend http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.cc@953 PS8, Line 953: ing("group"); Use HostPort::ToString() instead http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.cc@962 PS8, Line 962: constexpr const bool is_on_nav_bar = true; nit: it might be useful to output the X-Prometheus-Refresh-Interval-Seconds header's setting (i.e. the SD refresh interval [1]) with VLOG(1) to help with troubleshooting, if any is necessary. [1] https://prometheus.io/docs/prometheus/latest/http_sd/ -- To view, visit http://gerrit.cloudera.org:8080/21723 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I931aa72a7567c0dde43d7b7ed53a2dd0fa8bc1fe Gerrit-Change-Number: 21723 Gerrit-PatchSet: 3 Gerrit-Owner: Marton Greber <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Attila Bukor <[email protected]> Gerrit-Reviewer: Gabriella Lotz <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Marton Greber <[email protected]> Gerrit-Reviewer: Wang Xixu <[email protected]> Gerrit-Reviewer: Zoltan Chovan <[email protected]> Gerrit-Reviewer: Zoltan Martonka <[email protected]> Gerrit-Comment-Date: Wed, 16 Jul 2025 06:39:39 +0000 Gerrit-HasComments: Yes
