Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21723 )

Change subject: Add Prometheus HTTP service discovery
......................................................................


Patch Set 3:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/21723/3/src/kudu/master/master-test.cc
File src/kudu/master/master-test.cc:

http://gerrit.cloudera.org:8080/#/c/21723/3/src/kudu/master/master-test.cc@3883
PS3, Line 3883: } // namespace master
> Does it make sense to add a test to ensure the reported list of targets is
Ah, and now I think we should get HTTP 503 response instead of HTTP 200 with 
empty array/list, but it would be great to add a test scenario to verify that 
condition as well.


http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master-test.cc
File src/kudu/master/master-test.cc:

http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master-test.cc@3961
PS8, Line 3961:
nit: would be great to have a small doc blurb on what this scenario tries to 
verify


http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master-test.cc@3969
PS8, Line 3969:
              :
              :
Does it make sense to test for the presence of all the expected fields here 
(e.g., cluster_id)?


http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.h
File src/kudu/master/master_path_handlers.h:

http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.h@53
PS8, Line 53:
Even if right now Kudu masters don't have assigned locations, I'd think of 
using the same approach for identifying master and tserver scraping targets 
from the Prometheus point of view, and reporting locations for masters the same 
way as non-assigned locations are reported for tservers.

BTW, we might eventually add locations for masters as well if trying to achieve 
a client's affinity to closest master, etc.


http://gerrit.cloudera.org:8080/#/c/21723/3/src/kudu/master/master_path_handlers.cc
File src/kudu/master/master_path_handlers.cc:

http://gerrit.cloudera.org:8080/#/c/21723/3/src/kudu/master/master_path_handlers.cc@880
PS3, Line 880:     JsonError(s, output);
I missed responding to this in prior review rounds, but it's better late than 
never.

I don't think it's a good idea to respond with an empty list upon master's 
restart because it would add extra volatility and change in the list of 
targets, given that Prometheus caches target's list and naturally uses the 
latest cached HTTP 200 response from the SD point, as per documentation:

> Prometheus caches target lists. If an error occurs while  fetching an updated 
> targets list, Prometheus keeps using the current targets list. The targets 
> list is not saved across restart. The prometheus_sd_http_failures_total 
> counter metric tracks the number of refresh failures.

I'd suggest responding with HTTP 503 (ServiceUnavailable) while Kudu masters 
are being bootstrapped, and maybe even for a short time interval after that: 
(FLAGS_heartbeat_rpc_timeout_ms + FLAGS_heartbeat_interval_ms) being an extra 
delta.

The idea is to avoid changing the targets' list when Kudu masters restart.  
Imagine a situation when all the tablet servers are kept running, but masters 
are being restarted.  If that happens, it's better to wait a bit to allow for 
all the tablet servers to re-register with masters after the restart than 
induce the change of targets, leading to blackouts in the collected tablet 
servers' metrics for the SD refresh interval.  Even if the SD refresh interval 
by default is 60 seconds, avoiding extra dynamics there is a good point, IMO.


http://gerrit.cloudera.org:8080/#/c/21723/3/src/kudu/master/master_path_handlers.cc@936
PS3, Line 936:   jw.String("Kudu");
> Great.  Now, should different clusters be scraped by different Prometheus j
Indeed, it seems Kudu nodes from different clusters should logically belong to 
different jobs: https://prometheus.io/docs/concepts/jobs_instances/

Also, since different Kudu clusters might have different authentication 
requirements, etc., it makes sense to use separate Prometheus jobs for them to 
be able to configure scrapping independently using scrape_config: 
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config


http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.cc
File src/kudu/master/master_path_handlers.cc:

http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.cc@917
PS8, Line 917:   // Create Prometheus service discovery JSON.
> Some tablet servers might considered down after they have registered with m
Ah, please ignore this.  After some consideration, it's now clear to me that we 
want to reduce the amount of dynamics in the targets' list as much as possible.


http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.cc@933
PS8, Line 933: g("labels");
Why not to use HostPort::ToString() instead?


http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.cc@947
PS8, Line 947:
             :   jw.EndArray();
nit: using Substitute() makes messages' format easier to comprehend


http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.cc@953
PS8, Line 953: ing("group");
Use HostPort::ToString() instead


http://gerrit.cloudera.org:8080/#/c/21723/8/src/kudu/master/master_path_handlers.cc@962
PS8, Line 962:   constexpr const bool is_on_nav_bar = true;
nit: it might be useful to output the X-Prometheus-Refresh-Interval-Seconds 
header's setting (i.e. the SD refresh interval [1]) with VLOG(1) to help with 
troubleshooting, if any is necessary.

[1] https://prometheus.io/docs/prometheus/latest/http_sd/



--
To view, visit http://gerrit.cloudera.org:8080/21723
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I931aa72a7567c0dde43d7b7ed53a2dd0fa8bc1fe
Gerrit-Change-Number: 21723
Gerrit-PatchSet: 3
Gerrit-Owner: Marton Greber <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: Gabriella Lotz <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Marton Greber <[email protected]>
Gerrit-Reviewer: Wang Xixu <[email protected]>
Gerrit-Reviewer: Zoltan Chovan <[email protected]>
Gerrit-Reviewer: Zoltan Martonka <[email protected]>
Gerrit-Comment-Date: Wed, 16 Jul 2025 06:39:39 +0000
Gerrit-HasComments: Yes

Reply via email to