The code for setting the "processing" metric is here: https://github.com/prometheus/blackbox_exporter/blob/master/prober/http.go#L621
"resolve": trace.start to dnsDone "connect": dnsDone to gotConn (for non-TLS) (*) "processing": gotConn to responseStart "transfer": responseStart to trace.end AFAICS, "processing" is the time between the connection being established (and presumably the request being written), until the beginning of the HTTP response being received. > makes us think not only about network problems, but also about the performance of the exporter itself, for example in case there are a large number of competitive checks. It could be, but it seems unlikely. I'd say it's more likely to be slow response from the remote webserver that you're probing, which is what this metric is intended to measure. It *could* be network packet loss causing delays and retransmits, but then I'd expect the "connect" and/or "tls" time to be high as well. Similarly, if the box where blackbox_exporter is running were heavily overloaded, then several metrics would likely be high. If only "processing" is high, then more likely this is a real reflection of the remote webserver being slow to respond. Some webservers perform reverse DNS lookups on incoming connections. That's one possible cause for the webserver to be slow in sending a response (and maybe your traces don't start until the webserver has finished this lookup). But you'll need to drill down more carefully at your side; looking at timestamped packet captures may help. (*) for TLS, "connect" is from dnsDone to connectDone, and "tls" is from tlsStart to tlsDone. However, "processing" is still measured from gotConn, and I don't know how that fits in (i.e. before or after TLS). You may be able to find more from the docs for httptrace.ClientTrace <https://pkg.go.dev/net/http/httptrace#ClientTrace>. 
On Tuesday, 25 January 2022 at 11:51:07 UTC [email protected] wrote:

> Metric values in case of failure (5 sec timeout):
>
> # HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
> # TYPE probe_dns_lookup_time_seconds gauge
> probe_dns_lookup_time_seconds 0.003334816
> # HELP probe_duration_seconds Returns how long the probe took to complete in seconds
> # TYPE probe_duration_seconds gauge
> probe_duration_seconds 5.000336144
> # HELP probe_failed_due_to_regex Indicates if probe failed due to regex
> # TYPE probe_failed_due_to_regex gauge
> probe_failed_due_to_regex 0
> # HELP probe_http_content_length Length of http content response
> # TYPE probe_http_content_length gauge
> probe_http_content_length 0
> # HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
> # TYPE probe_http_duration_seconds gauge
> probe_http_duration_seconds{phase="connect"} 0
> probe_http_duration_seconds{phase="processing"} 0
> probe_http_duration_seconds{phase="resolve"} 0.003334816
> probe_http_duration_seconds{phase="tls"} 0
> probe_http_duration_seconds{phase="transfer"} 0
> # HELP probe_http_redirects The number of redirects
> # TYPE probe_http_redirects gauge
> probe_http_redirects 0
> # HELP probe_http_ssl Indicates if SSL was used for the final redirect
> # TYPE probe_http_ssl gauge
> probe_http_ssl 0
> # HELP probe_http_status_code Response HTTP status code
> # TYPE probe_http_status_code gauge
> probe_http_status_code 0
> # HELP probe_http_uncompressed_body_length Length of uncompressed response body
> # TYPE probe_http_uncompressed_body_length gauge
> probe_http_uncompressed_body_length 0
> # HELP probe_http_version Returns the version of HTTP of the probe response
> # TYPE probe_http_version gauge
> probe_http_version 0
> # HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
> # TYPE probe_ip_protocol gauge
> probe_ip_protocol 4
> # HELP probe_success Displays whether or not the probe was a success
> # TYPE probe_success gauge
> probe_success 0
>
> Tuesday, 25 January 2022 at 14:47:53 UTC+3, Houston Heat:
>
>> I found an issue
>> (https://github.com/prometheus/blackbox_exporter/issues/579) that
>> describes a similar change in timing for the other phases, except for
>> "processing" - the metric returns the value 0, as if this phase never
>> started at all.
>>
>> Tuesday, 25 January 2022 at 14:07:58 UTC+3, Houston Heat:
>>
>>> There are several HTTP phases (github)
>>> <https://github.com/prometheus/blackbox_exporter/blob/e34a36be40264b1ceda1ba8ddbd769c4a62eac16/prober/http.go#L310>
>>> in the Prometheus metric labels when probing via the Blackbox exporter:
>>>
>>> 1. resolve
>>> 2. connect
>>> 3. tls
>>> 4. processing
>>> 5. transfer
>>>
>>> What does "processing" stand for in the context of the exporter? What
>>> kind of processing exactly (js/html/json render waiting), given that the
>>> response timings come from the "transfer" phase?
>>>
>>> Context: We found a sudden increase in latency and availability problems
>>> on one of our sites when checking with the Blackbox exporter, but the
>>> total latency measured by request tracing and APM (New Relic) did not
>>> change. This is supported by additional geo-distributed blackboxes and
>>> synthetic monitoring tools. We are convinced that overall the behaviour
>>> of the site has not changed; however, network/hardware/software problems
>>> may exist in the environment of this particular exporter, and therefore
>>> we would like to understand how the "processing" phase reflects the
>>> picture of the request.
>>> It is quite obvious to me that if the site is unavailable, the check
>>> time in probe_duration_seconds will tend towards the timeout value;
>>> however, a significant difference in the
>>> probe_http_duration_seconds{phase="processing"} time makes us think not
>>> only about network problems, but also about the performance of the
>>> exporter itself, for example in case there are a large number of
>>> concurrent checks.
>>>
>>> [image: overall-phases-shift.png] [image: phases-shift-without-processing.png]
>>> [image: probe_http_duration-shift.png] [image: probe-failures.png]
>>>
>>> Single scrape example:
>>>
>>> # HELP probe_duration_seconds Returns how long the probe took to complete in seconds
>>> # TYPE probe_duration_seconds gauge
>>> probe_duration_seconds 1.6482566360000002
>>> # HELP probe_http_content_length Length of http content response
>>> # TYPE probe_http_content_length gauge
>>> probe_http_content_length -1
>>> # HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
>>> # TYPE probe_http_duration_seconds gauge
>>> probe_http_duration_seconds{phase="connect"} 0.00674702
>>> probe_http_duration_seconds{phase="processing"} 1.57984579
>>> probe_http_duration_seconds{phase="resolve"} 0.008215475
>>> probe_http_duration_seconds{phase="tls"} 0.024449581
>>> probe_http_duration_seconds{phase="transfer"} 0.028283079
>>> # HELP probe_http_redirects The number of redirects
>>> # TYPE probe_http_redirects gauge
>>> probe_http_redirects 1
>>>
>>> Thank you.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/eb483018-289e-4a16-a167-43217a841430n%40googlegroups.com.

