What is your scrape interval and scrape timeout on the Prometheus side? Prometheus sends a default scrape timeout of 10s to the exporter. The exporter timeout is only used if the timeout from the Prometheus server is longer.
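For context, the relevant knobs on the Prometheus side are `scrape_interval` and `scrape_timeout` on the snmp job. A minimal sketch of what that might look like — the job name, exporter address, and the 60s/30s values below are illustrative placeholders, not taken from your setup:

scrape_configs:
  - job_name: snmp_netscaler          # placeholder name
    scrape_interval: 60s
    scrape_timeout: 30s               # default is 10s; this is the timeout sent to the exporter
    metrics_path: /snmp
    params:
      module: [citrix_adc]
    static_configs:
      - targets:
          - netscaler1.example.com    # placeholder targets
          - netscaler2.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116   # address where the snmp_exporter is reachable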
On Mon, Jun 8, 2020 at 1:39 AM Justin Teare <[email protected]> wrote:

> Hi all, I have been running into some strange snmp walk timeout issues with snmp exporter against citrix netscaler appliances.
>
> Running the latest (0.18.0) snmp exporter as a docker container.
>
> If I try to walk the "vServer" or other similar metrics which have a time series for each vserver (as opposed to e.g. netscaler appliance cpu metrics), the walks are failing due to timeouts in a bizarrely periodic way. We currently have around ~420 vservers on each load balancer.
>
> *Behaviour*
>
> The snmp exporter will fail to walk the netscaler at approx 15 mins past the hour, every hour, and will not walk again correctly for 15-20 mins. I am walking 2 netscalers, and the scrapes fail on both netscalers at the same time. One resumes walking after about 15 mins, while the other takes about 25 mins to resume walking. The image shows "snmp_scrape_duration_seconds" for the netscaler module from the Prometheus interface.
>
> [image: snmp_timeout.PNG]
>
> The problem is not with Prometheus, as you can observe the timeouts when targeting the netscaler from the SNMP exporter web interface, which reports the following error:
>
> An error has occurred while serving metrics:
>
> error collecting metric Desc{fqName: "snmp_error", help: "Error scraping target", constLabels: {}, variableLabels: []}: error walking target example.com: Request timeout (after 3 retries)
>
> The logs for the snmp exporter container show this error:
>
> level=info ts=2020-06-07T23:28:20.946Z caller=collector.go:224 module=citrix_adc target=example.com msg="Error scraping target" err="scrape canceled (possible timeout) walking target example.com"
>
> A few days ago I was using snmp exporter version 0.17.0 and the error was more along the lines of `context canceled`. I realise there were some updates to timeouts made in the latest release, but that doesn't seem to be helping in this situation (see more info about my timeout settings further below).
>
> No noticeable problems are happening from the netscaler's perspective; these are production appliances and everything is running fine.
>
> I am not sure if this is an snmp exporter related problem or a netscaler related problem.
>
> I have done testing from the command line to confirm the netscaler is still responding to SNMP. The command takes longer than it does during the 'non-timeout' period, but it does not time out or fail. The fact that I can run `snmpbulkwalk` on the entire `vserver` table from my command line and get no timeout error during the same period makes me think it's snmp exporter related, whereas the fact that it happens on a regular periodic cycle makes me think it could be something happening on the netscaler.
>
> If I generate a new minimal snmp.conf during the 'timeout period' with the vserver-related OIDs removed, leaving just e.g. netscaler cpu stats, the walks will resume straight away.
>
> When I time running `snmpbulkwalk` on the vserver table (using the Linux `time` command) from the command line, it normally takes about 3s. During the weird hourly 'timeout' period it takes about 6 seconds.
>
> Changing my `timeout` or `max_repetitions` does not seem to have any effect: I have tried setting a timeout value above 30s, and both increasing and decreasing `max_repetitions`, and it still fails. The snmp exporter fails to walk one column of a table, while I can walk the entire table with no failure from the command line.
>
> I cannot see any reference to snmp timeouts or rate limiting being set on the netscaler.
>
> Can anyone help me narrow down if this is an snmp exporter issue or a netscaler issue?
>
> Thanks.
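Regarding the `timeout`, `retries`, and `max_repetitions` settings mentioned above, for anyone following along: in the generator config those sit on the module itself. A rough, untested sketch — the OID, community, and values are placeholders, and snmp.yml has to be regenerated for changes to take effect:

modules:
  citrix_adc:
    walk:
      - 1.3.6.1.4.1.5951      # NetScaler enterprise subtree (placeholder OID)
    version: 2
    auth:
      community: public        # placeholder
    timeout: 20s               # SNMP timeout (placeholder value)
    retries: 3                 # retries before the walk is reported as failed
    max_repetitions: 25        # entries per GETBULK request; try lowering this if the device struggles

Keep in mind that the effective timeout is the shorter of this module timeout and the scrape timeout Prometheus sends, so raising it here only helps if the Prometheus `scrape_timeout` is raised as well.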

