Hi all, I have been running into some strange SNMP walk timeout issues with
snmp_exporter against Citrix NetScaler appliances.
I am running the latest snmp_exporter (0.18.0) as a Docker container.
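For reference, the container is run roughly like the docker-compose sketch below
(the config path assumes the official prom/snmp-exporter image defaults, and the
mount is illustrative rather than my exact setup):

# docker-compose sketch of how the exporter container is run
# (config path assumes the official prom/snmp-exporter image,
#  which reads its config from /etc/snmp_exporter/snmp.yml)
version: "3"
services:
  snmp-exporter:
    image: prom/snmp-exporter:v0.18.0
    ports:
      - "9116:9116"
    volumes:
      - ./snmp.yml:/etc/snmp_exporter/snmp.yml:ro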
If I try to walk the "vServer" or other similar metrics which have a time
series for each vserver (as opposed to e.g. netscaler appliance cpu
metrics), the walks are failing due to timeouts in a bizzarely periodic
way. We currently have around ~420 vservers on each load balancer.
*Behaviour*
The snmp exporter will fail to walk the netscaler at approximately 15 minutes
past the hour, every hour, and will not walk correctly again for 15-20 minutes.
I am walking 2 netscalers, and the scrapes fail on both netscalers at the same
time. One resumes walking after about 15 minutes, while the other takes about
25 minutes to resume. The image shows `snmp_scrape_duration_seconds` for
the netscaler module from the Prometheus interface.
[image: snmp_timeout.PNG]
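For context, the Prometheus scrape job follows the standard snmp_exporter
pattern; the exporter address, interval/timeout values and module name below
are illustrative rather than my exact config:

# prometheus.yml sketch of the scrape job
scrape_configs:
  - job_name: netscaler_snmp
    scrape_interval: 1m
    scrape_timeout: 50s              # upper bound on the whole walk from the Prometheus side
    metrics_path: /snmp
    params:
      module: [citrix_adc]
    static_configs:
      - targets:
          - example.com              # the two netscalers go here
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116   # address of the exporter container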
The problem is not with Prometheus, as the timeouts can be observed when
targeting the netscaler from the SNMP exporter web interface, which reports
the following error:
An error has occurred while serving metrics:
error collecting metric Desc{fqName: "snmp_error", help: "Error scraping
target", constLabels: {}, variableLabels: []}: error walking target
example.com: Request timeout (after 3 retries)
The logs for the snmp exporter container show this error:
level=info ts=2020-06-07T23:28:20.946Z caller=collector.go:224
module=citrix_adc
target=example.com msg="Error scraping target" err="scrape canceled
(possible timeout) walking target example.com"
A few days ago I was using snmp exporter version 0.17.0 and the error was
more along the lines of `context canceled`. I realise there were some
updates to timeout handling in the latest release, but that doesn't seem to be
helping in this situation (see more about my timeout settings further
below).
No noticeable problems are visible from the netscaler's perspective; these
are production appliances and everything is running fine.
I am not sure if this is an snmp exporter related problem or a netscaler
related problem.
I have done testing from the command line to confirm that the netscaler is
still responding to SNMP. During the timeout period the command takes longer
than it normally does, but it does not time out or fail. The fact that I can
run `snmpbulkwalk` on the entire `vserver` table from my command line and get
no timeout error during the same period makes me think it's snmp exporter
related, whereas the fact that it happens on a regular periodic cycle makes
me think it could be something happening on the netscaler.
If I generate a new minimal snmp.yml during the 'timeout period' with the
vserver related OIDs removed, leaving just e.g. the netscaler CPU stats, the
walks resume straight away.
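For illustration, that minimal module is generated from something along these
lines (the OID and community are placeholders, not my exact config):

# generator.yml sketch of the minimal module with the vserver table removed
modules:
  citrix_adc_minimal:
    walk:
      - 1.3.6.1.4.1.5951.4.1.1.41   # placeholder: appliance resource/CPU subtree
    version: 2
    auth:
      community: public             # placeholder community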
When I time `snmpbulkwalk` on the vserver table (using the Linux `time`
command) from the command line, it normally takes about 3 seconds to run.
During the weird hourly 'timeout' period it takes about 6 seconds.
Changing `timeout` or `max_repetitions` does not seem to have any effect:
I have tried setting the timeout to more than 30s, and both increasing and
decreasing `max_repetitions`, and it still fails. The snmp exporter
fails to walk one column of a table, while I can walk the entire table with
no failure from the command line.
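For reference, these knobs are set at the module level in the generator
config, roughly as below (the walk OID and community are placeholders; the
values are examples of what I have tried):

# generator.yml sketch showing the module-level settings I have been varying
modules:
  citrix_adc:
    walk:
      - 1.3.6.1.4.1.5951            # placeholder for my NetScaler OID list
    version: 2
    auth:
      community: public             # placeholder community
    timeout: 30s                    # per-request SNMP timeout
    retries: 3
    max_repetitions: 25             # GETBULK size; tried both higher and lower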
I cannot see any reference to SNMP timeouts or rate limiting being configured
on the netscaler.
Can anyone help me narrow down if this is an snmp exporter issue or a
netscaler issue?
Thanks.