What is your scrape interval and scrape timeout on the Prometheus side? Prometheus sends a default scrape timeout of 10s to the exporter. The exporter timeout is only used if the timeout from the Prometheus server is longer.
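For context, the relevant knobs on the Prometheus side are `scrape_interval` and `scrape_timeout` on the snmp job. A minimal sketch of what that might look like — the job name, exporter address, and the 60s/30s values below are illustrative placeholders, not taken from your setup:

scrape_configs:
  - job_name: snmp_netscaler          # placeholder name
    scrape_interval: 60s
    scrape_timeout: 30s               # default is 10s; this is the timeout sent to the exporter
    metrics_path: /snmp
    params:
      module: [citrix_adc]
    static_configs:
      - targets:
          - netscaler1.example.com    # placeholder targets
          - netscaler2.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116   # address where the snmp_exporter is reachable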
On Mon, Jun 8, 2020 at 1:39 AM Justin Teare <[email protected]> wrote:

> Hi all, I have been running into some strange snmp walk timeout issues with snmp exporter against citrix netscaler appliances.
>
> Running the latest (0.18.0) snmp exporter as a docker container.
>
> If I try to walk the "vServer" or other similar metrics which have a time series for each vserver (as opposed to e.g. netscaler appliance cpu metrics), the walks are failing due to timeouts in a bizarrely periodic way. We currently have around ~420 vservers on each load balancer.
>
> *Behaviour*
>
> The snmp exporter will fail to walk the netscaler at approx 15 mins past the hour, every hour, and will not walk again correctly for 15-20 mins. I am walking 2 netscalers, and the scrapes fail on both netscalers at the same time. One resumes walking after about 15 mins, while the other takes about 25 mins to resume walking. The image shows "snmp_scrape_duration_seconds" for the netscaler module from the Prometheus interface.
>
> [image: snmp_timeout.PNG]
>
> The problem is not with Prometheus, as you can observe the timeouts when targeting the netscaler from the SNMP exporter web interface, which reports the following error:
>
> An error has occurred while serving metrics:
>
> error collecting metric Desc{fqName: "snmp_error", help: "Error scraping target", constLabels: {}, variableLabels: []}: error walking target example.com: Request timeout (after 3 retries)
>
> The logs for the snmp exporter container show this error:
>
> level=info ts=2020-06-07T23:28:20.946Z caller=collector.go:224 module=citrix_adc target=example.com msg="Error scraping target" err="scrape canceled (possible timeout) walking target example.com"
>
> A few days ago I was using snmp exporter version 0.17.0 and the error was more along the lines of `context canceled`. I realise there were some updates to timeouts made in the latest release, but that doesn't seem to be helping in this situation (see more info about my timeout settings further below).
>
> No noticeable problems are happening from the netscaler's perspective; these are production appliances and everything is running fine.
>
> I am not sure if this is an snmp exporter related problem or a netscaler related problem.
>
> I have done testing from the command line to confirm the netscaler is still responding to SNMP. The command takes longer than it does during the 'non-timeout' period, but it does not time out or fail. The fact that I can run `snmpbulkwalk` on the entire `vserver` table from my command line and get no timeout error during the same period makes me think it's snmp exporter related, whereas the fact that it happens on a regular periodic cycle makes me think it could be something happening on the netscaler.
>
> If I generate a new minimal snmp.conf during the 'timeout period' with the vserver-related OIDs removed, leaving just e.g. netscaler cpu stats, the walks will resume straight away.
>
> When I time running `snmpbulkwalk` on the vserver table (using the Linux `time` command) from the command line, it normally takes about 3s. During the weird hourly 'timeout' period it takes about 6 seconds.
>
> Changing my `timeout` or `max_repetitions` does not seem to have any effect: I have tried setting a timeout value above 30s, and both increasing and decreasing `max_repetitions`, and it still fails. The snmp exporter fails to walk one column of a table, while I can walk the entire table with no failure from the command line.
>
> I cannot see any reference to snmp timeouts or rate limiting being set on the netscaler.
>
> Can anyone help me narrow down if this is an snmp exporter issue or a netscaler issue?
>
> Thanks.
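Regarding the `timeout`, `retries`, and `max_repetitions` settings mentioned above, for anyone following along: in the generator config those sit on the module itself. A rough, untested sketch — the OID, community, and values are placeholders, and snmp.yml has to be regenerated for changes to take effect:

modules:
  citrix_adc:
    walk:
      - 1.3.6.1.4.1.5951      # NetScaler enterprise subtree (placeholder OID)
    version: 2
    auth:
      community: public        # placeholder
    timeout: 20s               # SNMP timeout (placeholder value)
    retries: 3                 # retries before the walk is reported as failed
    max_repetitions: 25        # entries per GETBULK request; try lowering this if the device struggles

Keep in mind that the effective timeout is the shorter of this module timeout and the scrape timeout Prometheus sends, so raising it here only helps if the Prometheus `scrape_timeout` is raised as well.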

