>
> When the SNMP process receives a poll request, it in turn fires off
> requests internally to other processes to get the stats being asked for.
> There is/was (I'm out of touch now) a maximum amount of time SNMP would
> wait for the other processes to respond. If they didn't respond in time,
> the SNMP response was sent without those details, or the query pending an
> answer was simply dropped and no response sent. So problem number one was
> those other processes taking too long to respond.


This is generally true across multiple vendors. The main SNMP process is
responsible for receiving and replying to requests, with separate processes
doing the actual collection of data from the elements. If those collector
processes wedge or bog down (on their own, or because the element being
polled is bogged down, etc.), that timeout bubbles up and you get nothing.

Pretty standard design to segment things this way.
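That master/collector split with a bounded wait can be sketched roughly like
this. This is an illustrative Python sketch, not any vendor's actual code;
the names (poll, make_collector, the counters) are all made up. The point is
just that a collector missing the deadline is silently dropped from the
reply, giving the partial-response behaviour described above:

```python
# Sketch: an SNMP "master" process fans a poll out to per-subsystem
# collector workers and waits a bounded time for each. Collectors that
# miss the deadline are omitted from the reply entirely.
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as CollectorTimeout

TIMEOUT_S = 1.0  # max time the master waits for all collectors


def make_collector(name, delay, value):
    """Pretend to fetch a counter from another process, taking `delay` seconds."""
    def collect():
        time.sleep(delay)
        return name, value
    return collect


def poll(collectors):
    """Gather stats, dropping any collector that exceeds the shared deadline."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(collectors)) as pool:
        futures = [pool.submit(c) for c in collectors]
        deadline = time.monotonic() + TIMEOUT_S
        for fut in futures:
            remaining = deadline - time.monotonic()
            try:
                name, value = fut.result(timeout=max(remaining, 0))
                results[name] = value
            except CollectorTimeout:
                pass  # wedged collector: the reply goes out without its stats
    return results


fast = make_collector("ifInOctets", 0.01, 12345)
slow = make_collector("ifOutOctets", 2.0, 67890)  # "bogged down"
stats = poll([fast, slow])
print(stats)  # → {'ifInOctets': 12345}
```

A real agent would more likely drop or error the whole PDU per the varbinds
it couldn't satisfy, but the failure mode is the same: slow internal
responders silently degrade or kill the external response.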

On Sun, Aug 3, 2025 at 3:12 AM James Bensley via NANOG <
[email protected]> wrote:

> On Friday, 1 August 2025 at 15:10, Drew Weaver via NANOG <
> [email protected]> wrote:
> >
> >
> > Hello,
>
> Hi Drew.
>
> I haven't worked with IOS-XR for a few years but I have had problems with
> SNMP in the past.
>
> A few years ago I was deploying 9904 chassis with a modest number of
> services on them (not thousands of services per chassis, but hundreds, so
> they weren't idle, but certainly not under any load worth mentioning
> control-plane wise).
>
> We noticed that SNMP polling was returning nothing for some of the
> services, and it ended up being a couple of problems compounding. At that
> time we had virtually every 9xxx and 99xx chassis in the network. This
> problem only existed on these boxes, but they were also the only routers
> in the network with this exact combination of services on them, so I
> don't believe it was anything chassis specific; this was on IOS-XR
> 6.something for reference.
>
> When the SNMP process receives a poll request, it in turn fires off
> requests internally to other processes to get the stats being asked for.
> There is/was (I'm out of touch now) a maximum amount of time SNMP would
> wait for the other processes to respond. If they didn't respond in time,
> the SNMP response was sent without those details, or the query pending an
> answer was simply dropped and no response sent. So problem number one was
> those other processes taking too long to respond.
>
> Problem number two was that those other processes had a bug: after
> provisioning new services, those processes hadn't picked up on the
> changes. When the request came from the SNMP process for stats relating
> to service X, the other processes had no knowledge of X.
>
> TAC provided us with a short-term workaround, which was to restart some
> processes after provisioning new services, to ensure the processes were
> aware of the new services and would respond to the SNMP process with the
> requested stats. Long term, they created a DDTS and SMU to fix the
> inter-process timeout issue and the missing stats issue.
>
> I don't know exactly what you're polling, and like I said, I'm a bit out
> of touch here, but I can say that it took quite a lot of digging and
> working with TAC to bottom out the problem. We could replicate the issue
> in the lab, which always helps. So if you can replicate the issue in the
> lab and turn all debugging settings up to 11, you might be able to find
> something like we did (TAC sent some debug commands and we could trace
> the issue in the lab; IPC debugging is hard on these boxes!). Even if TAC
> are trying to fob you off by saying "oh yeah, this is dropped by LPTS as
> expected", get them to prove it to you: replicate the issue in the lab
> and gather the debug info which shows how/where the request is being
> dropped. If they can't find the drop in LPTS, then LPTS isn't the problem
> and you need to look elsewhere, like IPC/EOBC.
>
>
> Cheers,
> James.
>
> _______________________________________________
> NANOG mailing list
>
> https://lists.nanog.org/archives/list/[email protected]/message/LFEK3EROE2TNHT7KOSM5WMW5HXGR4LQL/
_______________________________________________
NANOG mailing list 
https://lists.nanog.org/archives/list/[email protected]/message/D4XN3V37DIIZM3PBTCS5DOI7LXZB5YVH/
