On Tue, Nov 01, 2022 at 11:04:03AM +0100, Martijn van Duren wrote:
> On Mon, 2022-10-31 at 20:14 -0700, Ryan Freeman wrote:
> >
> > I can confirm the snmpd process is no-longer disappearing with this
> > patch. Almost 24 hours on one VM and 16 hours on another. Thanks!
> >
> > -Ryan
>
> To be complete, what happens is the following:
> - snmpd sends a getnext request to the backend on a scalar
> - libagentx increments the current OID to the OID of the table
>   column following the scalar, which contains no elements and after
>   reaching the last column it has reached the end OID of the original
>   request, resulting in an endOfMibView, but forgets to reset the OID to
>   the original start OID (as per RFC3416 section 4.2.2)
> - snmpd validates the output from the backend and sees that the
>   OID of the EOMV doesn't match the requested OID and decides that
>   it doesn't trust the backend anymore; It is then closed with a
>   "too many parse errors" notification.
> - Upon the closing of the agentx socket the backend shuts itself
>   down:
>   1) It gets its fd from snmpd and it doesn't know where to connect
>      to.
>   2) We don't want lingering processes if snmpd itself goes away
> - Once a backend disappears snmpd shuts itself down. Basically for
>   the fail, fail loud reasons.
>
> Note that this only goes for backends under libexec/snmpd, not for
> backends that connect over the agentx listener, like vmd or relayd.
>
> So there's no crash, just a backend that's being kicked for returning a
> non-compliant varbind, which escalated to a premature exit. I also don't
> expect too many people will actually hit this, because it's quite a
> specific set of circumstances: I've had to set up an instance under kvm
> and disable viomb(4) to get an empty sensors table, although there might
> be other ways to trigger this.
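
For anyone following along, the sequence described above can be modelled
in a few lines. This is only a rough sketch in Python, not the libagentx
or snmpd code: the function names, the reset_oid flag and the short OIDs
are all made up for illustration, and an empty table column is simply
represented as an entry whose value is None.

```python
# Toy model of the getnext / endOfMibView exchange described above.
# Everything here is hypothetical: it is NOT libagentx or snmpd code.

END_OF_MIB_VIEW = "endOfMibView"


def backend_getnext(start, end, objects, reset_oid=True):
    """Return (oid, value) for the first populated object in (start, end).

    objects maps OID tuples to values; a value of None stands for a
    table column that has no rows. With reset_oid=False the function
    mimics the bug: the OID it walked to is returned with the EOMV.
    """
    current = start
    for oid in sorted(objects):
        if start < oid < end:
            if objects[oid] is not None:
                return oid, objects[oid]
            current = oid  # empty column, keep walking
    # Nothing found before the end OID. RFC 3416 section 4.2.2 wants the
    # name from the request to be returned together with endOfMibView.
    return (start if reset_oid else current), END_OF_MIB_VIEW


def master_accepts(requested, varbind):
    """Mimic the master-side sanity check on a getnext reply."""
    oid, value = varbind
    if value == END_OF_MIB_VIEW:
        return oid == requested  # EOMV must carry the requested OID
    return oid > requested       # otherwise the OID must have advanced


# A scalar followed by a table with three empty columns (made-up OIDs).
scalar = (1, 1, 0)
end = (1, 3)
mib = {scalar: 0, (1, 2, 1, 1): None, (1, 2, 1, 2): None, (1, 2, 1, 3): None}

buggy = backend_getnext(scalar, end, mib, reset_oid=False)
fixed = backend_getnext(scalar, end, mib, reset_oid=True)
print(buggy, master_accepts(scalar, buggy))
print(fixed, master_accepts(scalar, fixed))
```

In this model the buggy reply carries the OID of the last empty column
and is rejected, while the fixed reply carries the requested OID and is
accepted.
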

Ah, there it is. Our KVM platform is Proxmox, and we go out of our way
to untick the 'memory ballooning' option every time we make a VM. Up to
now, I've been wondering how we managed to have such a unique setup.

I will probably keep a local build of libagentx for the duration of the
7.2 lifetime and fan that out, in lieu of turning on memory ballooning
just to get a sensor to exist. I'll also keep some instances running
-current in our LibreNMS to help catch this sort of thing before the
next release.

Thanks for the detailed explanation, and thanks again for the work to
figure out cause+solution.

-Ryan
