On Mon, Oct 31, 2022 at 11:05:07AM -0700, Ryan Freeman wrote:
> On Sun, Oct 30, 2022 at 09:21:00AM +0100, Martijn van Duren wrote:
> > On Fri, 2022-10-28 at 13:10 -0700, Ryan Freeman wrote:
> > > On Fri, Oct 28, 2022 at 01:22:57PM +0200, Martijn van Duren wrote:
> > > > I wondered that as well, but I tried to simulate the not found and
> > > > error code-paths, but I couldn't trigger it. So I'm not ruling it
> > > > out, I just can't reproduce it.
> > > >
> > > > Another thing that's weird is that it looks like the index has been
> > > > stripped from sensorStatus, which might be an indication that
> > > > something weird is going on inside libagentx. But like I said:
> > > > without a reproducer I haven't been able to pin it down.
> > > >
> > > > So the additional verbose information should be useful.
> > > > Come to think of it: The `sysctl hw.sensors` output might be
> > > > helpful as well, both on a succeeding run, as well as at the time
> > > > of the crash (maybe something like:
> > > > `while true; do date; sysctl hw.sensors; sleep 1; done > \
> > > > /path/to/output`)
> > > >
> > > As the offending machines are VMs, hw.sensors actually returns
> > > nothing. I will send you the output for all of 'hw' key, and
> > > log output for snmpd -vv when the issue arrives.
> > >
> > > It does seem to coincide with librenms's discovery process, which
> > > comes from librenms upstream as this cron job (on a linux machine):
> > > 33 */6 * * * librenms /opt/librenms/cronic
> > > /opt/librenms/discovery-wrapper.py 1
> > >
> > > So, it is the one job running every ~6 hours which would match up with
> > > when snmpd is dying on these OpenBSD 7.2 VMs. I still have 30+ VMs
> > > on <7.2 that are OK. Any physical machines I've upgraded to 7.2 are
> > > only at home, not $WORKPLACE where librenms lives. Not trying to be
> > > noisy, just hopefully narrow down the actual cause :) Thanks for
> > > the hints!
> > >
> > > Regards,
> > > -Ryan
> > >
> > >
> > I managed to reproduce it with an empty sensors table and doing a
> > getnext request on sensorNumber.0.
> >
> > The problem was that the internal OID was incremented from
> > sensorNumber.0 to sensorStatus, which then triggers an endOfMibView.
> > When returning a response this incremented value is then sent back to
> > snmpd, while in the case of an endOfMibView it must be the value
> > requested by snmpd (at least for the getnext case, which is what is
> > being used here).
> >
> > Diff below resets this key on endOfMibView and fixes the problem for
> > me. Can you confirm this?
> >
> > Assuming this also fixes things for Ryan: OK?
> >
> > martijn@
> >
> > Index: agentx.c
> > ===================================================================
> > RCS file: /cvs/src/lib/libagentx/agentx.c,v
> > retrieving revision 1.19
> > diff -u -p -r1.19 agentx.c
> > --- agentx.c 14 Oct 2022 15:26:58 -0000 1.19
> > +++ agentx.c 30 Oct 2022 08:19:29 -0000
> > @@ -3426,6 +3426,8 @@ agentx_varbind_endofmibview(struct agent
> >                  return;
> >          }
> >
> > +        bcopy(&(axv->axv_start), &(axv->axv_vb.avb_oid),
> > +            sizeof(axv->axv_start));
> >          axv->axv_vb.avb_type = AX_DATA_TYPE_ENDOFMIBVIEW;
> >
> >          if (axv->axv_axo != NULL)
> >
>
> Thanks Martijn,
>
> I applied a slightly offset patch** to a 7.2-stable tree, rebuilt libagentx
> and installed the new libagentx.so.1.0 on an affected host. snmpd has been
> running for just about 12 hours now, I think this might have solved it. I
> am going to copy this adjusted libagentx to another host in the meantime,
> and continue watching.
>
> -Ryan
>
> **Patch to 7.2-stable:
>
> Index: agentx.c
> ===================================================================
> RCS file: /cvs/src/lib/libagentx/agentx.c,v
> retrieving revision 1.17
> diff -u -p -r1.17 agentx.c
> --- agentx.c 13 Sep 2022 10:20:22 -0000 1.17
> +++ agentx.c 31 Oct 2022 06:29:45 -0000
> @@ -3342,6 +3342,8 @@ agentx_varbind_endofmibview(struct agent
>                  return;
>          }
>
> +        bcopy(&(axv->axv_start), &(axv->axv_vb.avb_oid),
> +            sizeof(axv->axv_start));
>          axv->axv_vb.avb_type = AX_DATA_TYPE_ENDOFMIBVIEW;
>
>          if (axv->axv_axo != NULL)
>
I can confirm the snmpd process is no longer disappearing with this
patch. Almost 24 hours on one VM and 16 hours on another.

Thanks!
-Ryan
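
For anyone reading along in the archive, here is a minimal standalone sketch of what the diff does, as I understand it from Martijn's description. It is not libagentx code. The struct layout, the function names (varbind_init, varbind_advance, varbind_endofmibview, oid_equal) and the idea of bumping the last sub-identifier to stand in for the getnext search are all simplified assumptions made up for illustration. The only part taken from the thread is the idea itself: on endOfMibView the working OID is reset to the OID that was originally requested.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OID_MAX_LEN 16

/* Hypothetical, simplified stand-ins for the real libagentx structures. */
struct oid {
	uint32_t id[OID_MAX_LEN];
	size_t   n;
};

enum vb_type { VB_UNSET, VB_ENDOFMIBVIEW };

struct varbind {
	struct oid   start;	/* OID as originally requested */
	struct oid   cur;	/* working OID, changed during the search */
	enum vb_type type;
};

static int
oid_equal(const struct oid *a, const struct oid *b)
{
	return a->n == b->n &&
	    memcmp(a->id, b->id, a->n * sizeof(a->id[0])) == 0;
}

static void
varbind_init(struct varbind *vb, const uint32_t *id, size_t n)
{
	assert(n > 0 && n <= OID_MAX_LEN);
	memset(vb, 0, sizeof(*vb));
	memcpy(vb->start.id, id, n * sizeof(id[0]));
	vb->start.n = n;
	vb->cur = vb->start;
	vb->type = VB_UNSET;
}

/*
 * Assumption: model "the internal OID was incremented" by bumping the
 * last sub-identifier. The real search logic is more involved.
 */
static void
varbind_advance(struct varbind *vb)
{
	assert(vb->cur.n > 0);
	vb->cur.id[vb->cur.n - 1]++;
}

/*
 * Mirrors the idea of the patch: restore the requested OID before
 * marking the varbind as endOfMibView, so the advanced OID is not
 * what gets reported back.
 */
static void
varbind_endofmibview(struct varbind *vb)
{
	memcpy(&vb->cur, &vb->start, sizeof(vb->start));
	vb->type = VB_ENDOFMIBVIEW;
}
```

With a placeholder OID (the numbers are not the real sensor MIB values), calling varbind_init(), then varbind_advance(), leaves cur different from start. After varbind_endofmibview() the two compare equal again and the type is VB_ENDOFMIBVIEW, which is the state the patched function is meant to leave behind.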
