On Sun, Oct 30, 2022 at 09:21:00AM +0100, Martijn van Duren wrote:
> On Fri, 2022-10-28 at 13:10 -0700, Ryan Freeman wrote:
> > On Fri, Oct 28, 2022 at 01:22:57PM +0200, Martijn van Duren wrote:
> > > I wondered that as well, but I tried to simulate the not found and
> > > error code-paths, but I couldn't trigger it. So I'm not ruling it
> > > out, I just can't reproduce it.
> > >
> > > Another thing that's weird is that it looks like the index has been
> > > > stripped from sensorStatus, which might be an indication that
> > > > something weird is going on inside libagentx. But like I said: without a
> > > reproducer I haven't been able to pin it down.
> > >
> > > So the additional verbose information should be useful.
> > > Come to think of it: The `sysctl hw.sensors` output might be
> > > helpful as well, both on a succeeding run, as well as at the time
> > > of the crash (maybe something like:
> > > `while true; do date; sysctl hw.sensors; sleep 1; done > \
> > > /path/to/output`)
> >
> > > As the offending machines are VMs, hw.sensors actually returns
> > > nothing. I will send you the output for the whole 'hw' key, and
> > > the log output from snmpd -vv when the issue recurs.
> >
> > It does seem to coincide with librenms's discovery process, which
> > comes from librenms upstream as this cron job (on a linux machine):
> > > 33 */6 * * * librenms /opt/librenms/cronic /opt/librenms/discovery-wrapper.py 1
> >
> > So, it is the one job running every ~6 hours which would match up with
> > when snmpd is dying on these OpenBSD 7.2 VMs. I still have 30+ VMs
> > on <7.2 that are OK. Any physical machines I've upgraded to 7.2 are
> > only at home, not $WORKPLACE where librenms lives. Not trying to be
> > noisy, just hopefully narrow down the actual cause :) Thanks for
> > the hints!
> >
> > Regards,
> > -Ryan
> >
> >
> I managed to reproduce it with an empty sensors table and doing a
> getnext request on sensorNumber.0.
>
> > The problem was that the internal OID was incremented from
> > sensorNumber.0 to sensorStatus, which then triggers an endOfMibView.
> > When returning a response this incremented value is then sent back to
> > snmpd, while in the case of an endOfMibView it must be the value
> > requested by snmpd (at least for the getnext case, which is what is
> > being used here).
>
> Diff below resets this key on endOfMibView and fixes the problem for
> me. Can you confirm this?
>
> Assuming this also fixes things for Ryan: OK?
>
> martijn@
>
> Index: agentx.c
> ===================================================================
> RCS file: /cvs/src/lib/libagentx/agentx.c,v
> retrieving revision 1.19
> diff -u -p -r1.19 agentx.c
> --- agentx.c 14 Oct 2022 15:26:58 -0000 1.19
> +++ agentx.c 30 Oct 2022 08:19:29 -0000
> @@ -3426,6 +3426,8 @@ agentx_varbind_endofmibview(struct agent
> return;
> }
>
> + bcopy(&(axv->axv_start), &(axv->axv_vb.avb_oid),
> + sizeof(axv->axv_start));
> axv->axv_vb.avb_type = AX_DATA_TYPE_ENDOFMIBVIEW;
>
> if (axv->axv_axo != NULL)
>
Thanks Martijn,

I applied a slightly offset patch** to a 7.2-stable tree, rebuilt libagentx
and installed the new libagentx.so.1.0 on an affected host. snmpd has been
running for just about 12 hours now, so I think this might have solved it. I
am going to copy this adjusted libagentx to another host in the meantime,
and continue watching.

-Ryan

**Patch to 7.2-stable:

Index: agentx.c
===================================================================
RCS file: /cvs/src/lib/libagentx/agentx.c,v
retrieving revision 1.17
diff -u -p -r1.17 agentx.c
--- agentx.c 13 Sep 2022 10:20:22 -0000 1.17
+++ agentx.c 31 Oct 2022 06:29:45 -0000
@@ -3342,6 +3342,8 @@ agentx_varbind_endofmibview(struct agent
return;
}

+ bcopy(&(axv->axv_start), &(axv->axv_vb.avb_oid),
+ sizeof(axv->axv_start));
axv->axv_vb.avb_type = AX_DATA_TYPE_ENDOFMIBVIEW;

if (axv->axv_axo != NULL)
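
For anyone following along, the failure mode described above (a getnext on
an empty table advancing the working OID before giving up with endOfMibView)
can be modelled in a few lines of standalone C. This is only a sketch under
stated assumptions: `struct oid`, `struct varbind`, `getnext_empty_table()`
and the `reset_on_end` flag are hypothetical names made up for illustration
and are not the libagentx types or functions. The only part taken from the
diff is the idea of copying the requested OID back over the working OID
before marking the varbind as endOfMibView.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OID_MAX_LEN 16

/* Hypothetical stand-in for an AgentX OID; not the libagentx type. */
struct oid {
	uint32_t id[OID_MAX_LEN];
	size_t n;
};

enum vb_type { VB_INTEGER, VB_ENDOFMIBVIEW };

/* Hypothetical stand-in for a varbind that is being resolved. */
struct varbind {
	struct oid start;	/* OID as requested by the master agent */
	struct oid cur;		/* OID that will be sent in the response */
	enum vb_type type;
};

/* Return non-zero when both OIDs have the same length and sub-ids. */
static int
oid_equal(const struct oid *a, const struct oid *b)
{
	return a->n == b->n &&
	    memcmp(a->id, b->id, a->n * sizeof(a->id[0])) == 0;
}

/*
 * Model a getnext search that walks off the end of an empty table.
 * With reset_on_end == 0 the advanced OID leaks into the response;
 * with reset_on_end != 0 the requested OID is restored first, which
 * mirrors the intent of the bcopy() added in the diff.
 */
static void
getnext_empty_table(struct varbind *vb, const struct oid *next_object,
    int reset_on_end)
{
	assert(vb != NULL && next_object != NULL);

	/* The search advances the working OID to the next object... */
	vb->cur = *next_object;

	/* ...finds no rows there, and gives up with endOfMibView. */
	if (reset_on_end)
		vb->cur = vb->start;
	vb->type = VB_ENDOFMIBVIEW;
}
```

Calling `getnext_empty_table()` with `reset_on_end` set to 0 leaves `cur`
pointing at the advanced OID, which is the mismatch snmpd trips over; with
it set to 1, `cur` compares equal to `start` again.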