On Sun, Oct 30, 2022 at 09:21:00AM +0100, Martijn van Duren wrote:
> On Fri, 2022-10-28 at 13:10 -0700, Ryan Freeman wrote:
> > On Fri, Oct 28, 2022 at 01:22:57PM +0200, Martijn van Duren wrote:
> > > I wondered that as well, but I tried to simulate the not found and
> > > error code-paths, but I couldn't trigger it. So I'm not ruling it
> > > out, I just can't reproduce it.
> > > 
> > > Another thing that's weird is that it looks like the index has been
> > > stripped from sensorStatus, which might be an indication that
> > > > something weird is going on inside libagentx. But like I said: without a
> > > reproducer I haven't been able to pin it down.
> > > 
> > > So the additional verbose information should be useful.
> > > Come to think of it: The `sysctl hw.sensors` output might be
> > > helpful as well, both on a succeeding run, as well as at the time
> > > of the crash (maybe something like:
> > > `while true; do date; sysctl hw.sensors; sleep 1; done > \
> > > /path/to/output`)
> > 
> > As the offending machines are VMs, hw.sensors actually returns
> > nothing.  I will send you the output for all of 'hw' key, and
> > log output for snmpd -vv when the issue arrives.
> > 
> > It does seem to coincide with librenms's discovery process, which
> > comes from librenms upstream as this cron job (on a linux machine):
> > 33 */6 * * * librenms /opt/librenms/cronic /opt/librenms/discovery-wrapper.py 1
> > 
> > So, it is the one job running every ~6 hours which would match up with
> > when snmpd is dying on these OpenBSD 7.2 VMs.  I still have 30+ VMs
> > on <7.2 that are OK.  Any physical machines I've upgraded to 7.2 are
> > only at home, not $WORKPLACE where librenms lives.  Not trying to be
> > noisy, just hopefully narrow down the actual cause :)  Thanks for
> > the hints!
> > 
> > Regards,
> > -Ryan
> > 
> > 
> I managed to reproduce it with an empty sensors table and doing a
> getnext request on sensorNumber.0.
> 
> The problem was that the internal OID was incremented from
> sensorNumber.0 to sensorStatus, which then triggers an endOfMibView.
> When returning a response this incremented value is then sent back to
> snmpd, while in the case of an endOfMibView it must be the value
> requested by snmpd (at least for the getnext case, which is what is
> being used here).
> 
> Diff below resets this key on endOfMibView and fixes the problem for
> me. Can you confirm this?
> 
> Assuming this also fixes things for Ryan: OK?
> 
> martijn@
> 
> Index: agentx.c
> ===================================================================
> RCS file: /cvs/src/lib/libagentx/agentx.c,v
> retrieving revision 1.19
> diff -u -p -r1.19 agentx.c
> --- agentx.c  14 Oct 2022 15:26:58 -0000      1.19
> +++ agentx.c  30 Oct 2022 08:19:29 -0000
> @@ -3426,6 +3426,8 @@ agentx_varbind_endofmibview(struct agent
>               return;
>       }
>  
> +     bcopy(&(axv->axv_start), &(axv->axv_vb.avb_oid),
> +         sizeof(axv->axv_start));
>       axv->axv_vb.avb_type = AX_DATA_TYPE_ENDOFMIBVIEW;
>  
>       if (axv->axv_axo != NULL)
> 

Thanks Martijn,

I applied a slightly offset patch** to a 7.2-stable tree, rebuilt libagentx
and installed the new libagentx.so.1.0 on an affected host.  snmpd has been
running for just about 12 hours now, so I think this might have solved it.
I am going to copy this adjusted libagentx to another host in the meantime
and continue watching.
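
For anyone reading along in the archives, here is a rough sketch of how I
understand the fix: on endOfMibView the varbind has to carry the OID that
snmpd originally asked for, not the one the subagent advanced to while
searching.  This is only a simplified model.  Every sketch_* name below is
made up for illustration and is not the real libagentx structure or API,
and bumping the last sub-identifier is just a stand-in for the real
getnext search.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_OID_MAX 16

/* Hypothetical, simplified stand-ins; NOT the real libagentx types. */
struct sketch_oid {
	uint32_t id[SKETCH_OID_MAX];
	size_t   n;
};

enum sketch_type { SKETCH_UNSET, SKETCH_ENDOFMIBVIEW };

struct sketch_varbind {
	struct sketch_oid start;	/* OID as requested by the master */
	struct sketch_oid oid;		/* working OID, advanced by the search */
	enum sketch_type  type;
};

/* Set both the requested and the working OID to the same value. */
static void
sketch_varbind_init(struct sketch_varbind *vb, const uint32_t *ids, size_t n)
{
	assert(vb != NULL && ids != NULL && n > 0 && n <= SKETCH_OID_MAX);
	memset(vb, 0, sizeof(*vb));
	memcpy(vb->start.id, ids, n * sizeof(ids[0]));
	vb->start.n = n;
	vb->oid = vb->start;
}

static int
sketch_oid_equal(const struct sketch_oid *a, const struct sketch_oid *b)
{
	return a->n == b->n &&
	    memcmp(a->id, b->id, a->n * sizeof(a->id[0])) == 0;
}

/* Stand-in for the getnext search moving on to the next object. */
static void
sketch_getnext_step(struct sketch_varbind *vb)
{
	assert(vb != NULL && vb->oid.n > 0);
	vb->oid.id[vb->oid.n - 1]++;
}

/*
 * Mirrors the idea of the diff: restore the requested OID first,
 * then mark the varbind as endOfMibView.
 */
static void
sketch_endofmibview(struct sketch_varbind *vb)
{
	memcpy(&vb->oid, &vb->start, sizeof(vb->start));
	vb->type = SKETCH_ENDOFMIBVIEW;
}
```

Without the memcpy in sketch_endofmibview() the response would go out with
the advanced OID, which matches what Martijn described for the empty
sensors table.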

-Ryan

**Patch to 7.2-stable:

Index: agentx.c
===================================================================
RCS file: /cvs/src/lib/libagentx/agentx.c,v
retrieving revision 1.17
diff -u -p -r1.17 agentx.c
--- agentx.c    13 Sep 2022 10:20:22 -0000      1.17
+++ agentx.c    31 Oct 2022 06:29:45 -0000
@@ -3342,6 +3342,8 @@ agentx_varbind_endofmibview(struct agent
                return;
        }
 
+       bcopy(&(axv->axv_start), &(axv->axv_vb.avb_oid),
+               sizeof(axv->axv_start));
        axv->axv_vb.avb_type = AX_DATA_TYPE_ENDOFMIBVIEW;
 
        if (axv->axv_axo != NULL)
