On Mon, Oct 31, 2022 at 11:05:07AM -0700, Ryan Freeman wrote:
> On Sun, Oct 30, 2022 at 09:21:00AM +0100, Martijn van Duren wrote:
> > On Fri, 2022-10-28 at 13:10 -0700, Ryan Freeman wrote:
> > > On Fri, Oct 28, 2022 at 01:22:57PM +0200, Martijn van Duren wrote:
> > > > I wondered that as well, but I tried to simulate the not found and
> > > > error code-paths, but I couldn't trigger it. So I'm not ruling it
> > > > out, I just can't reproduce it.
> > > > 
> > > > Another thing that's weird is that it looks like the index has been
> > > > stripped from sensorStatus, which might be an indication that
> > > > something weird is going on inside libagentx. But like I said: without a
> > > > reproducer I haven't been able to pin it down.
> > > > 
> > > > So the additional verbose information should be useful.
> > > > Come to think of it: The `sysctl hw.sensors` output might be
> > > > helpful as well, both on a succeeding run, as well as at the time
> > > > of the crash (maybe something like:
> > > > `while true; do date; sysctl hw.sensors; sleep 1; done > \
> > > > /path/to/output`)
> > > 
> > > As the offending machines are VMs, hw.sensors actually returns
> > > nothing.  I will send you the output for the whole 'hw' tree, and
> > > the snmpd -vv log output when the issue next occurs.
> > > 
> > > It does seem to coincide with librenms's discovery process, which
> > > comes from librenms upstream as this cron job (on a linux machine):
> > > 33 */6 * * * librenms /opt/librenms/cronic /opt/librenms/discovery-wrapper.py 1
> > > 
> > > So, it is the one job running every ~6 hours which would match up with
> > > when snmpd is dying on these OpenBSD 7.2 VMs.  I still have 30+ VMs
> > > on <7.2 that are OK.  Any physical machines I've upgraded to 7.2 are
> > > only at home, not $WORKPLACE where librenms lives.  Not trying to be
> > > noisy, just hopefully narrow down the actual cause :)  Thanks for
> > > the hints!
> > > 
> > > Regards,
> > > -Ryan
> > > 
> > > 
> > I managed to reproduce it with an empty sensors table and doing a
> > getnext request on sensorNumber.0.
> > 
> > The problem was that the internal OID was incremented from
> > sensorNumber.0 to sensorStatus, which then triggers an endOfMibView.
> > When returning a response this incremented value is then sent back to
> > snmpd, while in the case of an endOfMibView it must be the value
> > requested by snmpd (at least for the getnext case, which is what is
> > being used here).
> > 
> > Diff below resets this key on endOfMibView and fixes the problem for
> > me. Can you confirm this?
> > 
> > Assuming this also fixes things for Ryan: OK?
> > 
> > martijn@
> > 
> > Index: agentx.c
> > ===================================================================
> > RCS file: /cvs/src/lib/libagentx/agentx.c,v
> > retrieving revision 1.19
> > diff -u -p -r1.19 agentx.c
> > --- agentx.c        14 Oct 2022 15:26:58 -0000      1.19
> > +++ agentx.c        30 Oct 2022 08:19:29 -0000
> > @@ -3426,6 +3426,8 @@ agentx_varbind_endofmibview(struct agent
> >             return;
> >     }
> >  
> > +   bcopy(&(axv->axv_start), &(axv->axv_vb.avb_oid),
> > +       sizeof(axv->axv_start));
> >     axv->axv_vb.avb_type = AX_DATA_TYPE_ENDOFMIBVIEW;
> >  
> >     if (axv->axv_axo != NULL)
> > 
> 
> Thanks Martijn,
> 
> I applied a slightly offset patch** to a 7.2-stable tree, rebuilt libagentx
> and installed the new libagentx.so.1.0 on an affected host.  snmpd has been
> running for just about 12 hours now, so I think this might have solved it.  I
> am going to copy this adjusted libagentx to another host in the meantime,
> and continue watching.
> 
> -Ryan
> 
> **Patch to 7.2-stable:
> 
> Index: agentx.c
> ===================================================================
> RCS file: /cvs/src/lib/libagentx/agentx.c,v
> retrieving revision 1.17
> diff -u -p -r1.17 agentx.c
> --- agentx.c  13 Sep 2022 10:20:22 -0000      1.17
> +++ agentx.c  31 Oct 2022 06:29:45 -0000
> @@ -3342,6 +3342,8 @@ agentx_varbind_endofmibview(struct agent
>               return;
>       }
>  
> +     bcopy(&(axv->axv_start), &(axv->axv_vb.avb_oid),
> +             sizeof(axv->axv_start));
>       axv->axv_vb.avb_type = AX_DATA_TYPE_ENDOFMIBVIEW;
>  
>       if (axv->axv_axo != NULL)
> 

I can confirm the snmpd process is no longer disappearing with this
patch.  Almost 24 hours on one VM and 16 hours on another. Thanks!

-Ryan
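
For readers following the diff above: the fix copies the OID that snmpd
originally asked for back into the varbind before marking it as
endOfMibView, so the response no longer carries the OID that the getnext
search had advanced to.  Below is a minimal standalone sketch of that
idea.  It is not the libagentx code: the struct layout, the function
names, the constant names and the numeric OID arcs are all made up for
illustration, and only the "restore the start OID, then set the type"
step mirrors the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* All names below are hypothetical; they only mimic the shape of the fix. */
#define OID_MAX			16
#define TYPE_ENDOFMIBVIEW	130	/* endOfMibView value from RFC 2741 */

struct oid {
	uint32_t	id[OID_MAX];
	size_t		n;
};

struct varbind {
	struct oid	start;	/* OID as requested by the master agent */
	struct oid	cur;	/* working OID, advanced during the search */
	int		type;
};

/* Set both the requested and the working OID from an array of arcs. */
void
varbind_init(struct varbind *vb, const uint32_t *ids, size_t n)
{
	memset(vb, 0, sizeof(*vb));
	if (n > OID_MAX)
		n = OID_MAX;
	memcpy(vb->start.id, ids, n * sizeof(ids[0]));
	vb->start.n = n;
	vb->cur = vb->start;
}

int
oid_equal(const struct oid *a, const struct oid *b)
{
	return a->n == b->n &&
	    memcmp(a->id, b->id, a->n * sizeof(a->id[0])) == 0;
}

/*
 * Stand-in for a getnext search over an empty table: drop the trailing
 * index and step to a later column, leaving cur different from start.
 */
void
search_advance(struct varbind *vb)
{
	if (vb->cur.n > 1)
		vb->cur.n--;
	vb->cur.id[vb->cur.n - 1] += 4;
}

/* Mirror of the fix: restore the requested OID, then flag endOfMibView. */
void
varbind_endofmibview(struct varbind *vb)
{
	vb->cur = vb->start;
	vb->type = TYPE_ENDOFMIBVIEW;
}
```

Calling search_advance() and then varbind_endofmibview() on a varbind
leaves cur equal to start again.  Without the restore step, cur would
still hold the advanced OID, which is the situation described in the
analysis above.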
