RE: Problems with patch 1633670 - AgentX timeout and disconnect scenarios

ken_farnen Fri, 07 Sep 2012 02:20:18 -0700

I've opened bug 3565004 for this issue (whilst it's related to the previous 
AgentX disconnect issue, it's more of a new bug, or, indeed, a bug in the patch 
for the previous bug).


Lots of useful stuff uploaded into the bug report, including a stripped down 
demo subagent and script that demonstrates the bug pretty well on our systems 
here.

Further investigation indicates this is related to GETNEXT requests with 
multiple requested OIDs, where at least some of the OIDs cause the GETNEXT to 
walk off the end of our MIB (into the next adjacent MIB).  These are the kind 
of requests that the script provided sends, and I can get both 5.7.1 and 
5.8-dev to crash pretty quickly using them.

All assistance gratefully accepted!

I'm happy to work together on a fix, but I've reached the end of my 
understanding of the data-structures that are being manipulated, so, whilst I 
can get a good idea of what is going on, I'm at a loss to work out how to fix 
it.

If there is more useful information I can provide, just ask!  This is rather an 
annoying and troublesome bug!

Ken.

----

From: FARNEN,KEN (Non-A-UnitedKingdom,ex1) 
Sent: 28 August 2012 11:19
To: 'net-snmp-coders@lists.sourceforge.net'
Subject: Problems with patch 1633670 - AgentX timeout and disconnect scenarios

Hi All,

I'm currently trying to chase down a nasty bug in Net-SNMP for my current 
client, and I've pretty much hit the brick-wall of my own understanding of the 
way things are supposed to work, so I'm hoping things may make a little more 
sense to those who know the code better than I.

The Scenario:

We've got an application that registers as an AgentX subagent in order to 
answer queries for a private MIB related to the application's state.

The platform is MontaVista Linux on x86 (specifically, glibc 2.3.3, kernel 
2.6.10-x86 and 2.6.21-x86-64).

We've been experiencing random crashes in the field for some time now, which 
seemed to be load related, and after much tracing and head-scratching, we've 
found the culprit to be snmpd.  Specifically, the problem appears to be that 
under load on our app, the AgentX queries sometimes time out (the application 
prioritises its primary function over SNMP, so sometimes AgentX queries get 
queued up a bit), and the situation where snmpd disconnects the session due to 
time-out is not handled well.  Worse, shutting down our app is very likely to 
kill snmpd if there are requests outstanding at the point of shutdown (quite 
possible if the request load is high).

I've built a test environment that can exercise this bug, so I've been able to 
do some investigation:

5.6.1 and 5.7.1 "stock" builds dump core (segfault) when the AgentX connection 
times out or disconnects.

We've tried the "subagent_free_cache" patch (which is the same as the patch in 
1633670) on both 5.6.1 and 5.7.1 and this results in an infinite loop in the 
following code in "agent/mibgroup/agentx/master_admin.c", function 
"close_agentx_session()":

        if (session->subsession != NULL) {
            netsnmp_session *subsession = session->subsession;
            for(; subsession; subsession = subsession->next) {
                while (netsnmp_remove_delegated_requests_for_session(subsession)) {
                        DEBUGMSGTL(("agentx/master", "Continue removing delegated subsession reqests\n"));

It loops forever on the while, with the return value never decreasing (log 
message and spelling mistake repeated ad infinitum, 100% CPU load for snmpd).

I've also tried the current trunk version, which has the 1633670 patch already 
applied, and get the same behaviour.

After lots of additional debugging, the culprit behaviour appears to be that 
"netsnmp_remove_delegated_requests_for_session()" removes (or, more correctly, 
uses "netsnmp_request_set_error()" on) everything in the agent_delegated_list 
that matches the target session, then calls 
"netsnmp_check_outstanding_agent_requests()", which walks the agent_delegated 
list and de-queues anything that passes "netsnmp_check_for_delegated()".  
However, there appear to be requests in the subsession list that don't match, 
and thus are still marked as delegated, and thus don't pass check_for_delegated 
and... repeat until bored.
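
As a rough illustration of the behaviour described above, here is a toy model 
in C.  It is not the Net-SNMP code: every type and function name below 
(toy_request, toy_agent_session, toy_remove_for_session, toy_drain, toy_demo) 
is hypothetical, and it assumes that the remove function returns the number of 
requests it matched and that an agent session is only de-queued once none of 
its requests is still delegated.  A pass cap stands in for the unbounded while 
loop so the sketch terminates.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_REQS 4

/* Hypothetical stand-ins for illustration; NOT the Net-SNMP structures. */
struct toy_request {
    int owner;      /* id of the subagent session the request was sent to */
    int delegated;  /* 1 while an answer is still awaited */
};

struct toy_agent_session {
    struct toy_request reqs[TOY_MAX_REQS];
    int nreqs;
    int queued;     /* 1 while on the toy "delegated list" */
};

/* One pass: flag every request owned by `sess` as finished, then de-queue
 * the agent session only if none of its requests is still delegated.
 * Returns the number of requests matched in this pass (an assumption). */
int toy_remove_for_session(struct toy_agent_session *asp, int sess)
{
    int i, matched = 0, still_delegated = 0;

    assert(asp != NULL && asp->nreqs <= TOY_MAX_REQS);
    if (!asp->queued)
        return 0;

    for (i = 0; i < asp->nreqs; i++) {
        if (asp->reqs[i].owner == sess) {
            asp->reqs[i].delegated = 0;   /* "set error" on the match */
            matched++;
        }
    }
    for (i = 0; i < asp->nreqs; i++)
        still_delegated += asp->reqs[i].delegated;

    if (still_delegated == 0)
        asp->queued = 0;                  /* nothing pending: de-queue */
    return matched;
}

/* The while loop from the excerpt, with a cap so the model terminates.
 * Returns how many passes were used. */
int toy_drain(struct toy_agent_session *asp, int sess, int max_passes)
{
    int passes = 0;

    while (passes < max_passes && toy_remove_for_session(asp, sess))
        passes++;
    return passes;
}

/* Two delegated requests: the first owned by session 1, the second by
 * `other_owner`.  Drain session 1 and report the passes used. */
int toy_demo(int other_owner, int max_passes)
{
    struct toy_agent_session asp = { { {1, 1}, {other_owner, 1} }, 2, 1 };

    return toy_drain(&asp, 1, max_passes);
}
```

In this model toy_demo(1, 5) uses one pass, because both requests match and 
the agent session is de-queued.  toy_demo(2, 5) uses all five passes: the 
second request never matches, the agent session stays queued, and every pass 
re-matches the same first request, so the return value never drops to zero.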

I've tried making (and using) a more aggressive flavour of 
"netsnmp_remove_delegated_requests_for_session()" that doesn't have the:

            if(request->subtree->session != sess)
                continue;

test, but that doesn't fix it!  Note that "..check_for_delegated()" checks in 
asp->treecache, but "..remove_delegated_requests.." removes the requests from 
[agent_delegated_list]->requests, and it appears in our case the two don't 
quite meet up.

I've tried writing an even more aggressive version of 
"netsnmp_remove_delegated_requests_for_session()" that eats every delegated 
request in the treecache, which, to be fair, stops the infinite loop above, but 
just causes snmpd to go catatonic elsewhere.

...and that's where my understanding of these inter-related data structures 
stops, I'm afraid!

I'm sort of hoping that those that live, eat and breathe this code will have 
some suggestions.

Other info that may help:

My test SNMP query set is a set of SNMP GET and GETNEXTs taken from a customer 
network capture - they all hit the MIB that is delegated to our AgentX 
subagent, however, some of the GETNEXTs walk off the end of our MIB and into 
the next enterprise along (which happens to be the NET-SNMP MIB, in our 
particular case).

Ken Farnen.

Agilent don't authorise me to order paperclips, much less speak on their 
behalf, I'm just a freelance consultant who happens to sit at one of their 
desks at the moment, anything I say is my opinion only, and nothing to do with 
my Client!




