I've opened bug 3565004 for this issue (whilst it's related to the previous AgentX disconnect issue, it's more of a new bug, or, indeed, a bug in the patch for the previous bug).
Lots of useful stuff is uploaded to the bug report, including a stripped-down demo subagent and a script that demonstrates the bug pretty well on our systems here.

Further investigation indicates this is related to GETNEXT requests with multiple requested OIDs, where at least some of the OIDs cause the GETNEXT to walk off the end of our MIB (into the next adjacent MIB). These are the kind of requests that the provided script sends, and I can get both 5.7.1 and 5.8-dev to crash pretty quickly using them.

All assistance gratefully accepted! I'm happy to work together on a fix, but I've reached the end of my understanding of the data structures that are being manipulated, so, whilst I can get a good idea of what is going on, I'm at a loss to work out how to fix it.

If there is more useful information I can provide, just ask! This is rather an annoying and troublesome bug!

Ken.

----
From: FARNEN,KEN (Non-A-UnitedKingdom,ex1)
Sent: 28 August 2012 11:19
To: 'net-snmp-coders@lists.sourceforge.net'
Subject: Problems with patch 1633670 - AgentX timeout and disconnect scenarios

Hi All,

I'm currently trying to chase down a nasty bug in Net-SNMP for my current client, and I've pretty much hit the brick wall of my own understanding of the way things are supposed to work, so I'm hoping things may make a little more sense to those who know the code better than I.

The Scenario:

We've got an application that registers as an AgentX subagent in order to answer queries for a private MIB related to the application's state. Platform is MontaVista Linux on x86 (specifically, glibc 2.3.3, kernel 2.6.10-x86 and 2.6.21-x86-64).

We've been experiencing random crashes in the field for some time now, which seemed to be load-related, and after much tracing and head-scratching, we've found the culprit to be snmpd.
Specifically, the problem appears to be that under load on our app, the AgentX queries sometimes time out (the application prioritises its primary function over SNMP, so sometimes AgentX queries get queued up a bit), and the situation where snmpd disconnects the session due to time-out is not handled well. Worse, shutting down our app is very likely to kill snmpd if there are requests outstanding at the point of shutdown (quite possible if the request load is high).

I've built a test environment that can exercise this bug, so I've been able to do some investigation:

5.6.1 and 5.7.1 "stock" builds dump core (segfault) when the AgentX connection times out or disconnects.

We've tried the "subagent_free_cache" patch (which is the same as the patch in 1633670) on both 5.6.1 and 5.7.1, and this results in an infinite loop in the following code in "agent/mibgroup/agentx/master_admin.c", function "close_agentx_session()":

    if (session->subsession != NULL) {
        netsnmp_session *subsession = session->subsession;
        for(; subsession; subsession = subsession->next) {
            while (netsnmp_remove_delegated_requests_for_session(subsession)) {
                DEBUGMSGTL(("agentx/master", "Continue removing delegated subsession reqests\n"));

It loops forever on the while, with the return value never decreasing (log message and spelling mistake repeated ad infinitum, 100% CPU load for snmpd).

I've also tried the current trunk version, which has the 1633670 patch already applied, and get the same behaviour.

After lots of additional debugging, the culprit behaviour appears to be that "netsnmp_remove_delegated_requests_for_session()" removes (or, more correctly, uses "netsnmp_request_set_error()" on) everything in the agent_delegated_list that matches the target session, then calls "netsnmp_check_outstanding_agent_requests()", which walks the agent_delegated_list and de-queues anything that passes "netsnmp_check_for_delegated()".
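
For anyone who wants to reason about the stall without a full snmpd build, here is a toy C model of the remove/check interplay. It is only a sketch: every name in it (toy_req, toy_remove_for_session, toy_drain, the two "delegated" flags) is hypothetical and none of it is Net-SNMP code or its API. It assumes that failing a request clears one view of the "delegated" state while the de-queue step consults another, and it adds a progress guard so the drain loop gives up instead of spinning. The guard is a diagnostic idea, not a proposed fix.

```c
#include <stddef.h>

/* Toy stand-in for one delegated request; NOT a Net-SNMP structure. */
struct toy_req {
    int session_id;      /* session the request was delegated to */
    int list_delegated;  /* "delegated" flag as seen via the delegated list */
    int cache_delegated; /* "delegated" flag as seen via the tree cache */
    int queued;          /* non-zero while still on the delegated queue */
};

/* Count the entries that are still queued. */
static size_t toy_queued(const struct toy_req *q, size_t n)
{
    size_t i, count = 0;
    for (i = 0; i < n; i++)
        if (q[i].queued)
            count++;
    return count;
}

/*
 * Fail every queued request for session_id (list view only), then
 * de-queue whatever the cache view no longer considers delegated.
 * Returns the number of requests it touched.
 */
static size_t toy_remove_for_session(struct toy_req *q, size_t n,
                                     int session_id)
{
    size_t i, touched = 0;
    for (i = 0; i < n; i++) {
        if (!q[i].queued || q[i].session_id != session_id)
            continue;
        q[i].list_delegated = 0;   /* the "set error" step */
        touched++;
    }
    for (i = 0; i < n; i++)        /* the "check outstanding" step */
        if (q[i].queued && !q[i].cache_delegated)
            q[i].queued = 0;
    return touched;
}

/*
 * Drain loop with a progress guard.
 * Returns 0 once nothing is left for the session, or -1 if a pass
 * touched requests but de-queued none (an unguarded loop would spin).
 */
static int toy_drain(struct toy_req *q, size_t n, int session_id)
{
    for (;;) {
        size_t before = toy_queued(q, n);
        if (toy_remove_for_session(q, n, session_id) == 0)
            return 0;
        if (toy_queued(q, n) == before)
            return -1;
    }
}
```

In this model, a request whose cache_delegated flag has been cleared drains normally and toy_drain() returns 0. A request whose cache_delegated flag stays set is touched on every pass but never leaves the queue, so toy_drain() returns -1 where an unguarded while loop would run forever.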
However, there appear to be requests in the subsession list that don't match, and thus are still marked as delegated, and thus don't pass check_for_delegated and... repeat until bored.

I've tried making (and using) a more aggressive flavour of "netsnmp_remove_delegated_requests_for_session()" that doesn't have the:

    if(request->subtree->session != sess) continue;

test, but that doesn't fix it!

Note that "..check_for_delegated()" checks in asp->treecache, but "..remove_delegated_requests.." removes the requests from [agent_delegated_list]->requests, and it appears that in our case the two don't quite meet up.

I've tried writing an even more aggressive version of "netsnmp_remove_delegated_requests_for_session()" that eats every delegated request in the treecache, which, to be fair, stops the infinite loop above, but just causes snmpd to go catatonic elsewhere.

...and that's where my understanding of these inter-related data structures stops, I'm afraid! I'm sort of hoping that those who live, eat and breathe this code will have some suggestions.

Other info that may help:

My test SNMP query set is a set of SNMP GETs and GETNEXTs taken from a customer network capture. They all hit the MIB that is delegated to our AgentX subagent; however, some of the GETNEXTs walk off the end of our MIB and into the next enterprise along (which happens to be the NET-SNMP MIB, in our particular case).

Ken Farnen.

Agilent don't authorise me to order paperclips, much less speak on their behalf. I'm just a freelance consultant who happens to sit at one of their desks at the moment; anything I say is my opinion only, and nothing to do with my client!
_______________________________________________
Net-snmp-coders mailing list
Net-snmp-coders@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/net-snmp-coders