Luca,

I submitted quite a few comments, but they bounced back awaiting moderator approval. I had some other recommended patches, including an additional purged-host check (which eliminated my segfaults) and an update to the V3 Google Maps API code.
--Brian

________________________________________
From: [email protected] [[email protected]] on behalf of Luca Deri [[email protected]]
Sent: Friday, November 11, 2011 4:43 AM
To: [email protected]
Subject: Re: [Ntop-misc] FW: Easily Reproducable Segfaults

Brian
I thank you for the hint. You are right that there is a problem there. In a nutshell, the correct sequence is:

1. mark hosts for deletion but do NOT free them yet
2. scan all sessions for timeout and free them, including those that have a peer (sender or receiver) who was marked for deletion in step 1
3. delete all marked hosts

I have committed this patch and I'm testing it.

Cheers Luca

On 11/08/2011 09:21 PM, Brian Behrens wrote:
> Additional information as of today...
>
> Correct me if I am wrong here, but it appears that the host pointers are
> stored in the Sessions (mutex???). Most of this debugging is a bit new to
> me, but I am figuring it out as I go along. Anyhow, in the hash.c file:
>
>   /* Now free the entries */
>   for(idx=0; idx<numHosts; idx++) {
> #ifdef IDLE_PURGE_DEBUG
>     traceEvent(CONST_TRACE_INFO, "IDLE_PURGE_DEBUG: Purging host %d [last seen=%d]... %s",
>                idx, theFlaggedHosts[idx]->lastSeen, theFlaggedHosts[idx]->hostResolvedName);
> #endif
>     freeHostInfo(theFlaggedHosts[idx], actDevice);
>     numFreedBuckets++;
>     ntop_conditional_sched_yield(); /* Allow other threads to run */
>   }
>
>   free(theFlaggedHosts);
>
>   if(myGlobals.runningPref.enableSessionHandling)
>     scanTimedoutTCPSessions(actDevice); /* let's check timedout sessions too */
>
> This purges the hosts before running scanTimedoutTCPSessions.
>
> If we look at the scanTimedoutTCPSessions function, we find:
>
>   theSession = myGlobals.device[actualDeviceId].tcpSession[idx];
>
> and a little later...
>
>   freeSession(theSession, actualDeviceId, 1, 0 /* locked by the purge thread */);
>
> Looking at freeSession, we find:
>
>   theHost = sessionToPurge->initiator, theRemHost = sessionToPurge->remotePeer;
>
> (Note: sessionToPurge = theSession passed on)
>
> This host pointer comes from a different location, and it is possible, and I
> have shown, that the memory pointed to by this pointer can be re-used before
> theHost is set. This causes:
>
>   if((theHost != NULL) && (theRemHost != NULL) && allocateMemoryIfNeeded) {
>
> to validate (evaluate true), and causes:
>
>   incrementUsageCounter(&theHost->secHostPkts->closedEmptyTCPConnSent, theRemHost, actualDeviceId);
>
> to segfault.
>
> So, basically, the reference pointer in the sessions storage is not being
> purged when the hosts are. I am trying to work around this by having
> scanIdleTCPSessions run before the hosts are purged, in the hope that the
> sessions get purged before the hosts do, but looking over the code, I think
> the risk still exists that a non-purged session could refer to a purged host.
> I am not sure of the best approach to double-checking the Sessions mutex to
> ensure the host pointer is set to NULL. Also, I think this is what is causing
> the other segfaults as well, but I am not familiar enough with the code to
> know where all the host pointers are stored and potentially referred to
> during execution.
>
> Again, I think the best solution would be to set that pointer to NULL in the
> Sessions mutex when the host is purged, but the how might be very difficult.
>
> --Brian
>
> ________________________________________
> From: Brian Behrens
> Sent: Monday, November 07, 2011 9:49 AM
> To: [email protected]
> Subject: RE: [Ntop-misc] Easily Reproducable Segfaults
>
> Luca,
>
> Here is the code:
>
>   if(sessionToPurge->session_info != NULL)
>     free(sessionToPurge->session_info);
>
> According to gdb, session_info points to the address 0xffffffff, which
> causes a segfault when the free function gets called.
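A common defensive idiom against the double-free pattern suspected above is to clear the pointer at the moment it is freed, so a later purge pass sees NULL instead of a stale address. A minimal sketch, assuming a simplified session structure (only the `session_info` field name comes from the thread; everything else here is illustrative, not ntop's actual code):

```c
#include <stdlib.h>

/* Simplified stand-in for ntop's session structure; only the field
 * discussed in the thread is modeled. */
typedef struct session {
    char *session_info;
} session_t;

/* Free session_info exactly once and leave the field NULL, so that a
 * second purge pass cannot free a stale address again. */
static void release_session_info(session_t *s) {
    if (s->session_info != NULL) {
        free(s->session_info);
        s->session_info = NULL;  /* the step missing in the crashing code */
    }
}
```

With this idiom, calling `release_session_info()` twice is harmless; the crash described above happens precisely because the field keeps whatever stale value was left behind after the first free.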
>
> --Brian
>
> ________________________________________
> From: [email protected] [[email protected]] on behalf of Luca Deri [[email protected]]
> Sent: Monday, November 07, 2011 2:29 AM
> To: [email protected]
> Subject: Re: [Ntop-misc] Easily Reproducable Segfaults
>
> Brian
> I agree with you that there's something wrong with sessions. However,
> sessions.c:343 contains something different from what you reported. Can you
> please send me the source code around line 343 so I can see what you mean?
>
> Thanks Luca
>
> On 11/05/2011 08:09 PM, Brian Behrens wrote:
>> No problem,
>>
>> I did some more work on this and found that line 343 in sessions.c is the
>> culprit. Basically, here is a breakdown of what's happening.
>>
>> That line attempts to free the memory at the address specified by
>> sessionToPurge->session_info. When you dump what the pointer session_info
>> holds, it contains 0xffffffff. Since this is not a NULL value, the code
>> attempts to free the memory at that address, which is out of bounds and
>> causes a segfault.
>>
>> So, in perspective, it's most likely trying to free memory that has already
>> been freed. The question becomes: why does the code think there is still a
>> valid memory address at that pointer? I think I have an idea why that might
>> be. I started watching the session counters, and even though I have
>> specified an upper limit of 65536 sessions, I can see the count does
>> actually get this high. When the count gets that high, it clears and starts
>> over. Now, I have not investigated what actually transpires when this reset
>> occurs, but my guess is that it still thinks there are sessions that need
>> to be purged that have already been purged by the clearing.
>>
>> I have also noticed that once that bound is reached, the count seems to stay
>> around 14k sessions.
>> The ESX server I am running this on has 98 GB of
>> memory, so memory constraints are not really a concern; this might just be
>> a matter of tuning the max sessions high enough that the purge cycle that
>> is supposed to purge these idle sessions can do its job effectively.
>>
>> I would think that this might also be occurring on lower-load networks,
>> since DEFAULT_NTOP_MAX_NUM_SESSIONS is set lower there, and thus the limit
>> might also be reached, triggering the clear routine and the segfault, as
>> the 0xffffffff value is used in various places and could easily be stored
>> in many memory locations.
>>
>> So, I might try to work around this by raising DEFAULT_NTOP_MAX_NUM_SESSIONS
>> to see if that helps. Also, taking a deeper look at what happens when this
>> bound is reached might help me understand and eliminate this.
>>
>> I hope this helps, as I have seen similar postings to this in the threads.
>>
>> --Brian
>> ________________________________________
>> From: [email protected] [[email protected]] on behalf of Luca Deri [[email protected]]
>> Sent: Saturday, November 05, 2011 6:22 AM
>> To: [email protected]
>> Subject: Re: [Ntop-misc] Easily Reproducable Segfaults
>>
>> Brian
>> thanks for your report. I do not have the ability to reproduce the crash
>> you reported using the code in SVN (this is the only version I can
>> support). Can you please crash ntop, generate a core and analyze it a bit
>> so that I can understand where the problem could be? Before doing that,
>> please resync with SVN.
>>
>> Thanks for your support Luca
>>
>> On Nov 4, 2011, at 5:09 PM, Brian Behrens wrote:
>>
>>> Hello,
>>>
>>> I have been working for days trying to resolve a segfault issue like the
>>> following:
>>>
>>>   Nov 4 10:46:54 NTOP-SC kernel: ntop[25479]: segfault at 645 ip
>>>   00007f95f3cf3395 sp 00007f95e9b75ae8 error 6 in
>>>   libntop-4.1.1.so[7f95f3cb9000+56000]
>>>
>>> The environment is an ESX 5 VM.
>>>
>>> Guest OSes I have tried:
>>>
>>> 1. CentOS 6
>>> 2. Fedora 15
>>> 3. Network Security Toolkit (uses revision 4865 of the current dev tree)
>>>
>>> Versions I have tried:
>>>
>>> 1. Current dev tree.
>>> 2. Current stable version (4.1.0)
>>>
>>> The times at which these faults occur vary, but they are related to
>>> network load factors.
>>>
>>> My test networks:
>>>
>>> 1. A simple home network with all packets going to ntop.
>>> 2. A high-load work network that can see 25 GB in 15 minutes.
>>>
>>> The most stable setup I have seen is a clean CentOS install: build ntop
>>> from the trunk tree, install, and run.
>>>
>>> The quickest segfault I can obtain is when I implement PF_RING, use an
>>> e1000 card in the VM, and use the PF_RING-aware e1000 driver. I can
>>> usually get a segfault within 30 minutes on the busy network.
>>>
>>> The common theme is the segfaulting. I did attempt a gdb session on the
>>> device one time and saw a malloc issue, but all these VMs have 4 GB of
>>> memory, and I have tried tuning different hash sizes to see how this
>>> impacts the issue, but it really never does. If I use smaller hash values,
>>> I get more messages about low memory, etc.
>>>
>>> I am really not sure what else to do. If there is anything I can do to
>>> present more information, please let me know, as I would like to stop
>>> this incessant segfaulting.
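The "purged host check" mentioned at the top of the thread (the patch that eliminated the segfaults) presumably validates a host pointer before dereferencing it inside freeSession(). One way to sketch that idea is a magic tag on live hosts, cleared at purge time; everything here, the field names, the constant, and the functions, is an invented illustration under that assumption, not ntop's actual code:

```c
#include <stddef.h>

#define HOST_MAGIC 0x1A2B3C4Du   /* arbitrary tag marking a live host */

typedef struct host {
    unsigned int magic;              /* HOST_MAGIC while the host is live */
    unsigned long closed_conn_sent;  /* stand-in for the secHostPkts counters */
} host_t;

static void host_init(host_t *h)  { h->magic = HOST_MAGIC; h->closed_conn_sent = 0; }
static void host_purge(host_t *h) { h->magic = 0; }  /* invalidate on purge */

/* Returns nonzero only if the pointer still refers to a live host. */
static int host_is_live(const host_t *h) {
    return (h != NULL) && (h->magic == HOST_MAGIC);
}

/* Stand-in for the counter update in freeSession() that crashed: bail out
 * instead of dereferencing a host that has already been purged. */
static int increment_usage_counter(host_t *h) {
    if (!host_is_live(h))
        return 0;   /* peer already purged: skip the update, don't segfault */
    h->closed_conn_sent++;
    return 1;
}
```

Note the limitation: this guard only helps while the freed memory has not been reused by something that happens to contain the tag, so it is a mitigation, not a substitute for the purge-ordering fix Luca describes earlier in the thread.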
>>>
>>> _______________________________________________
>>> Ntop-misc mailing list
>>> [email protected]
>>> http://listgateway.unipi.it/mailman/listinfo/ntop-misc
>>
>> ---
>> We can't solve problems by using the same kind of thinking we used when we
>> created them - Albert Einstein
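The three-step purge sequence Luca settles on in the thread, mark idle hosts, then scan sessions, then free the marked hosts, can be sketched as follows. The structures are deliberately simplified stand-ins for ntop's host and session tables (illustrative names only); the point is the ordering, which guarantees no session still holds a pointer to an already-freed host:

```c
#include <stddef.h>

/* Hypothetical, simplified data model -- the real ntop HostTraffic and
 * session structures are far larger; these names are illustrative only. */
typedef struct host {
    int idle;    /* set when the host has been inactive too long */
    int marked;  /* step 1: marked for deletion, not yet freed    */
} host_t;

typedef struct session {
    host_t *initiator;
    host_t *remote_peer;
    int freed;
} session_t;

/* Step 1: mark idle hosts but do NOT free them yet. */
static void mark_idle_hosts(host_t *hosts, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (hosts[i].idle)
            hosts[i].marked = 1;
}

/* Step 2: free any session whose peer was marked in step 1, so that no
 * session keeps a dangling host pointer (timed-out sessions would be
 * handled here too in the real code). */
static void purge_sessions(session_t *sessions, size_t n) {
    for (size_t i = 0; i < n; i++) {
        session_t *s = &sessions[i];
        if (s->freed)
            continue;
        if ((s->initiator && s->initiator->marked) ||
            (s->remote_peer && s->remote_peer->marked)) {
            s->initiator = s->remote_peer = NULL;  /* drop references */
            s->freed = 1;
        }
    }
}

/* Step 3: only now is it safe to release the marked hosts.
 * Returns the number of hosts released. */
static size_t free_marked_hosts(host_t *hosts, size_t n) {
    size_t freed = 0;
    for (size_t i = 0; i < n; i++)
        if (hosts[i].marked) {
            hosts[i].marked = hosts[i].idle = 0;  /* stand-in for freeHostInfo() */
            freed++;
        }
    return freed;
}
```

The crash described in the thread corresponds to running step 3 before step 2: sessions then dereference host memory that has already been returned to the allocator.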
