Brian
I do appreciate your help. So far everything is still up and running on my 
system, so the situation looks good. As you have seen, the L7 integration is not 
completed yet, but if ntop runs stably, that is already a great improvement.

Cheers Luca

On Nov 11, 2011, at 4:45 PM, Brian Behrens wrote:

> Cool, thanks.
> 
> I had moved the check to before the deletion as well, but had not thought 
> about removing the purgeLimit.   
> 
> The free(l7.traffic) was causing me issues as well, but I see the new fix 
> seems to have corrected it.   
> 
> I really appreciate the help and support. 
> 
> --Brian
> 
> ________________________________________
> From: [email protected] 
> [[email protected]] on behalf of Luca Deri 
> [[email protected]]
> Sent: Friday, November 11, 2011 4:43 AM
> To: [email protected]
> Subject: Re: [Ntop-misc] FW:  Easily Reproducable Segfaults
> 
> Brian
> I thank you for the hint. You are right that there is a problem there.
> In a nutshell, the correct sequence is:
> 1. mark hosts for deletion but do NOT free them yet
> 2. scan all sessions for timeouts and free them, including those that have
> a peer (sender or receiver) that was marked for deletion in step 1.
> 3. delete all marked hosts.
> 
> I have committed this patch and I'm testing it.
> 
> Cheers Luca
> 
> On 11/08/2011 09:21 PM, Brian Behrens wrote:
>> Additional information as of today...
>> 
>> Correct me if I am wrong here, but it appears that the host pointers are 
>> stored in the Sessions (mutex???).    Most of this debugging is a bit new to 
>> me, but I am figuring it out as I go along.   Anyhow,  in the hash.c file:
>> 
>>   /* Now free the entries */
>>   for(idx=0; idx<numHosts; idx++) {
>> #ifdef IDLE_PURGE_DEBUG
>>     traceEvent(CONST_TRACE_INFO, "IDLE_PURGE_DEBUG: Purging host %d [last seen=%d]... %s",
>>         idx, theFlaggedHosts[idx]->lastSeen, theFlaggedHosts[idx]->hostResolvedName);
>> #endif
>>     freeHostInfo(theFlaggedHosts[idx], actDevice);
>>     numFreedBuckets++;
>>     ntop_conditional_sched_yield(); /* Allow other threads to run */
>>   }
>>   free(theFlaggedHosts);
>>   if(myGlobals.runningPref.enableSessionHandling)
>>     scanTimedoutTCPSessions(actDevice); /* let's check timedout sessions too */
>> 
>> 
>> This purges the hosts before running the scanTimedoutTCPSessions.
>> 
>> If we look at the scanTimedoutTCPSessions function, we find:
>> 
>> theSession = myGlobals.device[actualDeviceId].tcpSession[idx];
>> 
>> and a little later....
>> 
>> freeSession(theSession, actualDeviceId, 1, 0 /* locked by the purge thread */);
>> 
>> Looking at freeSession, we find:
>> 
>> theHost = sessionToPurge->initiator, theRemHost = sessionToPurge->remotePeer;
>> 
>> (Note: sessionToPurge = theSession passed on)
>> 
>> This host pointer comes from a different location, and I have shown that 
>> the memory it points to can be re-used before theHost is used.   This causes:
>> 
>> if((theHost != NULL)&&  (theRemHost != NULL)&&  allocateMemoryIfNeeded) {
>> 
>> to evaluate to true,
>> 
>> and causes:
>> 
>> incrementUsageCounter(&theHost->secHostPkts->closedEmptyTCPConnSent, 
>> theRemHost, actualDeviceId);
>> 
>> to segfault.
>> 
>> So, basically, the reference pointer in the sessions storage is not being 
>> cleared when the hosts are purged.  I am trying to work around this by having 
>> scanTimedoutTCPSessions run before the hosts are purged, in the hope that the 
>> sessions get purged before the hosts do, but looking over the code, I 
>> think the risk still exists that a non-purged session could refer to a 
>> purged host.   I am not sure of the best approach to double-checking the 
>> Sessions mutex to ensure the host pointer is set to NULL.   Also, I think 
>> this is what is causing the other segfaults as well, but I am not familiar 
>> enough with the code to know where all the host pointers are stored and 
>> potentially referred to during execution.
>> 
>> Again, I would think the best fix would be to set that pointer to 
>> NULL in the Sessions mutex when the host is purged, but doing so might be 
>> very difficult.
>> 
>> --Brian
>> 
>> ________________________________________
>> From: Brian Behrens
>> Sent: Monday, November 07, 2011 9:49 AM
>> To: [email protected]
>> Subject: RE: [Ntop-misc] Easily Reproducable Segfaults
>> 
>> Luca,
>> 
>> Here is the code:
>> 
>>   if(sessionToPurge->session_info != NULL)
>>     free(sessionToPurge->session_info);
>> 
>> According to gdb, session_info points to the address 0xffffffff, which 
>> causes a segfault when free() is called.
>> 
>> --Brian
>> 
>> ________________________________________
>> From: [email protected] 
>> [[email protected]] on behalf of Luca Deri 
>> [[email protected]]
>> Sent: Monday, November 07, 2011 2:29 AM
>> To: [email protected]
>> Subject: Re: [Ntop-misc] Easily Reproducable Segfaults
>> 
>> Brian
>> I agree with you that there's something wrong with sessions. However,
>> sessions.c:343 contains something different from what you reported. Can
>> you please send me the source code around line 343 so I can see what you
>> mean?
>> 
>> Thanks Luca
>> 
>> On 11/05/2011 08:09 PM, Brian Behrens wrote:
>>> No problem,
>>> 
>>> I did some more work on this and found that line 343 in sessions.c is the 
>>> culprit.   Basically, here is a breakdown of what's happening.
>>> 
>>> That line attempts to free the memory at the address specified by 
>>> sessionToPurge->session_info.  When you dump what session_info 
>>> points to, it contains 0xffffffff.   Since this is not a NULL 
>>> value, it attempts to free the memory at that address, which is out of 
>>> bounds and causes a segfault.
>>> 
>>> So, in perspective, it is most likely trying to free memory that has already 
>>> been freed.   The question becomes: why does the code think there is still 
>>> a valid memory address at that pointer?   I think I have an idea why 
>>> that might be.  I started watching the session counters, and even though I 
>>> have specified an upper limit of 65536 sessions, I can see the count does 
>>> actually get this high.  When the count gets that high, it clears and 
>>> starts over.   Now, I have not investigated what actually transpires 
>>> when this reset occurs, but my guess is that it still thinks there are 
>>> sessions that need to be purged that have already been purged by the 
>>> clearing.
>>> 
>>> I have also noticed that once that bound is reached, the count seems to 
>>> stay around 14k sessions.  The ESX server I am running this on has 98 GB of 
>>> memory, so memory constraints are not really a concern; this might just be 
>>> a matter of tuning the max sessions to tolerate enough sessions so that the 
>>> purge cycle that is supposed to purge these idle sessions can do its job 
>>> effectively.
>>> 
>>> I would think that this might also be occurring on the lower-load networks, 
>>> as DEFAULT_NTOP_MAX_NUM_SESSIONS is set lower there, so the limit might 
>>> also be reached, triggering the clear routine and the segfault, since the 
>>> value 0xffffffff is used in various places and could easily be 
>>> stored in many memory locations.
>>> 
>>> So, I might try to work around this by raising 
>>> DEFAULT_NTOP_MAX_NUM_SESSIONS to see if that helps.   Also, taking a 
>>> deeper look at what happens when this bound is reached might help me 
>>> understand and eliminate this.
>>> 
>>> I hope this helps out some as I have seen similar postings to this in the 
>>> threads.
>>> 
>>> --Brian
>>> ________________________________________
>>> From: [email protected] 
>>> [[email protected]] on behalf of Luca Deri 
>>> [[email protected]]
>>> Sent: Saturday, November 05, 2011 6:22 AM
>>> To: [email protected]
>>> Subject: Re: [Ntop-misc] Easily Reproducable Segfaults
>>> 
>>> Brian
>>> thanks for your report. I do not have the ability to reproduce the crash 
>>> you reported using the code in SVN (this is the only version I can 
>>> support). Can you please crash ntop, generate a core and analyze it a bit 
>>> so that I can understand where the problem could be? Before doing that, 
>>> please resync with SVN.
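A typical sequence for producing and inspecting such a core is sketched below. This is a command transcript, not a script to run verbatim: the core file name varies by distribution (`core`, `core.<pid>`, or wherever `/proc/sys/kernel/core_pattern` points), and the ntop invocation is illustrative.

```shell
# Allow core dumps in this shell, then run ntop until it segfaults.
ulimit -c unlimited
./ntop -i eth0

# Load the binary and the resulting core into gdb and capture a
# full backtrace; the frame/print steps depend on what bt shows.
gdb ./ntop core
# (gdb) bt full
# (gdb) frame <N>
# (gdb) print sessionToPurge->session_info
```

Building ntop with debug symbols (`CFLAGS="-g -O0"`) makes the backtrace far more readable.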
>>> 
>>> Thanks for your support Luca
>>> 
>>> On Nov 4, 2011, at 5:09 PM, Brian Behrens wrote:
>>> 
>>>> Hello,
>>>> 
>>>> I have been working for days trying to resolve a segfault issue like the 
>>>> following:
>>>> 
>>>> Nov  4 10:46:54 NTOP-SC kernel: ntop[25479]: segfault at 645 ip 
>>>> 00007f95f3cf3395 sp 00007f95e9b75ae8 error 6 in 
>>>> libntop-4.1.1.so[7f95f3cb9000+56000]
>>>> 
>>>> The environment is an ESX 5 VM.
>>>> 
>>>> Guest OS I have tried:
>>>> 
>>>> 1. CentOS 6
>>>> 2. Fedora 15
>>>> 3. Network Security Toolkit (uses revision 4865 of the current dev tree)
>>>> 
>>>> Versions I have tried:
>>>> 
>>>> 1. Current dev tree.
>>>> 2. Current stable version (4.1.0)
>>>> 
>>>> The timing of these faults varies, but it correlates with 
>>>> network load.
>>>> 
>>>> My test networks:
>>>> 
>>>> 1. Simple home network with all packets going to NTOP.
>>>> 2. High load work network that can see 25 Gig in 15 mins.
>>>> 
>>>> The most stable configuration I have seen is a clean CentOS install with 
>>>> ntop built from the trunk tree, installed, and run.
>>>> 
>>>> The quickest segfault I can obtain is when I implement PF_RING, use an 
>>>> e1000 card in the VM, and use the PF_RING-aware e1000 driver.   I can 
>>>> usually get a segfault within 30 mins on the busy network.
>>>> 
>>>> The common theme is the segfaulting.  I did attempt gdb on the device 
>>>> once and saw a malloc issue, but all these VMs have 4 GB of memory, and I 
>>>> have tried tuning different hash sizes to see how this impacts the issue, 
>>>> but it really never does.  If I use smaller hash values, I get more 
>>>> messages about low memory, etc.
>>>> 
>>>> I am really not sure what else to do; if there is anything I can do to 
>>>> provide more information, please let me know, as I would like to stop this 
>>>> incessant segfaulting.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Ntop-misc mailing list
>>>> [email protected]
>>>> http://listgateway.unipi.it/mailman/listinfo/ntop-misc
>>> ---
>>> We can't solve problems by using the same kind of thinking we used when we 
>>> created them - Albert Einstein
>>> 
> 

---

"Debugging is twice as hard as writing the code in the first place. Therefore, 
if you write the code as cleverly as possible, you are, by definition, not 
smart enough to debug it. - Brian W. Kernighan
