Luca,

I submitted quite a few comments, but they got bounced back awaiting 
moderator approval.  I also had some other recommended patches, including an 
additional purged-host check (which eliminated my segfaults) and an update to 
the V3 Google Maps API code.

--Brian

________________________________________
From: [email protected] 
[[email protected]] on behalf of Luca Deri [[email protected]]
Sent: Friday, November 11, 2011 4:43 AM
To: [email protected]
Subject: Re: [Ntop-misc] FW:  Easily Reproducable Segfaults

Brian
I thank you for the hint. You are right that there is a problem there.
In a nutshell the correct sequence is
1. mark hosts for deletion but do NOT free them yet
2. scan all sessions for timeouts and free them, including those that have
a peer (sender or receiver) that was marked for deletion in step 1.
3. delete all marked hosts.

I have committed this patch and I'm testing it.

Cheers Luca

On 11/08/2011 09:21 PM, Brian Behrens wrote:
> Additional information as of today...
>
> Correct me if I am wrong here, but it appears that the host pointers are 
> stored in the Sessions (mutex???).    Most of this debugging is a bit new to 
> me, but I am figuring it out as I go along.   Anyhow,  in the hash.c file:
>
>    /* Now free the entries */
>    for(idx=0; idx<numHosts; idx++) {
> #ifdef IDLE_PURGE_DEBUG
>      traceEvent(CONST_TRACE_INFO, "IDLE_PURGE_DEBUG: Purging host %d [last 
> seen=%d]... %s",
>          idx, theFlaggedHosts[idx]->lastSeen, 
> theFlaggedHosts[idx]->hostResolvedName);
> #endif
>      freeHostInfo(theFlaggedHosts[idx], actDevice);
>      numFreedBuckets++;
>      ntop_conditional_sched_yield(); /* Allow other threads to run */
>    }
>    free(theFlaggedHosts);
>    if(myGlobals.runningPref.enableSessionHandling)
>      scanTimedoutTCPSessions(actDevice); /* let's check timedout sessions too 
> */
>
>
> This purges the hosts before running the scanTimedoutTCPSessions.
>
> If we look at the scanTimedoutTCPSessions function, we find:
>
> theSession = myGlobals.device[actualDeviceId].tcpSession[idx];
>
> and a little later....
>
> freeSession(theSession, actualDeviceId, 1, 0 /* locked by the purge thread 
> */);
>
> Looking at freeSession, we find:
>
> theHost = sessionToPurge->initiator, theRemHost = sessionToPurge->remotePeer;
>
> (Note: sessionToPurge = theSession passed on)
>
> This host pointer comes from a different location, and it is possible (and I 
> have shown) that the memory it points to can be re-used before theHost is 
> set.   This causes:
>
> if((theHost != NULL)&&  (theRemHost != NULL)&&  allocateMemoryIfNeeded) {
>
> to evaluate as true (the stale pointer is still non-NULL), and then causes:
>
> incrementUsageCounter(&theHost->secHostPkts->closedEmptyTCPConnSent, 
> theRemHost, actualDeviceId);
>
> to segfault.
>
> So, basically, the reference pointer in the sessions storage is not being 
> purged when the hosts are.  I am trying to work around this by having 
> scanTimedoutTCPSessions run before the hosts are purged, in the hope that 
> the sessions get purged before the hosts do, but looking over the code, I 
> think the risk still exists that a non-purged session could refer to a 
> purged host.   I am not sure of the best approach to double-checking the 
> Sessions mutex to ensure the host pointer is set to NULL.   Also, I think 
> this is what is causing the other segfaults as well, but I am not familiar 
> enough with the code to know where all the host pointers are stored and 
> potentially referred to during execution.
>
> Again, I think the ideal fix would be to have that pointer set to NULL in 
> the Sessions mutex when the host is purged, but the how might be very 
> difficult.
>
> --Brian
>
> ________________________________________
> From: Brian Behrens
> Sent: Monday, November 07, 2011 9:49 AM
> To: [email protected]
> Subject: RE: [Ntop-misc] Easily Reproducable Segfaults
>
> Luca,
>
> Here is the code:
>
>    if(sessionToPurge->session_info != NULL)
>      free(sessionToPurge->session_info);
>
> According to gdb, session_info points to the address 0xffffffff, which 
> causes a segfault when free() gets called.
>
> --Brian
>
> ________________________________________
> From: [email protected] 
> [[email protected]] on behalf of Luca Deri 
> [[email protected]]
> Sent: Monday, November 07, 2011 2:29 AM
> To: [email protected]
> Subject: Re: [Ntop-misc] Easily Reproducable Segfaults
>
> Brian
> I agree with you that there's something wrong with sessions. However
> sessions.c:343 contains something different from what you reported. Can
> you please send me the source code round line 343 so I can see what you
> mean?
>
> Thanks Luca
>
> On 11/05/2011 08:09 PM, Brian Behrens wrote:
>> No problem,
>>
>> I did some more work on this and found that line 343 in sessions.c is the 
>> culprit.   Basically, here is a breakdown of what's happening.
>>
>> That line attempts to free the memory pointed to by 
>> sessionToPurge->session_info.  When you dump the pointer, session_info 
>> contains 0xffffffff.   Since this is not a NULL value, free() is called on 
>> that out-of-bounds address, which causes a segfault.
>>
>> So, in perspective, it's most likely trying to free memory that has already 
>> been freed.   The question becomes: why does the code think there is still 
>> a valid memory address at that pointer?   I think I have an idea why.  I 
>> started watching the session counters, and even though I have specified an 
>> upper limit of 65536 sessions, I can see the count actually does get this 
>> high.  When the count reaches that limit, it clears and starts over.   I 
>> have not yet investigated what actually transpires when this reset occurs, 
>> but my guess is that it still thinks there are sessions that need to be 
>> purged that have already been purged by the clearing.
>>
>> I have also noticed that once that bound is reached, the count seems to 
>> stay around 14k sessions.  The ESX server I am running this on has 98 GB 
>> of memory, so memory constraints are not really a concern; this might just 
>> be a matter of tuning the max sessions high enough that the purge cycle, 
>> which is supposed to remove these idle sessions, can do its job 
>> effectively.
>>
>> I would think this might also be occurring on lower-load networks, since 
>> DEFAULT_NTOP_MAX_NUM_SESSIONS is set lower there, so the limit might also 
>> be reached, triggering the clear routine and the segfault, as the value 
>> 0xffffffff is used in various places and could easily be stored in many 
>> memory locations.
>>
>> So, I might try to work around this by raising 
>> DEFAULT_NTOP_MAX_NUM_SESSIONS to see if that helps.   Also, taking a 
>> deeper look at what happens when this bound is reached might help me 
>> understand and eliminate this.
>>
>> I hope this helps out some as I have seen similar postings to this in the 
>> threads.
>>
>> --Brian
>> ________________________________________
>> From: [email protected] 
>> [[email protected]] on behalf of Luca Deri 
>> [[email protected]]
>> Sent: Saturday, November 05, 2011 6:22 AM
>> To: [email protected]
>> Subject: Re: [Ntop-misc] Easily Reproducable Segfaults
>>
>> Brian
>> thanks for your report. I do not have the ability to reproduce the crash you 
>> reported using the code in SVN (this is the only version I can support). Can 
>> you please crash ntop, generate a core and analyze it a bit so that I can 
>> understand where the problem could be? Before doing that, please resync with 
>> SVN.
>>
>> Thanks for your support Luca
>>
>> On Nov 4, 2011, at 5:09 PM, Brian Behrens wrote:
>>
>>> Hello,
>>>
>>> I have been working for days trying to resolve a segfault issue like the 
>>> following:
>>>
>>> Nov  4 10:46:54 NTOP-SC kernel: ntop[25479]: segfault at 645 ip 
>>> 00007f95f3cf3395 sp 00007f95e9b75ae8 error 6 in 
>>> libntop-4.1.1.so[7f95f3cb9000+56000]
>>>
>>> The environment is an ESX 5 VM.
>>>
>>> Guest OS I have tried:
>>>
>>> 1. CentOS 6
>>> 2. Fedora 15
>>> 3. Network Security Toolkit (uses SVN revision 4865 of the current dev tree)
>>>
>>> Versions I have tried:
>>>
>>> 1. Current dev tree.
>>> 2. Current stable version (4.1.0)
>>>
>>> The timing of these faults varies, but it correlates with network load.
>>>
>>> My test networks:
>>>
>>> 1. Simple home network with all packets going to NTOP.
>>> 2. High load work network that can see 25 Gig in 15 mins.
>>>
>>> The most stable configuration I have seen is a clean CentOS install, with 
>>> ntop built from the trunk tree, installed, and run.
>>>
>>> The quickest segfault I can obtain is when I implement PF_RING, use an 
>>> e1000 card in the VM, and use the PF_RING-aware e1000 driver.   I can 
>>> usually get a segfault within 30 mins on the busy network.
>>>
>>> The common theme is the segfaulting.  I did run gdb on the host once and 
>>> saw a malloc issue.  All these VMs have 4 GB of memory, and I have tried 
>>> tuning different hash sizes to see how that impacts the issue, but it 
>>> really never does.  With smaller hash values I just get more low-memory 
>>> messages, etc.
>>>
>>> I am really not sure what else to do.  If there is anything I can do to 
>>> provide more information, please let me know, as I would like to stop 
>>> this incessant segfaulting.
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Ntop-misc mailing list
>>> [email protected]
>>> http://listgateway.unipi.it/mailman/listinfo/ntop-misc
>> ---
>> We can't solve problems by using the same kind of thinking we used when we 
>> created them - Albert Einstein
>>
