-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/4071/#review13495
-----------------------------------------------------------

Ship it!


Excellent work tracking this down. *shakes fist* stupid assumptions!!!

- Mark Michelson


On Oct. 10, 2014, 6:49 p.m., Jonathan Rose wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviewboard.asterisk.org/r/4071/
> -----------------------------------------------------------
> 
> (Updated Oct. 10, 2014, 6:49 p.m.)
> 
> 
> Review request for Asterisk Developers, Matt Jordan and Mark Michelson.
> 
> 
> Bugs: ASTERISK-24321
>     https://issues.asterisk.org/jira/browse/ASTERISK-24321
> 
> 
> Repository: Asterisk
> 
> 
> Description
> -------
> 
> This issue was discovered through a rather complicated series of tests by PQ. 
> It is somewhat intermittent because it relies on hitting the same race 
> conditions that were being solved by r422070. When this problem hits the 
> __sip_ack function in chan_sip in particular, things go south quickly.
> 
> The gist of it is that when we attempt to remove an existing task, we mark it 
> for deletion and it is later removed from the scheduler. The deleted entry 
> doesn't get freed because the scheduler caches task structures so that we 
> don't waste effort reallocating them every time a task is created or torn 
> down. When we wanted to remove a task that was currently executing, we 
> couldn't do so immediately, so we would apply a deleted flag. Unfortunately, 
> we didn't bother to clear the deleted flag when pulling the entry back off 
> the cache to create a new task. Oops. In any event, shenanigans ensued 
> because the new task would be created already doomed: while it would be 
> reported as successfully scheduled, ast_sched_runq would immediately delete 
> it without replacing it. The chan_sip __sip_ack function was anticipating 
> that the tasks would stick around until it deleted them.
> 
> The fix is mind-numbingly simple for how long it took me to figure out 
> what the heck was going on... just remember to clear the deleted flag from 
> scheduler entries when pulling them off the cache.
> 
> 
> Diffs
> -----
> 
>   /branches/12/main/sched.c 425241 
> 
> Diff: https://reviewboard.asterisk.org/r/4071/diff/
> 
> 
> Testing
> -------
> 
> We had a series of tests that, pre-patch, would yield an assertion looking 
> like the following:
> 
> [Oct 10 13:12:57] ERROR[18046][C-0000000e]: chan_sip.c:4428 __sip_ack: 
> FRACK!, Failed assertion s != NULL, id=1570 (0)
> Got 14 backtrace records
> #0: [0x823d4fa] asterisk(__ast_assert_failed+0x7b) [0x823d4fa]
> #1: [0x81fe913] asterisk(_ast_sched_del+0x2b3) [0x81fe913]
> #2: [0xfeb248] /usr/lib/asterisk/modules/chan_sip.so [0xfeb248]
> #3: [0x106c930] /usr/lib/asterisk/modules/chan_sip.so [0x106c930]
> #4: [0x106d1e9] /usr/lib/asterisk/modules/chan_sip.so [0x106d1e9]
> #5: [0x106cdb4] /usr/lib/asterisk/modules/chan_sip.so [0x106cdb4]
> #6: [0x8171359] asterisk(ast_io_wait+0x14d) [0x8171359]
> #7: [0x106f0f8] /usr/lib/asterisk/modules/chan_sip.so [0x106f0f8]
> #8: [0x82399d5] asterisk [0x82399d5]
> 
> and an infinite loop in the do_monitor thread, coupled with this set of log 
> messages, indicating that we were expecting to find a scheduler entry that 
> wasn't in the scheduler.
> 
> Similar assertions occurred from other modules that involved schedulers but 
> not chan_sip, but those didn't clearly break the world.
> 
> Typically a walk through the specified tests (which involved lots of chan_sip 
> calls entering queues and lots of reloading of chan_sip and app_queue) would 
> cause this breakdown to occur within one or two walks through the suggested 
> test series. After performing the full set of tests 5 times each across 
> Asterisk on two separate occasions without seeing any assertions of this type 
> and without having chan_sip break down, I retested without the patch and 
> quickly ran into the problem again. I think it's safe to say that I got it.
> 
> 
> Thanks,
> 
> Jonathan Rose
> 
>
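
The failure mode described in the quoted review (a cached scheduler entry 
coming back with its deleted flag still set) can be illustrated with a small 
standalone sketch. This is not the code from main/sched.c: the struct layout 
and the names task_alloc, task_release, reused_task_is_doomed and clear_flag 
are all hypothetical and exist only for illustration. The clear_flag 
parameter stands in for the presence or absence of the fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified scheduler entry; NOT the real struct in sched.c. */
struct task {
    int id;
    bool deleted;        /* set when removal is requested while the task runs */
    struct task *next;   /* link inside the cache of released entries */
};

/* Hypothetical scheduler context holding only the entry cache. */
struct sched {
    struct task *cache;  /* released entries kept around for reuse */
    int next_id;
};

/* Return an entry to the cache instead of freeing it. */
static void task_release(struct sched *s, struct task *t)
{
    t->next = s->cache;
    s->cache = t;
}

/* Get an entry, preferring the cache over a fresh allocation.
 * clear_flag == true models the fix: reset the stale deleted flag. */
static struct task *task_alloc(struct sched *s, bool clear_flag)
{
    struct task *t = s->cache;

    if (t) {
        s->cache = t->next;
        t->next = NULL;
        if (clear_flag) {
            t->deleted = false;
        }
    } else {
        t = calloc(1, sizeof(*t));
        if (!t) {
            return NULL;
        }
    }
    t->id = s->next_id++;
    return t;
}

/* Replay the sequence from the description: a task is flagged for deletion
 * while executing, its entry is recycled, and a new task reuses that entry.
 * Returns true if the new task starts out already marked deleted. */
static bool reused_task_is_doomed(bool clear_flag)
{
    struct sched s = { .cache = NULL, .next_id = 1 };
    struct task *first = task_alloc(&s, clear_flag);
    struct task *second;
    bool doomed;

    assert(first != NULL);
    first->deleted = true;     /* deletion requested while the task runs */
    task_release(&s, first);   /* entry goes back to the cache, flag intact */

    second = task_alloc(&s, clear_flag);  /* reuses the cached entry */
    doomed = second->deleted;
    free(second);
    return doomed;
}
```

In this sketch, reused_task_is_doomed(false) reports that the recycled entry 
is born with the deleted flag set, while reused_task_is_doomed(true) reports 
a clean entry, which mirrors the one-line nature of the fix described above.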
