----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviewboard.asterisk.org/r/4071/#review13495 -----------------------------------------------------------
Ship it! Excellent work tracking this down. *shakes fist* stupid assumptions!!! - Mark Michelson On Oct. 10, 2014, 6:49 p.m., Jonathan Rose wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviewboard.asterisk.org/r/4071/ > ----------------------------------------------------------- > > (Updated Oct. 10, 2014, 6:49 p.m.) > > > Review request for Asterisk Developers, Matt Jordan and Mark Michelson. > > > Bugs: ASTERISK-24321 > https://issues.asterisk.org/jira/browse/ASTERISK-24321 > > > Repository: Asterisk > > > Description > ------- > > This issue was discovered by a rather complicated series of tests by PQ and > it's somewhat intermittent relying on hitting the same race conditions that > were being solved by r422070. When this problem hits the __sip_ack method in > chan_sip in particular, things go south quickly. > > The gist of it is that when we attempt to remove an existing task, we mark it > for deletion and it is later removed from the scheduler. The deleted entry > doesn't get free'd on account of the scheduler caching task structures so > that we don't waste a bunch of effort reallocating all of the task structures > every time a task needs to be created/torn down. When we wanted to remove a > task that was currently executing, we couldn't do this immediately so we > would apply a deleted flag. Unfortunately we didn't bother to clear the > deleted flag when pulling it back off of the cache to create a new task. > Oops. In any event, shenanigans ensued because the new task would be created > already doomed and while they would be reported as successfully scheduled, > ast_sched_runq would immediately delete the new task without replacing it the > chan_sip __sip_ack function was anticipating that the tasks would stick > around until it deleted them. > > The fix is mind-numbingly simple for how long it took to me to figure out > what the heck was going on... just remember to clear the deleted flag from > scheduler entries when pulling them off the cache. > > > Diffs > ----- > > /branches/12/main/sched.c 425241 > > Diff: https://reviewboard.asterisk.org/r/4071/diff/ > > > Testing > ------- > > We had a series of tests that, pre-patch would yield an assertion looking > like the following: > > [Oct 10 13:12:57] ERROR[18046][C-0000000e]: chan_sip.c:4428 __sip_ack: > FRACK!, Failed assertion s != NULL, id=1570 (0) > Got 14 backtrace records > #0: [0x823d4fa] asterisk(__ast_assert_failed+0x7b) [0x823d4fa] > #1: [0x81fe913] asterisk(_ast_sched_del+0x2b3) [0x81fe913] > #2: [0xfeb248] /usr/lib/asterisk/modules/chan_sip.so [0xfeb248] > #3: [0x106c930] /usr/lib/asterisk/modules/chan_sip.so [0x106c930] > #4: [0x106d1e9] /usr/lib/asterisk/modules/chan_sip.so [0x106d1e9] > #5: [0x106cdb4] /usr/lib/asterisk/modules/chan_sip.so [0x106cdb4] > #6: [0x8171359] asterisk(ast_io_wait+0x14d) [0x8171359] > #7: [0x106f0f8] /usr/lib/asterisk/modules/chan_sip.so [0x106f0f8] > #8: [0x82399d5] asterisk [0x82399d5]d an infinite loop in the do_monitor > thread couple with this set of log messages: > > indicating that we were anticipating to find a scheduler entry that wasn't in > the scheduler > > Similar assertions occurred from other modules that involved schedulers but > no chan_sip, but those didn't clearly break the world. > > Typically a walk through the specified tests (which involved lots of chan_sip > calls entering queues and lots of reloading of chan_sip and app_queue) would > cause this breakdown to occur within one or two walks across the test series > suggested. After performing the full set of tests 5 times each across > Asterisk on two separate occasions without seeing any assertions of this type > and without having chan_sip break down, I retested without the patch and > quickly ran into the problem again. I think it's safe to say that I got it. > > > Thanks, > > Jonathan Rose > >
-- _____________________________________________________________________ -- Bandwidth and Colocation Provided by http://www.api-digital.com -- asterisk-dev mailing list To UNSUBSCRIBE or update options visit: http://lists.digium.com/mailman/listinfo/asterisk-dev
