-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3930/
-----------------------------------------------------------

(Updated Aug. 26, 2014, 10:44 p.m.)


Review request for Asterisk Developers.


Changes
-------

This addresses problems found when running the PJSIP tests in the testsuite.

The changes here are detailed further in the "EDIT" portion of the description.


Bugs: ASTERISK-24212
    https://issues.asterisk.org/jira/browse/ASTERISK-24212


Repository: Asterisk


Description (updated)
-------

In /r/3927, I introduced a changeset that fixes a scheduler race condition. I 
noted on that issue that even though I was no longer seeing the same crashes 
when running tests, I was still seeing occasional test failures due to some 
media-related race condition.

I have determined the cause of this race condition. PJSIP's inv_session API 
provides two callbacks, on_state_changed() and on_tsx_state_changed(). Both of 
these are called into when the state of a transaction changes, and the 
documentation is not particularly clear about what the difference is between 
these. Through some investigation, I've found that what happens is:

1) Transaction layer detects a state change (such as a request/response being 
sent/received, or a timeout)
2) Transaction layer informs the inv_session layer.
3) inv_session layer sets the new state and calls the on_state_changed() 
callback.
4) inv_session layer performs its processing of the transaction state change 
(including potential calls to the on_media_update() callback).
5) inv_session layer calls the on_tsx_state_changed() callback.

res_pjsip_session is hooked into step (3) such that when a new SIP request or 
response is sent or received, we call into session supplements to let them 
react to the new message. One of these session supplements is in chan_pjsip and 
queues an AST_CONTROL_ANSWER frame when a 200 OK response is received.

In the test where I am seeing the race condition occur, a call is originated to 
a PJSIP channel (Alice). Once Alice answers, her channel is placed into the 
dialplan to execute the TalkDetect() application. The problem here is that at 
step (3), the AST_CONTROL_ANSWER frame is queued onto Alice's channel, and the 
PBX thread moves the channel into the TalkDetect() application before step (4) 
is executed. Step (4) is important for setting formats on Alice's channel and 
setting appropriate translation paths as well. Since step (4) has not been 
completed yet, the attempt to play a file as part of TalkDetect() fails, and 
therefore the test fails. In most runs, however, step (4) does get completed 
before the PBX thread moves Alice's channel into TalkDetect().

The fix presented here is pretty simple. For transaction state changes, we are 
now hooked into step (5) instead of step (3) to call session supplements. This 
means that we do not call into session supplements until media has been 
negotiated on the channel. This eliminates the race condition entirely.


***EDIT 26 Aug, 2014***
After running all PJSIP tests in the testsuite, I found that things weren't 
quite so cut and dry as they were initially presented here. Here are some 
additional findings:
* For incoming 3XX responses, it's actually at step (2) that we get told about 
the redirection. There are some session supplements that need to be called back 
at this time.
* Not all session supplements can safely wait until after media handling to be 
called into. For instance, the NAT handling code that rewrites contacts needs 
to be called into before media handling on an inbound 200 OK.
* In something pretty much completely unrelated, I found that there was an 
undiscovered bug in res_pjsip_session.c that prevented incoming reinvites from 
reaching session supplements.

With regards to the first two bullet points, I have added a "response_priority" 
field to session supplements. This allows for a session supplement to tell 
res_pjsip_session at  what point during response handling it should be called 
into. All but two session supplements are called into at the same point that 
they were prior to writing this patch. However, two supplements now specify 
when they should be called into:
1) res_pjsip_diversion's supplement specifies to be called into before 
processing redirections.
2) One of chan_pjsip's session supplements specifies to be called into after 
media negotiation.

I discovered the bug in the third bullet by accidentally fixing it. Because of 
this change, there are some session supplements in chan_pjsip that now will 
check INVITE state and stop processing if dealing with a reinvite.


Diffs (updated)
-----

  /branches/12/res/res_pjsip_session.c 421883 
  /branches/12/res/res_pjsip_diversion.c 421883 
  /branches/12/include/asterisk/res_pjsip_session.h 421883 
  /branches/12/channels/chan_pjsip.c 421883 

Diff: https://reviewboard.asterisk.org/r/3930/diff/


Testing
-------

I ran the 
tests/channels/pjsip/basic_calls/two_parties/nominal/alice_initiated/alice_hangs_up
 test in a loop on my console both before and after applying this patch. Before 
applying the patch, I would usually encounter a test failure within an hour or 
two. After applying this patch, I left the test running in a loop for over 24 
hours and never had a test failure.


Thanks,

Mark Michelson

-- 
_____________________________________________________________________
-- Bandwidth and Colocation Provided by http://www.api-digital.com --

asterisk-dev mailing list
To UNSUBSCRIBE or update options visit:
   http://lists.digium.com/mailman/listinfo/asterisk-dev

Reply via email to