Re: [asterisk-dev] Bridges, T.38, and other good times

Mark Michelson Mon, 07 Dec 2015 10:00:52 -0800

On 12/06/2015 07:57 PM, Matthew Jordan wrote:

Hello all -
One of the efforts that a number of developers in the community here at Digium have been working on is cleaning up test failures exposed by Jenkins [1]. One of these, in particular, has been rather difficult to resolve - namely, fax/pjsip/directmedia_reinvite_t38 [2]. This e-mail goes over what has been accomplished, and asks some questions on how we might try and fix Asterisk under this scenario.
The directmedia_reinvite_t38 test attempts to do the following:
(1) UAC1 calls UAC2 through Asterisk, with audio as the media. The dial is performed using the 'g' flag, such that UAC2 will continue on if UAC1 hangs up.
(2) UAC1 and UAC2 are configured for direct media. Asterisk sends a re-INVITE to UAC1 and UAC2 to initiate direct media.
(3) After responding with a 200 OK to the direct media requests, UAC1 sends a re-INVITE offering T.38.
(4) Asterisk sends an INVITE with T.38 to UAC2.
(5) UAC2 sends back a 200 OK for T.38; Asterisk sends that to UAC1. Asterisk switches out of a direct media bridge to a core bridge.
(6) UAC1 hangs up. Asterisk sends a re-INVITE to UAC2 for audio back to Asterisk. UAC2 responds with a 200 OK for the audio.
(7) Asterisk ejects UAC2 back to the dialplan.
It's important to note that this test never should have passed - an update to the test suite "fixed" the test erroneously passing, which led to us investigating why the scenario was failing. This test was copied over from an identical chan_sip test, which passes.
The PJSIP stack has two issues which make life difficult for it in this scenario:
(1) The T.38 logic is implemented in res_pjsip_t38. While that is _mostly_ a very good thing - as it keeps all the fax state logic outside of the channel driver - we are also a layer removed from interactions that occur in the channel driver. That makes it challenging to influence direct media checks and other Asterisk/channel interactions.
(2) Being very asynchronous, requests may be serviced that influence T.38 state while other interactions are occurring in the core. Informing the core of what has occurred can have more race conditions than what occurs in chan_sip, which is single threaded.
The first bug discovered when the test was investigated was an issue in step (2). We never actually initiated a direct media re-INVITE. This was due to res_pjsip_t38 using a frame hook, and not implementing the .consume_cb callback. That callback allows a framehook to inform the core (and also the bridging framework) of the types of frames that a framehook wants to consume. If a framehook needs audio, a direct media bridge will be explicitly denied, and - by default - the bridging framework assumes that framehooks will want all frames.
Another bug that was discovered occurred in step (6). When UAC1 sends a BYE request, nothing informed UAC2 that the fax had ended - instead, it was merely ejected from the bridge. This meant that it kept its T.38 session going, and Asterisk never sent a re-INVITE to UAC2. Both of these bugs were fixed by 726ee873a6.
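To make the first bug concrete, here is a minimal sketch of the "no consume callback means the hook wants everything" rule. The types and function names (struct hook, hook_wants, direct_media_allowed, t38_consume) are hypothetical stand-ins, not the real Asterisk framehook API, and the assumption that a T.38 hook only needs control frames is mine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real Asterisk types. */
enum frame_type { FRAME_VOICE, FRAME_CONTROL, FRAME_MODEM };

struct hook {
    /* Returns true if the hook wants frames of this type; may be NULL. */
    bool (*consume)(void *data, enum frame_type type);
    void *data;
};

/* A hook with no consume callback is assumed to want every frame type. */
static bool hook_wants(const struct hook *h, enum frame_type type)
{
    assert(h != NULL);
    if (h->consume == NULL) {
        return true;
    }
    return h->consume(h->data, type);
}

/* Direct media is only allowed if no hook on the channel wants audio. */
static bool direct_media_allowed(const struct hook *hooks, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (hook_wants(&hooks[i], FRAME_VOICE)) {
            return false;
        }
    }
    return true;
}

/* Assumed behaviour of a T.38-style hook: it only needs control frames. */
static bool t38_consume(void *data, enum frame_type type)
{
    (void) data;
    return type == FRAME_CONTROL;
}
```

With this model, a hook that leaves consume unset blocks direct media, while one that installs t38_consume does not.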
Except, unfortunately, the second bug wasn't really fixed.
726ee873a6 did the "right" thing by intercepting the BYE request sent by UAC1, and queueing up a control frame of type AST_CONTROL_T38_PARAMETERS with a new state of AST_T38_TERMINATED. This is supposed to be passed on to UAC2, informing it that the T.38 fax has ended, and that it should have its media re-negotiated back to the last known state (audio) but also back to Asterisk (since we aren't going to be in a bridge any longer). Unfortunately, this code was insufficient.
A race condition exists in this case. On the one hand, we've just queued up a frame on UAC1's channel to be passed into the bridge, which should get tossed onto UAC2's channel. On the other hand, we've just told the bridging framework to kill UAC1's channel with extreme prejudice, thereby also terminating the bridge and ejecting UAC2 off into the dialplan. In the first case, this is an asynchronous, message passing mechanism; in the second case, the bridging framework inspects the channel to see if it should be hung up on *every frame* and *immediately* starts the hangup/shutdown procedure if it knows the channel should die. This is not asynchronous in any way. As a result, UAC1 may be hung up and the bridge dissolved before UAC2 ever gets its control frame from UAC1.
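The losing side of that race can be modelled in a few lines. This is a single-threaded toy, assuming a hypothetical struct chan with a small frame queue and a hangup flag; none of it is real Asterisk code. It only shows why a frame queued just before the hangup flag is set never reaches the peer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_MAX 8

/* Hypothetical, simplified channel: a frame queue plus a hangup flag. */
struct chan {
    int queue[QUEUE_MAX];
    size_t head, tail;
    bool hangup_requested;
    int delivered;   /* frames this channel has received from its peer */
};

static void chan_queue(struct chan *c, int frame)
{
    assert(c != NULL && c->tail < QUEUE_MAX);
    c->queue[c->tail++] = frame;
}

/*
 * One pass of a bridge loop for src. The hangup flag is checked before
 * every frame is read, so anything still queued once the flag is seen
 * is never forwarded. Returns the number of frames forwarded to dst.
 */
static int bridge_service(struct chan *src, struct chan *dst)
{
    int forwarded = 0;

    while (src->head < src->tail) {
        if (src->hangup_requested) {
            break;  /* bridge dissolves; queued frames are dropped */
        }
        src->head++;
        dst->delivered++;
        forwarded++;
    }
    return forwarded;
}
```

If the BYE handler queues the T.38 termination frame and then flags the hangup before the bridge thread runs, bridge_service() forwards nothing; if the bridge thread happens to run in between, the frame gets through. That ordering dependence is the race.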
There were a couple of solutions to this problem that were tried:
(1) First, I tried to make sure that enqueued control frames were flushed out of a channel and passed over the bridge when a hangup was detected. In practice, this was incredibly cumbersome - some control frames should get tossed, others need to be preserved. What was worse was the sheer number of places the bridge dissolution can be triggered. While it wasn't hard to make sure we flushed frames off an ejected channel into a bridge, it was nigh impossible to ensure that this occurred every single time before the other channels were ejected. Again, the bridging framework is ridiculously - perhaps ludicrously - aggressive in tossing channels out of a bridge once it has decided the bridge should be dissolved.
(2) Second, I tried to make the bridge ejection process asynchronous. This was done by enqueuing another control frame onto the channel being ejected; when it leaves, it flushes its control frames into the bridge. When the 'ejection' control frame gets passed into the bridging core, that causes the bridge to dissolve. This worked well in some scenarios, and it also guaranteed that the T.38 control frame would be delivered. Unfortunately, in other cases, it caused all of the channels to hang out in the bridge ... permanently. Again, there are a lot of edge cases in the bridging code that deal with channels being kicked out of a bridge, and the bridge dissolving... and it was more than I could chew on.
The long and short of it is: while Asterisk 12+ has a nice bridging framework that hides or eliminates a lot of the horrendous masquerade/transfer code, as well as the 'triple infinite loop' in features/channel that existed in Asterisk 11-, it is still ridiculously complex and prone to breaking spectacularly in subtle ways. Not to mention both (1) and (2) end up being massive changes to the design that are risky in an LTS (no one likes it when a channel can't be hung up).
So those ideas were scratched.
The next solution was to try a bridge mixing technology that specifically managed the T.38 state. This worked ... really well. Incredibly well, in fact. It avoided all of the previous problems because, unlike external modules or even certain places in the bridging core, a bridge technology is guaranteed by the core to be called in a synchronized fashion when any of the following occurs:
(1) When a bridge technology is chosen
(2) When that technology is started
(3) When that bridge has a channel added
(4) When that bridge has a channel removed
(5) When that technology is stopped
All of which covers the necessary places to know when a channel has hung up, and gives us a place where we can safely inform the other channels before the bridging framework starts doing mean things.
bridge_t38 was the result [3]. It managed a bit of T.38 state for the two channels in a core bridge that were in a T.38 fax, and, when one of them leaves, it informed the other channel that it should end its T.38 fax.
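The shape of that approach, reduced to a toy: the technology's leave hook runs before the channel is actually taken out of the bridge, so the peer can be told while both are still present. All names here (struct bridge_tech, t38_tech_leave, bridge_remove) are hypothetical; this is a sketch of the idea and not the code from [3].

```c
#include <assert.h>
#include <stddef.h>

enum t38_state { T38_NONE, T38_NEGOTIATED, T38_TERMINATED };

/* Hypothetical stand-ins for a channel and a two-party bridge. */
struct chan {
    const char *name;
    enum t38_state t38;
};

struct bridge {
    struct chan *chans[2];
    size_t count;
};

/* Hypothetical technology hook, assumed to be called with the bridge locked. */
struct bridge_tech {
    void (*leave)(struct bridge *b, struct chan *leaving);
};

/* Tell every remaining channel in a T.38 session that the fax is over. */
static void t38_tech_leave(struct bridge *b, struct chan *leaving)
{
    for (size_t i = 0; i < b->count; i++) {
        struct chan *peer = b->chans[i];
        if (peer != leaving && peer->t38 == T38_NEGOTIATED) {
            peer->t38 = T38_TERMINATED;
        }
    }
}

static const struct bridge_tech t38_tech = { t38_tech_leave };

/* The core calls the technology first, then removes the channel. */
static void bridge_remove(struct bridge *b, const struct bridge_tech *tech,
                          struct chan *leaving)
{
    size_t kept = 0;

    assert(b != NULL && tech != NULL && leaving != NULL);
    if (tech->leave != NULL) {
        tech->leave(b, leaving);
    }
    for (size_t i = 0; i < b->count; i++) {
        if (b->chans[i] != leaving) {
            b->chans[kept++] = b->chans[i];
        }
    }
    b->count = kept;
}
```

Because the notification happens inside bridge_remove(), there is no window in which the leaving channel is gone but the peer has not been told.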
Problem solved.

\o/

Not quite.
After merging [3] in f42d22d3a1, we noticed that the masquerade test [4] started to fail. That's a really, really bad sign. The masquerade 'super test' was originally written to stress test masquerades in Asterisk 1.8 and 11. It constructs a chain of 300 Local channels, then optimizes them all down to a single pair of 'real' channels. In Asterisk 12+, masquerades were eliminated in this scenario, but we instead have a series of incredibly complex Local channel optimization-caused bridge/swaps/merges that kick off as the Local channels collapse and merge their bridges down to one. It's a great "canary in the coal mine" test, as when it fails, it almost certainly means you've introduced a deadlock into one of the more complex operations in Asterisk - regardless of the version.
And lo and behold, we had.
Local channels are weird. One of the 'fun things' they do is 'help' T.38 along by passing along a channel query option for T.38 state. This lets us do ridiculous things like make sure a T.38 fax works across a Local channel chain (and is covered by the fax/sip/local_channel_t38_queryoption test). Unfortunately, the bridge_t38 module had to query for T.38 state in its compatible callback - this allowed it to determine the current state of T.38 on the channels in the bridge to see if it needed to be activated. Unfortunately, in a 300 Local channel chain, that means reaching across 300 bridges - simultaneously - locking bridges, bridge_channels, channels, PVTs, and the entire world in the process. Since the bridge lock was already held in the compatible callback, this caused a locking inversion (no surprise there), deadlocking the whole thing.
This is not a trivial locking situation to resolve. Even if we unlock the bridge, we're still liable to deadlock merely by trying to lock 300 bridges simultaneously. (There may even be another bug in here, but it is hardly worth trying to find or fix at this point.) And we can't remove the query option code in chan_local, as T.38 faxes will no longer work across Local channels.
As an aside, if there's a lesson in all this, it is that synchronous code in a heavily multi-threaded environment is bad. Message passing may be harder to write, but it is far easier to maintain.
Anyway, as a result, I've reverted the bridge_t38 module in 75c800eb28.

So what do we do now?
The crux of this problem is that the bridging framework does not have a standard way of informing a channel when it has joined or - more importantly - left a bridge. Direct media has its own mechanism managed by the RTP engine - so it works around this. However, we have a number of scenarios where "things happen" in a bridge that involve state on a channel and - right now - we don't have a unified way of handling it. In addition to T.38, we also have channels being put on hold, DTMF traversing a channel, and more. Often, the channel driver has this state - but instead, we have a lot of 'clean up' logic being added to the bridging core to handle these situations.
As I see it, we really only have two options here:
(1) Add code to the bridging framework to clean up T.38 on a channel when it leaves. This is kind of annoying, as it will happen on every channel when it leaves, regardless of whether or not the channel even supports T.38.
(2) Add a new channel technology callback that a bridge can use to inform a channel driver that it is being ejected from a bridge. This would give us a single place to put cleanup logic that has to happen in a channel driver when it is no longer bridged.
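For illustration only, option (2) might look something like the following. The bridge_leave member, the struct layouts and the function names are all invented for this sketch; no such callback exists in the channel technology structure today, and the real cleanup would be a media re-negotiation, not a flag flip.

```c
#include <assert.h>
#include <stddef.h>

struct chan;

/* Hypothetical channel technology table; bridge_leave is the proposed,
 * optional hook. Drivers that have nothing to clean up leave it NULL. */
struct chan_tech {
    const char *type;
    void (*bridge_leave)(struct chan *c);
};

struct chan {
    const struct chan_tech *tech;
    int t38_active;
    int cleanups;   /* how many times the driver's hook has run */
};

/* What a T.38-capable driver might do when told it has left a bridge. */
static void t38_driver_bridge_leave(struct chan *c)
{
    assert(c != NULL);
    if (c->t38_active) {
        c->t38_active = 0;  /* stand-in for re-negotiating back to audio */
    }
    c->cleanups++;
}

static const struct chan_tech t38_capable_tech = {
    "t38-capable", t38_driver_bridge_leave
};
static const struct chan_tech plain_tech = { "plain", NULL };

/* What the bridging core would do as a channel is ejected. */
static void bridge_eject(struct chan *c)
{
    assert(c != NULL && c->tech != NULL);
    if (c->tech->bridge_leave != NULL) {
        c->tech->bridge_leave(c);
    }
}
```

The point of the NULL check is that channel drivers with no bridge-related state pay nothing, which is the main difference from option (1).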
I'm not sure those two options will work, exactly, but they're the best options that I can think of after exhausting lots of other code changes in the bridging core. If someone has other suggestions, I'd be more than happy to entertain them.
Matt


[1] https://jenkins.asterisk.org/
[2] https://jenkins.asterisk.org/jenkins/job/periodic-asterisk-master/75/testReport/junit/%28root%29/AsteriskTestSuite/tests_fax_pjsip_directmedia_reinvite_t38/
[3] https://gerrit.asterisk.org/#/c/1761/
[4] https://jenkins.asterisk.org/jenkins/job/periodic-asterisk-master/80/testReport/junit/%28root%29/AsteriskTestSuite/tests_masquerade/
--
Matthew Jordan
Digium, Inc. | Director of Technology
445 Jan Davis Drive NW - Huntsville, AL 35806 - USA
Check us out at: http://digium.com & http://asterisk.org


Hi Matt,

Of the two ideas you propose, the channel technology callback sounds like the better option. That way, only relevant channel drivers will need to bother implementing the callback, limiting the scope of the work.

Something else to consider is that the T.38 bridge technology worked really well except for when chains of local channels were involved. The main issue was the method by which the local channels proxied information. Would there be some non-earth-shattering change that could be made to make it so that local channels store the proxied information locally when the proxied information is first set? That way, when the bridge technology queries the local channel, there is no complex reaching across bridges to get the information; the local channel already has that knowledge stored on it. I'm not 100% sure of the mechanism for getting this state stored on the local channel, but I was curious more if the idea had crossed your mind to explore that avenue.
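A rough sketch of the difference, using made-up types: struct local_chan, the hops counter and all three functions are hypothetical, and how the state would really be pushed along the chain is exactly the part I'm unsure of. Here hops stands in for the bridges (and locks) a query has to cross.

```c
#include <assert.h>
#include <stddef.h>

enum t38_state { T38_UNKNOWN, T38_NEGOTIATING, T38_NEGOTIATED };

/* Hypothetical Local channel: next is the channel reached by crossing
 * one more bridge; the last element stands for the 'real' channel. */
struct local_chan {
    struct local_chan *next;
    enum t38_state real_state;   /* meaningful only on the last element */
    enum t38_state cached_state; /* copy stored when state was set */
};

/* Bridges crossed by queries, as a proxy for locks a query must take. */
static int hops;

/* Query by proxying: every call walks the whole chain. */
static enum t38_state query_by_walking(const struct local_chan *c)
{
    assert(c != NULL);
    while (c->next != NULL) {
        hops++;
        c = c->next;
    }
    return c->real_state;
}

/* Store the state on every channel in the chain once, when it is set. */
static void chain_set_state(struct local_chan *head, enum t38_state state)
{
    for (struct local_chan *c = head; c != NULL; c = c->next) {
        c->cached_state = state;
        if (c->next == NULL) {
            c->real_state = state;
        }
    }
}

/* Query the locally stored copy: no other channel or bridge is touched. */
static enum t38_state query_cached(const struct local_chan *c)
{
    assert(c != NULL);
    return c->cached_state;
}
```

With a 300 element chain, query_by_walking() crosses 299 links on every call while query_cached() crosses none, so a bridge technology could call it with its own bridge lock held.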


