Re: [asterisk-dev] [Code Review] 4135: Resolve race condition that can result in ARI apps not being notified of transfers

Mark Michelson Tue, 04 Nov 2014 13:45:48 -0800

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/4135/
-----------------------------------------------------------


(Updated Nov. 4, 2014, 9:44 p.m.)


Review request for Asterisk Developers.


Changes
-------

Because of what Richard pointed out in the previous review, the approach I have 
taken was not workable due to the race condition not being resolved in parking 
code.

Going by the approaches listed in the original description, I have shifted from 
approach 4 to approach 2. This means there is an API change in the way that 
transfers are published. Before this change, the call to publish a transfer was 
a single thing that created the transfer message and published. Now this has 
been separated into functions to create the transfer message and to publish it. 
For blind transfers, we take the approach of creating a minimal transfer 
message and fill in the details as we get them. Then we publish when we're 
ready to do so. For attended transfers, we create a minimal transfer message as 
well, but we have helper functions to complete the transfer message depending 
on how the transfer ends up progressing. With this approach, snapshots of 
bridges are taken early in the transfer process so that when it comes time to 
publish the transfer, we publish details as they existed when the transfer was 
initiated.


Repository: Asterisk


Description (updated)
-------

During blind transfer testing, it was noticed that tests were failing 
occasionally because the ARI blind transfer event was not being sent. After 
investigating, I detected a race condition in the blind transfer code. When 
blind transferring a single channel, the actual transfer operation (i.e. 
removing the transferee from the bridge and directing them to the proper 
dialplan location) is queued onto the transferee bridge channel. After queuing 
the transfer operation, the blind transfer Stasis message is published. At the 
time of publication, snapshots of the channels and bridge involved are created. 
The ARI subscriber to the blind transfer Stasis message then attempts to 
determine if the bridge or any of the involved channels are subscribed to by 
ARI applications. If so, then the blind transfer message is sent to the 
applications. The way that the ARI blind transfer message handler works is to 
first see if the transferer channel is subscribed to. If not, then iterate over 
all the chan
 nel IDs in the bridge snapshot and determine if any of those are subscribed 
to. In the test we were running, the lone transferee channel was subscribed to, 
so an ARI event should have been sent to our application. Occasionally, though, 
the bridge snapshot did not have any channels IDs on it at all. Why?

The problem is that since the blind transfer operation is handled by a separate 
thread, it is possible that the transfer will have completed and the channels 
removed from the bridge before we publish the blind transfer Stasis message. 
Since the blind transfer has completed, the bridge on which the transfer 
occurred no longer has any channels on it, so the resulting bridge snapshot has 
no channels on it. Through investigation of the code, I found that attended 
transfers can have this issue too for the case where a transferee is 
transferred to an application.

To fix this problem, I thought of four possible fixes:
1) Let the thread that actually performs the blind transfer publish the Stasis 
message.
2) Get the bridge snapshot before queuing the blind transfer operation.
3) Publish the blind transfer Stasis message before queuing the blind transfer 
operation.
4) Hold the bridge lock while queuing the blind transfer operation and 
publishing the blind transfer Stasis message.

Option 1 is clearly the "best" option, but it also is made nearly impossible 
due to the way that bridge channel operations are queued. Bridge channel 
operations use frames, which require that their payload be copyable using 
memcpy(). Including any sort of pointer is a no-no. So I would be forced to 
come up with some inane method of representing multiple channels and bridges in 
a frame in order to do this.

Option 2 is slightly more workable. Currently, there are functions to publish 
blind and attended transfers that require bridges and channels, not snapshots. 
Changing the API to accommodate snapshots is possible, but it is more 
widespread than I would like, and it changes the API. It also runs the slight 
risk of publishing "stale" data, though that is not likely.

Option 3 is easiest to implement, but it runs the (very slight) risk that we 
could publish that a blind transfer happened successfully when in fact the 
attempt to queue the blind transfer operation failed. I almost went with this, 
but I felt like the possibility of lying in our events was a bad thing to do.

Option 4 is my least favorite, but within a release felt like the most viable 
method. The bridge is already locked while publishing the transfer messages, 
and I didn't see any harm in having the bridge locked while queuing the 
transfer operation either. By holding the bridge lock the entire time, we 
prevent the bridge state from changing out from under us while we publish our 
Stasis message. This means that the bridge will stay stable until we finally 
unlock after publishing the Stasis message.

EDIT NOV 4: Richard pointed out a problem with Option 4, so from revision 3 
onwards, the approach now taken is option 2. See the notes on revision 3 for 
more details about the changes.


Diffs (updated)
-----

  /branches/12/res/stasis/app.c 426802 
  /branches/12/main/stasis_bridges.c 426802 
  /branches/12/main/cel.c 426802 
  /branches/12/main/bridge_basic.c 426802 
  /branches/12/main/bridge.c 426802 
  /branches/12/include/asterisk/stasis_bridges.h 426802 
  /branches/12/apps/app_queue.c 426802 

Diff: https://reviewboard.asterisk.org/r/4135/diff/


Testing
-------

I have re-run the blind transfer tests that had occasionally failed, and I do 
not see failures. More importantly, the tests still function properly and the 
change to locking has not introduced any noticeable bad effects.


Thanks,

Mark Michelson

-- 
_____________________________________________________________________
-- Bandwidth and Colocation Provided by http://www.api-digital.com --

asterisk-dev mailing list
To UNSUBSCRIBE or update options visit:
   http://lists.digium.com/mailman/listinfo/asterisk-dev

Re: [asterisk-dev] [Code Review] 4135: Resolve race condition that can result in ARI apps not being notified of transfers

Reply via email to