Painful as this was, hats off to you for writing this up and sharing. Much appreciated!
On Mon, Apr 30, 2018 at 3:36 PM, Ryan Huff <[email protected]> wrote:

> So here is a *neat* little situation I ran into recently, and it is worth
> sharing and reading; if this saves a life it was worth the crap I had to go
> through …..
>
> == The Scenario ==
>
> - Expressway C/E 8.10.3 cluster over WAN (2 Control Peers, 2 Edge
>   Peers)
> - Customer deployed and managed SD-WAN solution in front of the Edge
>   cluster to the Internet (with two separate transport carriers). I think
>   it was Palos, but we’ll call it a whitebox’ed solution for our purposes
> - Using MRA and B2B Expressway configs
> - UAT for MRA and B2B is accepted and works great
>
> == The Problem ==
>
> The customer applies the zone/search rule config in Expressway for CMR and
> notices that randomly, during a presentation session in the CMR, the BFCP
> server (AKA, the WebEx meeting) will close the BFCP presentation to the
> endpoint coming from the customer’s Expressway; all other BFCP clients are
> still receiving the BFCP presentation. That’s right, it *appears* that
> WebEx *kicked* the BFCP participant coming from the customer’s Edge, but
> not because the BFCP server closed the session (all other participants
> remain)! Although it was happening randomly’ish in length of time into the
> presentation, it would always happen at some point to the endpoint,
> generally around the 2 minute’ish mark.
>
> == The diagnosis ==
>
> Although random, a consistent’ish length would seem to suggest a timer /
> re-invite of some flavor, and that would be wrong, as ultimately uncovered.
> Sparing you all the gory tales of escalation and vendor bus underskirt
> sliding; the issue was, in fact, the SD-WAN solution itself.
>
> == The Explanation & The Fix ==
>
> What was happening is that every 120 seconds or so, the BFCP server (WebEx
> meeting) would send a UDP BFCP packet to all the BFCP presentation
> subscribers. According to the customer, the SD-WAN solution was
> *identifying* these packets (gotta love layer 7 capable firewalls 😊) and
> queueing them onto a physically different link than the one the stream was
> on, thus creating *physical asymmetry, delay and latency*. I specifically
> requested that all inspection capabilities be turned off for the traffic
> but I guess that isn’t the same as “identifying the traffic” …. Lol. In a
> TCP stream, this would likely be tolerated to a degree as packet loss or
> delay and/or jitter and would simply retransmit ….. but we are dealing
> with *UDP* here, no bueno.
>
> To resolve, the customer had to identify and classify the traffic and
> force an active/failover transmission through the SD-WAN solution for that
> traffic, rather than a “load balance” transmission behavior.
>
> == Sleuthing & The Closing ==
>
> In hindsight, seems simple and makes perfect sense right? However, when
> your only visibility into the network is the Expressway servers themselves,
> it can be *very* challenging to discover because at that point in the
> topology, everything looks like it is coming from and going to the VIP on
> the firewall pair. So how do you catch something like this when you can’t
> see everything? *PCAPs*. *Literally counting f**king packet sequence
> numbers for 6 hours and identifying a consistent pattern of packets coming
> out of order and being “lost”.*
>
> -Ryan-
>
> _______________________________________________
> cisco-voip mailing list
> [email protected]
> https://puck.nether.net/mailman/listinfo/cisco-voip
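
For anyone facing the same kind of hunt, the hand count of sequence numbers described above could probably be scripted. What follows is only a rough sketch, not what Ryan actually ran: it assumes you have already pulled the sequence numbers out of the capture in arrival order, and that they are a 16-bit wrapping counter (as RTP uses). The names `seq_delta` and `analyze` are made up for illustration.

```python
SEQ_MOD = 1 << 16  # assumption: 16-bit wrapping sequence counter, as in RTP


def seq_delta(new, old, mod=SEQ_MOD):
    """Signed shortest distance from old to new, allowing for wraparound."""
    half = mod // 2
    return (new - old + half) % mod - half


def analyze(seqs, mod=SEQ_MOD):
    """Walk sequence numbers in the order the packets arrived.

    Returns (late, missing):
      late    - list of (arrival_index, seq) for packets that showed up
                after a higher-numbered packet had already been seen
      missing - sorted list of skipped sequence numbers that never arrived
    """
    late = []
    missing = set()
    highest = None
    for index, seq in enumerate(seqs):
        if highest is None:
            highest = seq
            continue
        delta = seq_delta(seq, highest, mod)
        if delta > 0:
            # Anything skipped over is provisionally missing.
            for step in range(1, delta):
                missing.add((highest + step) % mod)
            highest = seq
        elif delta < 0:
            # Arrived behind a newer packet: reordered, not lost.
            late.append((index, seq))
            missing.discard(seq)
        # delta == 0 is a duplicate and is ignored here.
    return late, sorted(missing)


if __name__ == "__main__":
    arrivals = [10, 11, 13, 12, 14, 17, 18]
    late, missing = analyze(arrivals)
    print("out of order:", late)
    print("never arrived:", missing)
```

A steady trickle of `late` entries at a regular interval would be the sort of pattern that points at packets taking a different path, as opposed to plain loss, which would show up only in `missing`.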
