I think Engineering Deathmatch should be more about troubleshooting than configuration. This is the type of stuff which separates the Engineers from the Installers.
On Mon, Apr 30, 2018 at 10:34 AM Ryan Huff <[email protected]> wrote:
> It’s funny what we’ll do for “strategic customers”. Lol, ah well there was
> Jack and Coke at the end of that rainbow for sure :).
>
> Sent from my iPhone
>
> On Apr 30, 2018, at 11:27, Anthony Holloway <[email protected]> wrote:
>
> Yes, what James said, thank you for sharing this info. I think I would
> have given up at "counting f**king packet sequence numbers."
>
> On Mon, Apr 30, 2018 at 10:13 AM James Buchanan <[email protected]> wrote:
>
>> Painful as this was, hats off to you for writing this up and sharing.
>> Much appreciated!
>>
>> On Mon, Apr 30, 2018 at 3:36 PM, Ryan Huff <[email protected]> wrote:
>>
>>> So here is a *neat* little situation I ran into recently that is worth
>>> sharing and reading; if this saves a life it was worth the crap I had
>>> to go through …..
>>>
>>> == The Scenario ==
>>>
>>> - Expressway C/E 8.10.3 cluster over WAN (2 Control Peers, 2 Edge Peers)
>>> - Customer deployed and managed SD-WAN solution in front of the Edge
>>>   cluster to the Internet (with two separate transport carriers). I
>>>   think it was Palos, but we’ll call it a whitebox’ed solution for our
>>>   purposes
>>> - Using MRA and B2B Expressway configs
>>> - UAT for MRA and B2B is accepted and works great
>>>
>>> == The Problem ==
>>>
>>> The customer applies the zone/search rule config in Expressway for CMR
>>> and notices that randomly, during a presentation session in the CMR,
>>> the BFCP server (AKA, the WebEx meeting) will close the BFCP
>>> presentation to the endpoint coming from the customer’s Expressway; all
>>> other BFCP clients are still receiving the BFCP presentation. That’s
>>> right, it *appears* that WebEx *kicked* the BFCP participant coming
>>> from the customer’s Edge, but not because the BFCP server closed the
>>> session (all other participants remain)! Although it was happening
>>> randomly’ish in length of time into the presentation, it would always
>>> happen at some point to the endpoint, generally around the 2
>>> minute’ish mark.
>>>
>>> == The diagnosis ==
>>>
>>> Although random, a consistent’ish length would seem to suggest a timer /
>>> re-invite of some flavor, and that would be wrong, as ultimately
>>> uncovered. Sparing you all the gory tales of escalation and vendor bus
>>> underskirt sliding; the issue was in fact, the SD-WAN solution itself.
>>>
>>> == The Explanation & The Fix ==
>>>
>>> What was happening is that every 120 seconds or so, the BFCP server
>>> (WebEx meeting) would send a UDP BFCP packet to all the BFCP
>>> presentation subscribers. The customer’s SD-WAN solution was
>>> *identifying* these packets according to the customer (gotta love
>>> layer 7 capable firewalls 😊) and queueing them onto a physically
>>> different link than the one the stream was on, thus creating *physical
>>> asymmetry, delay and latency*. I specifically requested that all
>>> inspection capabilities be turned off for the traffic but I guess that
>>> isn’t the same as “identifying the traffic” …. Lol. In a TCP stream,
>>> this would likely be tolerated to a degree as packet loss or delay
>>> and/or jitter and would simply retransmit ….. but we are dealing with
>>> *UDP* here, no bueno.
>>>
>>> To resolve, the customer had to identify and classify the traffic and
>>> force an active/failover transmission through the SD-WAN solution for
>>> that traffic, rather than a “load balance” transmission behavior.
>>>
>>> == Sleuthing & The Closing ==
>>>
>>> In hindsight, seems simple and makes perfect sense right? However, when
>>> your only visibility into the network is the Expressway servers
>>> themselves, it can be *very* challenging to discover because at that
>>> point in the topology, everything looks like it is coming from and
>>> going to the VIP on the firewall pair. So how do you catch something
>>> like this when you can’t see everything? *PCAPs*. *Literally counting
>>> f**king packet sequence numbers for 6 hours and identifying a
>>> consistent pattern of packets coming out of order and being “lost”.*
>>>
>>> -Ryan-
>>>
>>> _______________________________________________
>>> cisco-voip mailing list
>>> [email protected]
>>> https://puck.nether.net/mailman/listinfo/cisco-voip
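For anyone who ends up in the same spot, the sequence-number counting Ryan describes can be partly automated. What follows is only a minimal sketch, not what Ryan actually ran: it assumes the sequence numbers have already been exported from the capture, in arrival order, as a plain list of integers, and it assumes a 16-bit counter that wraps (as RTP sequence numbers do). The names `seq_delta`, `SeqReport` and `analyze` are made up for illustration.

```python
from dataclasses import dataclass, field

SEQ_MOD = 1 << 16  # assumption: 16-bit wrapping sequence numbers, as in RTP


def seq_delta(prev: int, cur: int) -> int:
    """Signed distance from prev to cur, allowing for counter wraparound."""
    d = (cur - prev) % SEQ_MOD
    return d - SEQ_MOD if d >= SEQ_MOD // 2 else d


@dataclass
class SeqReport:
    out_of_order: list = field(default_factory=list)  # (arrival index, seq)
    missing: list = field(default_factory=list)       # skipped and never seen


def analyze(seqs):
    """Scan sequence numbers in arrival order and flag late or lost packets."""
    report = SeqReport()
    if not seqs:
        return report
    highest = seqs[0]
    gaps = {}  # dict used as an insertion-ordered set of skipped numbers
    for i, s in enumerate(seqs[1:], start=1):
        d = seq_delta(highest, s)
        if d > 0:
            # Forward jump: everything skipped is provisionally missing.
            for k in range(1, d):
                gaps[(highest + k) % SEQ_MOD] = None
            highest = s
        elif s in gaps:
            # A skipped number showed up late: the packet arrived out of order.
            report.out_of_order.append((i, s))
            del gaps[s]
        # Anything else is a duplicate of a packet already seen; ignore it.
    report.missing = list(gaps)
    return report


if __name__ == "__main__":
    sample = [65533, 65534, 0, 65535, 1, 3, 4]
    result = analyze(sample)
    print("out of order:", result.out_of_order)
    print("missing:", result.missing)
```

In the sample list, 65535 arrives after 0 (late, across the wrap) and 2 never arrives at all. Comparing where the late and missing entries cluster in time is what would expose a pattern like the every-120-seconds one described above.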
