Re: Validating multi-path in production?

Jeff Tantsura Fri, 12 Nov 2021 13:48:19 -0800

LAG - Micro BFD (RFC 7130) provides per-constituent-link liveness detection. MLAG is 
much more complicated (there’s a proposal in the IETF, but it is not progressing), so 
LACP is pretty much the only option.
ECMP could use good old single-hop BFD per pair.
Practically - if you introduce enough flows with one of the hash keys 
monotonically changing, eventually you’d exercise every available path. 
That on its own would not help with end2end testing, so it is usually integrated 
with a form of s/net flow to provide “proof of transit”.
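
To illustrate the flow-sweep idea, here is a minimal sketch that increments one hash key (the UDP source port) until every path has been selected at least once. Everything in it is an assumption for illustration: pick_path() is a hypothetical stand-in that uses SHA-256 over the 5-tuple, whereas real ECMP/LAG hashes are vendor-specific, and the addresses and ports are documentation values, not anything from a real network.

```python
import hashlib
import socket
import struct


def pick_path(src_ip, dst_ip, proto, src_port, dst_port, n_paths):
    """Hypothetical stand-in for a device's ECMP/LAG hash.

    Real hardware uses vendor-specific hash functions and seeds; SHA-256
    is used here only to get a deterministic, well-mixed path index.
    """
    key = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!BHH", proto, src_port, dst_port)
    )
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


def sweep_src_port(n_paths, first_port=49152, max_flows=1024):
    """Increment the source port until every path has been hit.

    Returns (flows_used, hits), where hits[i] counts flows mapped to path i.
    """
    hits = [0] * n_paths
    for i in range(max_flows):
        path = pick_path(
            "192.0.2.10", "198.51.100.20", 17, first_port + i, 33434, n_paths
        )
        hits[path] += 1
        if all(hits):
            return i + 1, hits
    return max_flows, hits


if __name__ == "__main__":
    flows, hits = sweep_src_port(n_paths=4)
    print(f"flows sent: {flows}, per-path hits: {hits}")
```

This only models which path the hash would select. In a real network you would still need per-path evidence (flow export, telemetry) to confirm that each flow actually made it across the path it hashed to.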
Inband telemetry (choose your poison) does provide the basic ID of each device a 
packet has traversed, as well as POT in some cases.
Finally - there are public Microsoft presentations on how we use IPinIP encap to 
traverse a particular path on wide-radix ECMP fabrics.


Cheers,
Jeff

> On Nov 12, 2021, at 07:55, Adam Thompson <athomp...@merlin.mb.ca> wrote:
> 
> 
> Hello all.
> Over time, we've run into occurrences of both bugs and human error, both in 
> our own gear and in our partner networks' gear, specifically affecting 
> multi-path forwarding, at pretty much all layers: Multi-chassis LAG, ECMP, 
> and BGP MP.  (Yes, I am a corner-case magnet.  Lucky me.)
> 
> Some of these issues were fairly obvious when they happened, but some were 
> really hard to pin down.
> 
> We've found that typical network monitoring tools (Observium & Smokeping, not 
> to mention plain old ping and traceroute) can't really detect a 
> hashing-related or multi-path-related problem: either the packets get through 
> or they don't.
> 
> Can anyone recommend either tools or techniques to validate that multi-path 
> forwarding either is, or isn't, working correctly in a production network?  
> I'm looking to add something to our test suite for when we make changes to 
> critical network gear.  Almost all the scenarios I want to test only involve 
> two paths, if that helps.
> 
> The best I've come up with so far is to have two test systems (typically VMs) 
> that use adjacent IP addresses and adjacent MAC addresses, and test both 
> inbound and outbound to/from those, blindly trusting/hoping that hashing 
> algorithms will probably exercise both paths.
> 
> Some of the problems we've seen show that merely looking at interface 
> counters is insufficient, so I'm trying to find an explicit proof, not 
> implicit.
> 
> Any suggestions?  Surely other vendors and/or admins have screwed this up in 
> subtle ways enough times that this knowledge exists?  (My Google-fu is 
> usually pretty good, but I'm striking out - maybe I'm using the wrong terms.)
> 
> -Adam
> 
> Adam Thompson
> Consultant, Infrastructure Services
> 
> 100 - 135 Innovation Drive
> Winnipeg, MB, R3T 6A8
> (204) 977-6824 or 1-800-430-6404 (MB only)
> athomp...@merlin.mb.ca
> www.merlin.mb.ca
