On 5 July 2018 09:56:40 BST, adamv0...@netconsultings.com wrote:
>> Of James Bensley
>> Sent: Thursday, July 05, 2018 9:15 AM
>> 
>> - 100% rLFA coverage: TI-LFA covers the "black spots" we currently have.
>>
>> 
>Yeah, that's an interesting use case you mentioned that I hadn't
>considered: no TE need, but an FRR need.
>But I guess if it was business critical to get those blind spots
>FRR-protected then you would have done something about it already,
>right?

Hi Adam,

Yeah correct, no mission-critical services are affected by this for us, so the 
business obviously hasn't allocated resources to do anything about it. If it 
were a major issue, it should be as simple as adding an extra backhaul link to 
a node or shifting existing ones around (to reshape the P-space and Q-space to 
"please" the FRR algorithm).

>So I guess it's more of a nice-to-have; now, is it enough to expose the
>business to additional risk?
>Like, for instance, yes, you'd test the feature to death to make sure it
>works under any circumstances (it's the very heart of the network after
>all; if that breaks, everything breaks), but the problem I see is then
>going to the next release a couple of years later. Since SR is a new
>thing, it will have had a ton of new stuff added to it by then, resulting
>in a higher potential for regression bugs compared to LDP or RSVP, which
>have been around forever, and where every new release is basically just
>bug fixes.

Good point, I think it's worth breaking that down into two separate 
points/concerns:

Initial deployment bugs:
We've done stuff like pay for a CPoC with Cisco, then deployed, then had it all 
blow up, then paid Cisco AS to assess the situation, only to be told it's not a 
good design :D So we now just take the default/safe view that no amount of 
testing will protect us. We ensure we have backout plans if something 
immediately blows up, heightened reporting for issues that take 72 hours to 
show up, change freezes to cover issues that take a week to show up, etc. So I 
think as far as an initial SR deployment goes, all we can do is our best with 
regards to being cautious, just as we would with any major core change. So I 
don't see the initial deployment as any more risky than other core projects 
we've undertaken, like changing vendors, entire chassis replacements, code 
upgrades between major versions, etc.

Regression bugs:
My opinion is that in the case of something like SR, which is being deployed 
based on early drafts, regression bugs are potentially a bigger issue than the 
initial deployment. I hadn't considered this. Again though, I think it's 
something we can reasonably prepare for. Depending on the potential impact to 
the business, you could go as far as standing up a new chassis next to an 
existing one but on the newer code version, running them in parallel, migrating 
services over slowly, and keeping the old one up for a while before you take it 
down. You could do something as simple as physically replacing the routing 
engine and keeping the old one on site for a bit so you can quickly swap back. 
Or, if you've got some single-homed services on there, just drain the links in 
the IGP, downgrade the code, and then un-drain the links. If you have OOB 
access and plan all the rollback config in advance, you can operationally 
support the risks no differently to any other major core change.
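
As an aside, the drain-and-downgrade option is easy enough to sanity-check in 
advance. Here's a rough, hypothetical sketch of the kind of pre-check you could 
script: the topology, node names and service-to-PE mapping are invented for 
illustration (in reality you'd pull them from the IGP LSDB and your inventory 
system), and it just flags which services are single-homed on the box being 
drained and whether taking that box out of the transit path partitions anything:

def reachable_without(graph, drained_node, start):
    """Nodes still reachable from 'start' if drained_node carries no traffic."""
    seen, stack = set(), [start]
    while stack:
        u = stack.pop()
        if u in seen or u == drained_node:
            continue
        seen.add(u)
        stack.extend(graph[u])
    return seen

def drain_precheck(graph, service_homing, node):
    """service_homing: {service_name: set of nodes the service is homed on}."""
    single_homed = [s for s, homes in service_homing.items() if homes == {node}]
    others = [n for n in graph if n != node]
    still_ok = reachable_without(graph, node, start=others[0])
    partitioned = set(others) - still_ok
    return single_homed, partitioned

# Toy data, purely illustrative.
topo = {
    "PE1": {"P1", "P2"},
    "PE2": {"P1", "P2"},
    "P1":  {"PE1", "PE2", "P2"},
    "P2":  {"PE1", "PE2", "P1"},
}
services = {
    "cust-a-l3vpn": {"PE1", "PE2"},  # dual-homed, drains away cleanly
    "cust-b-l2ckt": {"PE1"},         # single-homed, needs a maintenance window
}

hit, split = drain_precheck(topo, services, "PE1")
print("single-homed services impacted:", hit)   # ['cust-b-l2ckt']
print("nodes partitioned by the drain:", split) # set()

Nothing clever, but it's the sort of thing that turns "I think draining is 
safe" into a list you can put in the change record.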

Probably the hardest part is assessing what the risk actually is: how to know 
what level of additional support, monitoring and people you will need. If you 
under-resource the rollback of a major failure, and fuck up the rollback too, 
you might need some new pants :)

Cheers,
James.