On 5 July 2018 09:56:40 BST, adamv0...@netconsultings.com wrote:
>> Of James Bensley
>> Sent: Thursday, July 05, 2018 9:15 AM
>>
>> - 100% rFLA coverage: TI-LA covers the "black spots" we currently
>have.
>>
>Yeah that's an interesting use case you mentioned, that I haven't
>considered, that is no TE need but FRR need.
>But I guess if it was business critical to get those blind spots
>FRR-protected then you would have done something about it already
>right?
Hi Adam,
Yeah correct, no mission-critical services are affected by this for us, so the
business obviously hasn't allocated resources to do anything about it. If it was
a major issue, it should be as simple as adding an extra backhaul link to a
node or shifting existing ones around (to reshape the P-space and Q-space to
"please" the FRR algorithm).
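To make the "reshape the P-space and Q-space" idea concrete, here's a toy sketch of the remote-LFA PQ-node check on a made-up four-node topology (not our network; names and costs are invented). A node is treated as in P-space if removing the protected link doesn't change its distance from S, i.e. some pre-failure shortest path from S already avoids the link, and symmetrically for Q-space towards E:

```python
import heapq

def dijkstra(graph, src, skip_edge=None):
    """Shortest-path distances from src; optionally ignore one undirected edge."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            if skip_edge and {u, v} == set(skip_edge):
                continue  # pretend the protected link has failed
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

# Hypothetical topology; the link S-E is the one we want rLFA protection for.
graph = {
    "S": [("E", 1), ("A", 1)],
    "E": [("S", 1), ("B", 2)],
    "A": [("S", 1), ("B", 1)],
    "B": [("A", 1), ("E", 2)],
}
fail = ("S", "E")

# P-space of S: nodes S reaches at unchanged cost with the failed link removed.
full_s = dijkstra(graph, "S")
no_s = dijkstra(graph, "S", skip_edge=fail)
p_space = {n for n in graph if n not in fail and no_s.get(n) == full_s.get(n)}

# Q-space of E: nodes that reach E at unchanged cost without the failed link
# (costs here are symmetric, so distances from E equal distances to E).
full_e = dijkstra(graph, "E")
no_e = dijkstra(graph, "E", skip_edge=fail)
q_space = {n for n in graph if n not in fail and no_e.get(n) == full_e.get(n)}

print(p_space & q_space)  # PQ nodes: viable rLFA repair-tunnel endpoints -> {'B'}
```

Shifting a link or tweaking a metric changes which nodes fall into that intersection, which is all "pleasing the FRR algorithm" really means: making sure P ∩ Q is non-empty for the links you care about.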
>So I guess it's more like it would be nice to have, now is it enough
>to
>expose the business to additional risk?
>Like for instance yes you'd test the feature to death to make sure it
>works
>under any circumstances (it's the very heart of the network after all
>if
>that breaks everything breaks), but the problem I see is then going to
>a
>next release couple of years later -since SR is a new thing it would
>have a
>ton of new stuff added to it by then resulting in higher potential for
>regression bugs with comparison to LDP or RSVP which have been around
>since
>ever and every new release to these two is basically just bug fixes.
Good point, I think it's worth breaking that down into two separate
points/concerns:
Initial deployment bugs:
We've done stuff like pay for a CPoC with Cisco, then deployed, then had it all
blow up, then paid Cisco AS to assess the situation, only to be told it's not a
good design :D So we now take the default (safe) view that no amount of testing
will fully protect us. We ensure we have backout plans if something immediately
blows up, heightened reporting for issues that take 72 hours to show up, change
freezes to cover issues that take a week to show up, etc. So I think as far as
an initial SR deployment goes, all we can do is be as cautious as we would with
any other major core change. I don't see the initial deployment as any more
risky than other core projects we've undertaken, like changing vendors, entire
chassis replacements, or code upgrades between major versions.
Regression bugs:
My opinion is that for something like SR, which is being deployed based on
early drafts, regression bugs are potentially a bigger issue than the initial
deployment. I hadn't considered this. Again though, I think it's something we
can reasonably prepare for. Depending on the potential impact to the business,
you could go as far as standing up a new chassis next to an existing one but on
the newer code version, running them in parallel, migrating services over
slowly, and keeping the old one up for a while before you take it down. You
could do something as simple as physically replacing the routing engine,
keeping the old one on site for a bit so you can quickly swap back. Or, if
you've got some single-homed services on there, just drain the links in the
IGP, downgrade the code, and then un-drain the links. If you have OOB access
and plan all the rollback config in advance, the risks can be operationally
supported no differently to any other major core change.
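For the drain step, something along these lines works on Junos (interface name and metric value are placeholders, an illustration only, not a tested rollback plan; check the statements against your release):

```
# Drain a core link ahead of the downgrade by maxing out its IS-IS metric
set protocols isis interface ge-0/0/0.0 level 2 metric 16777214

# Or pull the whole box out of the transit path with the overload bit
set protocols isis overload
```

Commit the metric change, wait for traffic to shift, do the downgrade, then roll the config back to un-drain.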
Probably the hardest part is assessing what the risk actually is: how to know
what level of additional support, monitoring, and people you will need. If you
under-resource a rollback of a major failure, and fuck the rollback too, you
might need some new pants :)
Cheers,
James.
_______________________________________________
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp