Re: Global Akamai Outage

Jared Mauch Sun, 25 Jul 2021 08:14:14 -0700

My work hat is not on, but this includes context from prior workplaces, etc. 

> On Jul 25, 2021, at 2:22 AM, Saku Ytti <[email protected]> wrote:
> 
> It doesn't seem like a tenable solution, when the solution is 'do
> better', since I'm sure whoever did those checks did their best in the
> first place. So we must assume we have some fundamental limits what
> 'do better' can achieve, we have to assume we have similar level of
> outage potential in all work we've produced and continue to produce
> for which we exert very little control over.


I have seen a very strong culture around risk, and risk avoidance whenever 
possible, at Akamai. Even minor changes are taken very seriously. 

I appreciate that on a daily basis, and when mistakes are made (I am human after 
all), reviews of the mistakes and corrective steps are planned and followed up 
on. I'm sure this time will not be different. 

I also get how easy it is to be cynical about these issues. There's always 
someone with power who can break things, but those same people can often fix 
them just as fast. 

Focus on how you can make a routing change transactionally and roll it back, 
how you can test it, etc. 

This is why for years I told one vendor that had a line-by-line parser that 
their system was too unsafe for operation. 
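
To make the transactional idea concrete, here is a minimal sketch in Python. 
It is not any vendor's API or anyone's production tooling: RoutingConfig, 
validate, apply_transaction and the post_check callback are all hypothetical 
names, and the "router" is just a dict of prefix to next hop. The point is only 
that the whole candidate is checked before anything changes, and that a failed 
post-change check restores the previous state as a unit. 

```python
import copy


class RoutingConfig:
    """Hypothetical in-memory stand-in for a router's running config."""

    def __init__(self, routes=None):
        self.routes = dict(routes or {})  # prefix -> next hop


def validate(candidate):
    """Hypothetical whole-config sanity checks; returns a list of errors."""
    errors = []
    if "0.0.0.0/0" not in candidate:
        errors.append("default route would be removed")
    for prefix, next_hop in candidate.items():
        if not next_hop:
            errors.append("no next hop for " + prefix)
    return errors


def apply_transaction(running, changes, post_check):
    """Apply all changes or none. A next hop of None deletes the prefix."""
    # Stage the change on a copy, so the running config is untouched.
    candidate = copy.deepcopy(running.routes)
    for prefix, next_hop in changes.items():
        if next_hop is None:
            candidate.pop(prefix, None)
        else:
            candidate[prefix] = next_hop

    # Validate the complete candidate before committing anything.
    if validate(candidate):
        return False

    # Commit in one step, keeping the old state for rollback.
    previous = running.routes
    running.routes = candidate

    # Test after the change; roll back as a unit if the check fails.
    if not post_check(running):
        running.routes = previous
        return False
    return True
```

By contrast, a parser that applies each line as it reads it can stop halfway 
through a change, which leaves nothing clean to roll back to. 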

There are also other questions, like:

How can we improve response times when things are routed poorly? Time to 
mitigate hijacks is improved by the majority of providers doing RPKI OV, but 
interprovider response time scales are much longer. I also think about the two 
big CTL long-haul and routing issues last year. How can you mitigate these 
externalities? 

- Jared
