On Wed, Jul 3, 2013 at 2:59 PM, Maglinger, Paul <[email protected]> wrote:
> They may think the customer doesn’t notice, but I’m willing to bet the
> average customer calls their ISP when there’s a problem with NetFlix and by
> the time they get done rebooting their cable modem and router the service is
> back up. “Average customer” then either blames their equipment or their
> ISP.

It helps to understand the environment Netflix is running in. They run almost everything in Amazon's cloud service, where they basically can't depend on any given node or cluster not disappearing without warning. So, rather than try to find a cloud host that's bullet-proof *and* affordable, they design their software architecture to withstand the inevitable cloud failures. Chaos Monkey just helps keep the developers from assuming the host won't fail.

But this only addresses host platform failures. There are plenty of other problem domains that it doesn't touch. In particular, software that does the wrong thing (rather than just becoming unavailable) can and does still cause outages.

-- Ben

