On 9/15/2014 4:15 PM, Tom Metro wrote:
> Not to say your points are invalid, but Netflix would disagree with you.
> They created a testing tool that intentionally kills random services on
> their production systems just to test that automated recovery works
> correctly.

Netflix is a highly available application system that is designed to be
robust in the face of isolated faults and to degrade gracefully under
failure conditions. Chaos Monkey is the tool that they use to test the
implementations of their designs. It works by shutting down random
Netflix-owned instances within the AWS scalable architecture.

Automated recovery in the Netflix environment is simple: spin up a new
instance that is configured identically to the one that failed. They
don't try to restart the faulted instance. It's down for the count and
it stays that way so they can analyze the fault that knocked it out.

This is a /very/ different scenario from what you might have with a
single LAMP instance where systemd keeps restarting MySQL after a
persistent fault of some sort keeps knocking it out. This isn't
automated recovery; it's an automated disaster looking to wreck your
tables.

-- 
Rich P.
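
For illustration only, here is a toy Python sketch of the two recovery policies contrasted above. It is not Netflix's or systemd's actual code; every name in it (Instance, ReplacePool, kill_random, restart_in_place) is hypothetical, and the "instances" are plain objects rather than real AWS machines. It assumes a fixed-size pool and a fault that is either present or absent.

```python
import random
from dataclasses import dataclass
from itertools import count


@dataclass
class Instance:
    ident: int
    healthy: bool = True


class ReplacePool:
    """Fixed-size pool: failed instances are replaced, never restarted."""

    def __init__(self, size):
        self._ids = count(1)
        self.active = [Instance(next(self._ids)) for _ in range(size)]
        self.quarantined = []  # failed instances, left down for analysis

    def kill_random(self, rng):
        """Chaos-style test: mark one randomly chosen active instance as failed."""
        victim = rng.choice(self.active)
        victim.healthy = False
        return victim.ident

    def recover(self):
        """Swap every failed instance for a fresh, identically configured one."""
        failed = [inst for inst in self.active if not inst.healthy]
        for inst in failed:
            self.active.remove(inst)
            self.quarantined.append(inst)
            self.active.append(Instance(next(self._ids)))
        return len(failed)


def restart_in_place(start, max_restarts):
    """Restart the same service until it starts or the cap is reached.

    Without the cap, a persistent fault would make this loop forever.
    """
    for attempt in range(1, max_restarts + 1):
        if start():
            return attempt
    raise RuntimeError(f"gave up after {max_restarts} restarts")


if __name__ == "__main__":
    pool = ReplacePool(3)
    killed = pool.kill_random(random.Random(0))
    replaced = pool.recover()
    print(f"killed instance {killed}, replaced {replaced}")
    print("quarantined:", [inst.ident for inst in pool.quarantined])
```

The design point is that recover() never touches the failed object again, so whatever broke it stays available for inspection, while restart_in_place() keeps poking the same faulty service and needs an explicit cap to stop.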
