Stephanie Daugherty writes:
Service failures should be extraordinary events, and we should strive to keep treating them as such, so that we continue to pursue stability. Restarting a service automatically doesn't improve stability of that software, it works around an instability rather than addressing the root cause - it's a band-aid over a festering wound.

Unix has a few design choices that tend to produce problems like these, such as malloc() and its C++ cousin, operator new.

malloc() is very simple: you ask for memory and you get it. The downside of that simplicity is that when the machine runs out of memory (which happens occasionally if a server is run close to capacity), processes die and/or become unresponsive. Such is the tyranny of the Poisson distribution.

The failure of a service is analogous in my eyes to the tripping of a circuit breaker - it happened for a reason, and that underlying reason is probably serious.

Pick your poison: restart services, or add failure handling around all malloc() calls. I quite like the former in many cases, even though it papers over various unintentional problems as well as providing the intentional simplification. But then I like TCP better than NCP, etc.
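
To make the two options concrete, here is a minimal sketch in C. The names xmalloc() and copy_request() are made up for illustration and are not taken from any particular service; the supervisor that restarts the process after abort() is assumed to exist and is not shown.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Option 1: fail fast. If the allocation fails, log and abort, and rely on
 * an external supervisor (assumed, not shown) to restart the service.
 * xmalloc is a hypothetical helper name. */
void *xmalloc(size_t size)
{
    void *p = malloc(size);
    if (p == NULL) {
        fprintf(stderr, "out of memory allocating %zu bytes\n", size);
        abort();
    }
    return p;
}

/* Option 2: handle the failure at the call site. Returns NULL when memory
 * is exhausted, and every caller has to check for that.
 * copy_request is a hypothetical example function. */
char *copy_request(const char *src)
{
    size_t len = strlen(src) + 1;   /* include the terminating NUL */
    char *dst = malloc(len);
    if (dst == NULL)
        return NULL;                /* caller decides: reject, shed load, ... */
    memcpy(dst, src, len);
    return dst;
}
```

With option 1 the calling code stays simple because xmalloc() never returns NULL. With option 2 every call site needs its own error path, and those paths are rarely exercised. Also note that on systems that overcommit memory, malloc() can succeed and the process can still be killed later when it touches the pages, in which case only the restart approach helps.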

Arnt

_______________________________________________
Dng mailing list
https://mailinglists.dyne.org/cgi-bin/mailman/listinfo/dng