On Sat, May 9, 2015 at 9:52 AM, Evan Dandrea <[email protected]> wrote:
> The problem is not that root cause analysis is hard in complex
> systems. The problem is that there is no root cause of failure in
> complex systems. Newton's Third Law only works when you can isolate
> down to singular inputs, but that is necessarily not what we're
> dealing with when watching a process influenced by its environment
> (think of the Butterfly Effect). It is non-deterministic.

Who cares about root cause analysis? I just want to stop things breaking. :D

I think my anecdote is a clear example of where and how this would help us. We had a failure; its cause was easy and obvious to isolate once it was brought to my attention, and we would have caught it before it took the system down if we'd had monitoring & alerting in place.

I agree that there's deeper complexity here: what caused the key-pairs to leak? However, our priority is keeping the system from catastrophic failure. I simply don't care what caused the key-pairs to leak when all the jobs are failing: I just want to be alerted, and to fix the immediate problem to hand.

This scenario *will* happen again. Without monitoring, our workers will start failing 100% of jobs passed to them, which will halt proposed-migration from running. I'd like to make that less likely to happen. We should be building systems that we can diagnose and fix *before* our stakeholders call us and say "proposed migration has been broken for 2 days now".

> We are only going to create false confidence and waste time if we take
> the position that the more holes we plug, the safer we will be.

To me, this boils down to "we'll never catch everything, so why bother trying to catch anything". Monitoring breeds false confidence in exactly the same way as logging does. If you assume that the monitoring & logging systems are giving you a 100% accurate picture of the systems running then you're going to get hurt... so don't do that.
Instead, treat both these systems as hints to point you in the right direction. They aid early detection of problems; they're not a panacea.

Also, yes, I think that the more holes we plug, the safer we will be. There are many weird and wonderful ways systems can fail, but we can, over time, make the system more robust. Can we ever predict and account for everything? Of course not. We can make things better though. Call me an optimist if you want, but I don't think there's literally an infinite number of ways the system can fail. I think there's a finite upper bound. Therefore, the more situations we can monitor and alert on, the more robust the system will be.

> There's a myriad of ways that work can end up dead-lettered. Work in
> such a holding pattern can be investigated by a human (at their
> leisure) and thrown back onto the queue when the problem is fixed, all
> without the end user ever noticing.

What happens when we run out of keypairs late on a Friday night? What happens if we're all busy on some new sprint and don't think to monitor the dead letter queue?

> Why bother to isolate and report
> on each individual variant of these when we will always have to go to
> the queue and inspect the messages?

But we don't inspect the dead letter queue at all! Yesterday we had 361 messages in the adt deadletters queue. Today there are 5. I suspect we dumped the queue because it was getting too big. I agree that if we had some easy way of inspecting the queue and investigating issues we might catch these problems sooner, but we don't, and haven't even planned for one.

I'd like to see some alternative suggestions for how to solve the problem of being alerted when things start breaking. Manually monitoring logs or a rabbit queue doesn't count - we surely don't want to start adding manual log parsing to our list of day-to-day activities?
Perhaps there's a solution involving something that backs on to the dead letter queue, but I'd like to see a concrete proposal before discarding the approach outlined in my initial email.

Cheers,

--
Thomi Richards
[email protected]
--
Mailing list: https://launchpad.net/~canonical-ci-engineering
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~canonical-ci-engineering
More help   : https://help.launchpad.net/ListHelp

