On 8 May 2015 at 17:34, Thomi Richards <[email protected]> wrote:

> I agree that there's deeper complexity here: What caused the key-pairs to
> leak? However, our priority is keeping the system from catastrophic
> failure. I simply don't care what caused the key-pairs to leak when all
> the jobs are failing: I just want to be alerted, and to fix the immediate
> problem to hand.
There are many cases like this. Today it is keypairs. Tomorrow it is
security groups. On Thursday, keystone will go down. Should we hunt down
each of these individually? Would we not cover more ground, and save
ourselves considerable work, if we just let things fall into the dead
letter queue?

Pop the top message. See that we're out of keypairs. Fix that. Blindly
throw everything from the dead letter queue back to the front of the
system. Wait to see if anything else dead-letters. If we've actually fixed
the leak, we haven't needed to add any monitoring code (complexity), and
we won't be troubled with queue processing more than once.
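To show how little machinery that loop needs, here is a rough sketch. It
assumes we are on RabbitMQ via pika, and the queue names
('adt.deadletters', 'adt.incoming') are placeholders rather than our real
ones:

    #!/usr/bin/env python
    # Sketch: drain the dead letter queue back to the front of the system.
    import pika

    connection = pika.BlockingConnection(
        pika.ConnectionParameters('localhost'))
    channel = connection.channel()

    # Peek at the top message so a human can diagnose the root cause
    # ("we're out of keypairs") before anything is requeued.
    method, properties, body = channel.basic_get('adt.deadletters')
    if method is not None:
        print('Most recent failure: %s' % body)
        # Put it back on the dead letter queue for the drain below.
        channel.basic_nack(method.delivery_tag, requeue=True)

    # Once the underlying fault is fixed, blindly shovel every dead
    # letter back to the input queue and watch whether anything returns.
    while True:
        method, properties, body = channel.basic_get('adt.deadletters')
        if method is None:
            break  # queue drained
        channel.basic_publish(exchange='', routing_key='adt.incoming',
                              body=body, properties=properties)
        channel.basic_ack(method.delivery_tag)

    connection.close()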
> This scenario will happen again. Without monitoring, our workers will
> start failing 100% of jobs passed to them, which will halt
> proposed-migration from running. I'd like to make that less likely to
> happen.

Why aren't these jobs just going to the dead letter queue, where they can
be manually reviewed and thrown back into the mix?

> We should be building systems that we can diagnose and fix *before* our
> stakeholders call us and say "proposed migration has been broken for 2
> days now".

Agreed. If I may tweak that slightly: we should be building systems that
let us diagnose and fix /at our own pace/, without our stakeholders ever
seeing an error message. We should only give them results when there are
results to act on, and infrastructure failure is not something they can
act on. If Jenkins timeouts are fucking us here, let's fix Jenkins. Or,
you know, burn it.

>> We are only going to create false confidence and waste time if we take
>> the position that the more holes we plug, the safer we will be.
>
> To me, this boils down to "we'll never catch everything, so why bother
> trying to catch anything".

Not at all. I am suggesting that instead of leaping to solve every
narrowly defined bug, we first see whether we can bucket a larger group of
failures simply, with minimal effort and complexity. Let's be clever about
how we architect for failure. You are too smart to focus on the immediate
problem instead of the bigger picture, and to let the system constantly
put you on the back foot in a game of whack-a-mole.

> Also, yes, I think that the more holes we plug, the safer we will be.
> There are many weird and wonderful ways systems can fail, but we can,
> over time, make the system more robust. Can we ever predict and account
> for everything? Of course not. We can make things better, though. Call
> me an optimist if you want, but I don't think there's literally an
> infinite number of ways the system can fail. I think there's a finite
> upper bound. Therefore, the more situations we can monitor and alert on,
> the more robust the system will be.

You mistake my realism for pessimism. I am not saying we should throw up
our hands and declare quality a lost cause. I am suggesting we accept that
there are more failures that we can anticipate than we have time to
individually address, and that there is a whole class of unknown failures
for which we can do nothing specific. If we fix individual problems and
ignore the unknowns, our optimism creates exactly that false sense of
security, and a misperceived path towards greater stability. Let us focus
our efforts on solutions that acknowledge the unknown classes of failure:
this is not some Newtonian system that can be easily reduced to a finite
set of states.

> What happens when we run out of keypairs late on a Friday night? What
> happens if we're all busy on some new sprint and don't think to monitor
> the dead letter queue?

See below.

> But we don't inspect the dead letter queue at all! Yesterday we had 361
> messages in the adt deadletters queue. Today there are 5. I suspect we
> dumped the queue because it was getting too big. I agree that if we had
> some easy way of inspecting the queue and investigating issues we might
> catch these problems sooner, but we don't, and haven't even planned for
> one.

I would argue that's the biggest problem here. We should send the dead
letter queue size to logstash, and connect PagerDuty to the latter. Any
objections to getting that in as extra work for the next sprint?
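Concretely, something along these lines would do. It is only a sketch: it
assumes RabbitMQ via pika and a logstash tcp input with the json_lines
codec, and 'logstash.internal', port 5000 and the queue name are all
placeholders:

    #!/usr/bin/env python
    # Sketch: read the dead letter queue depth and ship it to logstash.
    import json
    import socket
    import time

    import pika

    connection = pika.BlockingConnection(
        pika.ConnectionParameters('localhost'))
    channel = connection.channel()

    # passive=True only inspects the queue; message_count comes back in
    # the declare-ok frame without touching any messages.
    declared = channel.queue_declare(queue='adt.deadletters', passive=True)
    depth = declared.method.message_count
    connection.close()

    event = {
        '@timestamp': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
        'queue': 'adt.deadletters',
        'depth': depth,
    }

    # One JSON document per line, as the json_lines codec expects.
    sock = socket.create_connection(('logstash.internal', 5000))
    sock.sendall((json.dumps(event) + '\n').encode('utf-8'))
    sock.close()

PagerDuty could then hang off an alerting rule on the shipped 'depth'
field, which keeps all of the alerting logic out of the workers
themselves.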

