A small anecdote from my afternoon that taught me a few lessons:
Today workers started failing all tests. First it was one worker, then two, then a little later three. I happened to be looking at the Jenkins jobs and caught the error message as it went past. However, that was pure luck. I could easily imagine a scenario where no one was watching the Jenkins jobs, and it might take 24 hours to realise that proposed-migration was horribly backed up.

First thought: we might have seen it if we were tracking 'pass rate per worker' - we'd have seen one worker spike to a 100% failure rate. We could even alert on that.

After diagnosing it, it turns out the problem was that we'd slowly been leaking Nova keypairs and had hit our quota limit. This was easy to fix (I deleted the unused keypairs), but it got me thinking...

Second thought: we should be monitoring everything that has a hard quota in place. We could easily track 'number of keypairs left before the world ends' and alert if that dropped below `N`. To my mind, we already monitor disk space, and keypairs are in a similar category:

* We have a known hard quota.
* It's a pretty catastrophic failure when we run out.
* Measuring how many you have left is reasonably trivial.

If I may be permitted to jump into full engineering-implementation mode for a second... imagine a generic service that simply exposes statistics for the OpenStack tenant it's deployed in? We could then deploy one in every tenant we use, and have the stats monitoring system read those...

Anyway, I thought that was an interesting experience :D

--
Thomi Richards
[email protected]
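The 'keypairs left before the world ends' check could be sketched as a tiny threshold function. This is a hedged illustration, not anyone's actual monitoring code: the function names, the `min_headroom` parameter, and the numbers are all made up for the example, and fetching the real counts from OpenStack (e.g. with a client library or the `openstack` CLI) is only noted in a comment.

```python
# Hypothetical sketch of the "alert before the quota runs out" idea.
# In a real deployment, `used` and `limit` would come from the OpenStack
# tenant (e.g. counting keypairs via an OpenStack client); here they are
# just illustrative numbers.

def quota_headroom(used: int, limit: int) -> int:
    """How many resources remain before the hard quota is exhausted."""
    return limit - used

def should_alert(used: int, limit: int, min_headroom: int) -> bool:
    """Alert when fewer than `min_headroom` slots remain under the quota."""
    return quota_headroom(used, limit) < min_headroom

# Example: 95 keypairs used out of a quota of 100, alert when < 10 remain.
print(should_alert(used=95, limit=100, min_headroom=10))  # True
print(should_alert(used=50, limit=100, min_headroom=10))  # False
```

The same check applies to any hard-quota resource (disk, keypairs, floating IPs), which is what makes a generic per-tenant stats service attractive: one poller, many thresholds.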
--
Mailing list: https://launchpad.net/~canonical-ci-engineering
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~canonical-ci-engineering
More help   : https://help.launchpad.net/ListHelp

