1. Evan juju upgrades the deployed test runner component with some broken code.
2. A ticket comes through to the test runner worker and it crashes. Because the worker didn't ack this message, the ticket goes back on the queue.
3. Round and round it spins. It comes up to a worker again, fails, and goes back in the queue.
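
The three steps above can be sketched with a small in-memory stand-in for the queue. This is only an illustration, not our worker code: `TicketQueue`, `broken_test_runner`, and `spin` are hypothetical names, and the exception handling inside `deliver_once` is a simplification of what the broker does when a worker dies without acking (the unacknowledged message is requeued).

```python
from collections import deque


class TicketQueue:
    """In-memory stand-in for a broker queue with ack/requeue semantics."""

    def __init__(self):
        self._ready = deque()

    def publish(self, ticket):
        self._ready.append(ticket)

    def deliver_once(self, handler):
        """Hand the next ticket to handler; requeue it if handler raises.

        Returns True if the ticket was acked, False if it was requeued,
        and None if the queue was empty.
        """
        if not self._ready:
            return None
        ticket = self._ready.popleft()
        try:
            handler(ticket)
        except Exception:
            # No ack was sent, so the ticket goes back on the queue.
            self._ready.append(ticket)
            return False
        # The handler returned normally: treat that as the ack.
        return True

    def __len__(self):
        return len(self._ready)


def broken_test_runner(ticket):
    """Hypothetical handler standing in for the broken upgrade."""
    raise RuntimeError(f"cannot process ticket {ticket!r}")


def spin(queue, handler, max_attempts):
    """Deliver up to max_attempts times; return the number of failures."""
    failures = 0
    for _ in range(max_attempts):
        result = queue.deliver_once(handler)
        if result is None or result:
            break
        failures += 1
    return failures


queue = TicketQueue()
queue.publish("ticket-1")
failures = spin(queue, broken_test_runner, max_attempts=5)
print(f"{failures} failed attempts, {len(queue)} ticket(s) still queued")
```

The attempt cap exists only so the sketch terminates; in the scenario described here the ticket keeps cycling for as long as the handler stays broken.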

Now, we could leave it in this state forever and let the user come to us to say that the ticket appears wedged, but... with each new attempt, the test runner worker reports an OOPS for its failure to process that message in the queue. We can then deal with this *asynchronously*.

Here is the cool part: we juju upgrade the deployed test runner component again and the ticket escapes the loop. The test runner finishes and passes the ticket on to the next step.

We didn't have to retry or resubmit an entire ticket. The work just sat there waiting for the environment to get better so it could continue. It wasn't a stop-the-line event. We could deal with it without worrying that a component was down and UE was losing development time because they couldn't submit new tickets.

Questions:

- Does this sound sensible? How do we know when to tell Nagios that the Vanguard needs to be contacted? On the first OOPS, or some other condition?
- This only saves us when we get as far as the Rabbit event loop. We'll have to invent some sort of watchdog for the case when the process dies prior to that point. What should that look like?
- What's unaccounted for?

--
Mailing list: https://launchpad.net/~canonical-ci-engineering

