On 12 December 2013 14:46, Paul Larson <[email protected]> wrote:
> On Thu, Dec 12, 2013 at 4:01 AM, Evan Dandrea
> <[email protected]> wrote:
>> - Siva mentioned that the expected device wasn't appearing in `adb
>> devices`. Can we have a nagios check for this so we know sooner?
> As for detection, we should investigate if there's a good way to do
> this in nagios or in the jobs themselves. I think it sounds feasible
> but bad device detection may be better integrated into the jobs rather
> than relying on an external service that doesn't know what state
> things are expected to be in.
Agreed. Nagios is really just a means of alerting based on some
condition. The jobs could handle identifying when something has gone
awry, tell Jenkins to hold the line, and drop a hint to nagios (a file
in an expected location).

> Another thing that should help is the
> megajob refactor that Andy has been working on. It would at least
> deal better with a situation where we lose a device and not require
> regenerating all the jobs to get things moving again. After this goes
> in, I'd like to see about adding some sort of a health check that
> figures out if the device is at least reachable, and marks it
> bad/offline if not. Before that though, we need all the bits in place
> to detect the image on it and reflash if not.

Paul, are you happy to take a task for the health check, pending the
refactor? Can you have it drop a file to hint to nagios that a phone
is dead (removing that file when things are clear)?

Where do we stand on the megajob refactoring, Andy?

> I think there are some things we could do to improve this (see above)
> and continue to look for new ways to make it as reliable as the
> devices will allow us to make it.

Thanks Paul!

-- 
Mailing list: https://launchpad.net/~canonical-ci-engineering
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~canonical-ci-engineering
More help   : https://help.launchpad.net/ListHelp
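
For reference, here is a rough sketch of how the hint-file idea from this thread might look. It is only an illustration under assumptions, not the actual health check: the function names, the `.dead` file suffix, and the hint directory are all hypothetical, and it treats any expected serial that `adb devices` does not report in the `device` state as dead. A nagios check would then only need to alert when a hint file exists.

```python
import subprocess
from pathlib import Path


def parse_adb_devices(output):
    """Return {serial: state} parsed from the text printed by `adb devices`."""
    devices = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks, the "List of devices attached" header, and daemon notices.
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices[parts[0]] = parts[1]
    return devices


def list_adb_devices():
    """Run `adb devices` (adb must be on PATH) and parse its output."""
    result = subprocess.run(
        ["adb", "devices"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
        timeout=30,
    )
    return parse_adb_devices(result.stdout)


def update_hint_files(expected_serials, devices, hint_dir):
    """Create <serial>.dead for unhealthy phones, remove it for healthy ones.

    hint_dir and the .dead suffix are hypothetical; use whatever location
    the nagios check is configured to watch. Returns the dead serials.
    """
    hint_dir = Path(hint_dir)
    hint_dir.mkdir(parents=True, exist_ok=True)
    dead = []
    for serial in expected_serials:
        hint = hint_dir / (serial + ".dead")
        state = devices.get(serial, "missing")
        if state == "device":
            # Phone is reachable again: clear the hint so nagios recovers.
            if hint.exists():
                hint.unlink()
        else:
            # Record the observed state so the alert says why it fired.
            hint.write_text("state=" + state + "\n")
            dead.append(serial)
    return dead
```

A job could call `update_hint_files(expected, list_adb_devices(), hint_dir)` before it starts testing and stop early if the returned list is non-empty. Keeping the parsing separate from the `adb` call means the logic can be exercised without a phone attached.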

