On Mon, Dec 16, 2013 at 11:49:44AM +0100, Vincent Ladeuil wrote:
> Hi,
>
> So, I previously set up jenkins jobs
> (http://q-jenkins.ubuntu-ci:8080/job/autopilot-trusty-daily_release/)
> using otto to stop containers left running as a catch-all to solve the
> deadlock of otto checking for a running container before attempting a
> new job. In other words, the design was such that if a job left a
> container running, no other jobs could be attempted. The workaround is
> to make such jobs check for a container as a Post Build task that is run
> even if the job times out or is aborted.
>
> This worked. For some time.
>
> A new case appeared last Friday
> (https://wiki.canonical.com/UbuntuEngineering/CI/IncidentLog/2013-12-13-qa-intel-4000-kernel-crash)
> where 'lxc-stop' would hang for (yet) unknown reasons
> (https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1261338 filed).
>
> The only way I could find to get the host back to a working state was to
> reboot it :-/
>
> So, after trying 'lxc-stop -t <timeout>', which also hung :-/ I've
> settled on:
>
>   modified jenkins/stop_running_container
>
>
> === modified file 'jenkins/stop_running_container'
> --- jenkins/stop_running_container 2013-12-16 08:56:03 +0000
> +++ jenkins/stop_running_container 2013-12-16 10:04:30 +0000
> @@ -35,13 +35,9 @@
>  for c in ${RUNNING_CONTAINERS} ; do
>      echo "W: Will stop '$c' left running and blocking further otto jobs"
>      # Make sure we'll continue even if the container is not running anymore
> -    set +e
> -    # Stop the container by nuking it, codename: Little Boy
> -    sudo lxc-stop -k -t 120 -n $c
> -    ret=$?
> -    if [ $ret -ne 0 ]; then
> -        # This wasn't enoug, use a more powerful nuke, codename: Fat Man
> -        (echo "Couldn't stop the container, reboot..."; sleep 20; sudo reboot)&
> -    fi
> -    set -e
> +
> +    # Since 'lxc-stop', 'lxc-stop -k' fail in some contexts, and that
> +    # 'lxc-stop -t <timeout>' can hang, just use reboot
> +    echo "Couldn't stop the container, rebooting..."
> +    sudo shutdown -r now
>  done
>
> And cherry-picked that change in
> http://q-jenkins.ubuntu-ci:8080/job/autopilot-trusty-daily_release.
>
> Note that lp:~vila/otto/stop-running-container is not deployed on all
> otto nodes, keep that in mind when deploying further changes there.
>
> I'm seeking feedback from the team for ideas on how to better
> recover from such failures. Some already identified leads being:
>
> - use a kvm with a passthrough graphics card so the host is immune to
>   related crashes (long term, requires significant changes in otto),
>
> - better track the causes of containers left running in otto itself so
>   we rely on the catch-all less and less as they are fixed.
>
> I've added Stephane in CC for feedback on lxc itself, it's a bit weird
> that there is no way to forcefully stop a container or at least get an
> error (and not a hang) when this happens.

Did you try "lxc-stop -n <container> -k" which is the upstream supported
way of forcefully killing a container?

In theory lxc-stop sends SIGPWR, then waits 30s and sends SIGKILL to
init. If SIGKILL doesn't work, then you have much bigger problems
(typically kernel related).

So please try with -k. If that doesn't work, please let me access one of
those hanging machines so I can confirm that it's not an LXC issue and
that something in the kernel is indeed making one of the tasks
unkillable.

>
> Vincent

-- 
Stéphane Graber
Ubuntu developer
http://www.canonical.com
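
For the "get an error instead of a hang" part of the question, here is a
minimal sketch that puts a hard deadline around the forced stop. It
assumes GNU coreutils 'timeout' is installed on the host.
run_with_deadline, stop_or_report, STOP_CMD and STOP_TIMEOUT are
hypothetical names used for illustration only; they are not part of otto
or LXC. Caveat: if a task really is stuck in the kernel in an unkillable
state, 'timeout' will keep waiting for it as well, so this only helps
while SIGKILL can still be delivered.

```shell
#!/bin/sh
# Hypothetical helper (not part of otto): run a command with a deadline.
# Usage: run_with_deadline SECONDS COMMAND [ARGS...]
# Returns the command's own exit status, or 124 when coreutils 'timeout'
# had to terminate it after SECONDS (137 if SIGKILL was needed).
run_with_deadline() {
    deadline=$1
    shift
    # -k 5: if the command survives SIGTERM, send SIGKILL 5 seconds later.
    timeout -k 5 "$deadline" "$@"
}

# Assumption: the stop command is configurable so the sketch can be
# exercised on a machine without LXC. The default is the forced stop
# discussed in this thread.
STOP_CMD=${STOP_CMD:-"sudo lxc-stop -k -n"}

# Hypothetical wrapper: try to stop one container and report the outcome.
# Returns 0 if the stop command succeeded in time, 1 otherwise.
stop_or_report() {
    container=$1
    # STOP_CMD is left unquoted on purpose so it splits into command + flags.
    if run_with_deadline "${STOP_TIMEOUT:-120}" $STOP_CMD "$container"; then
        echo "I: stopped '$container'"
        return 0
    fi
    echo "E: could not stop '$container' within the deadline" >&2
    return 1
}
```

In the loop from the diff above this could be used as
'stop_or_report "$c" || sudo shutdown -r now', keeping the reboot as the
last resort rather than the first.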
-- 
Mailing list: https://launchpad.net/~canonical-ci-engineering
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~canonical-ci-engineering
More help   : https://help.launchpad.net/ListHelp

