I spent some time on that issue during my Vanguard shift, see
https://app.asana.com/0/8740321118011/8740321118013 for details. I'll
raise some points and ideas below, as this turned out to be more work
than I thought and the various issues and solutions seem worth
discussing.
>>>>> Vincent Ladeuil <[email protected]> writes:

> Hi,

> I've discussed that with jamespage and came up with the following
> workaround:

> modified debian/jenkins-slave.upstart
> === modified file 'debian/jenkins-slave.upstart'
> --- debian/jenkins-slave.upstart 2013-02-17 17:11:13 +0000
> +++ debian/jenkins-slave.upstart 2013-12-09 10:29:01 +0000
> @@ -17,3 +17,6 @@
>   exec start-stop-daemon --start -c $JENKINS_USER --exec $JAVA --name jenkins-slave \
>       -- $JAVA_ARGS -jar $JENKINS_RUN/slave.jar $JENKINS_ARGS
> end script
> +
> +# respawn if the slave crash
> +respawn

> I've deployed that on jatayu by adding 'respawn' to
> /etc/init/jenkins-slave.conf so daily-release-executor should now
> restart automatically (I've restarted the jenkins-slave service).

> ....

>>>>> Francis Ginther <[email protected]> writes:

> Vila,

> My recommendation is to deprecate /usr/local/bin/start-jenkins-slaves
> and rely on individual upstart jobs, one for each slave.

>>>>> Larry Works <[email protected]> writes:

> I second the motion for upstart jobs for each individual node.

Looks like we have a consensus on not using
/usr/local/bin/start-jenkins-slaves.

>>>>> Larry Works <[email protected]> writes:

> I also wouldn't mind seeing us get away from using SSH to restart
> remote nodes since that will allow us to eliminate another plugin
> (or three).

Can you elaborate on that? By 'using SSH to restart remote nodes', do
you mean us connecting via ssh and restarting the slaves manually?
Probably not, as I fail to see the link with plugins...

>>>>> Evan Dandrea <[email protected]> writes:

> On 9 December 2013 13:38, Vincent Ladeuil <[email protected]> wrote:
>> === modified file 'debian/jenkins-slave.upstart'
>> --- debian/jenkins-slave.upstart 2013-02-17 17:11:13 +0000
>> +++ debian/jenkins-slave.upstart 2013-12-09 10:29:01 +0000
>> @@ -17,3 +17,6 @@
>>   exec start-stop-daemon --start -c $JENKINS_USER --exec $JAVA --name jenkins-slave \
>>       -- $JAVA_ARGS -jar $JENKINS_RUN/slave.jar $JENKINS_ARGS
>> end script
>> +
>> +# respawn if the slave crash
>> +respawn

> respawn limit (http://upstart.ubuntu.com/cookbook/#respawn-limit)
> please.

Yup, that was (and still is) on my radar, see
https://app.asana.com/0/8740321118011/9113941145531 . A sketch of what
I have in mind is at the end of this mail.

> Otherwise we will poorly handle the case where the slave is broken
> (remember the corrupted jar?) and cannot actually be started.

I vaguely remember it but not the details; what was the symptom, and
how can we automate a check for it? See
https://app.asana.com/0/8740321118011/9113941145533 for a proposal to
check the jar validity (also sketched at the end of this mail),
feedback welcome.

Now, I stopped counting at 40 when listing all the nodes where we want
to do that (see https://app.asana.com/0/8740321118011/9113941145537).
40 is too high for a manual fix-and-deploy strategy :-/

And at that point I wonder if we really want to keep using JNLP or if
it's worth choosing a different way to connect to the slaves. Jenkins
offers two other methods:

- launch slave agents on Unix machines via ssh
- launch a slave via execution of a command on the master

My understanding (and practice on http://babune.ladeuil.net:24842) is
that the master can (and will) restart the connection when needed
(including when it's lost), so that may be a better fit[1] than
addressing all the issues we're encountering with JNLP (an example of
the command involved is at the end of this mail).

Thoughts?

In a nutshell, I feel that we'd be better served in the short term by
restarting the crashed slaves manually, with the option of adding
'respawn' when we do, and postponing the better resolution.
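For the record, here is a minimal sketch of the respawn stanza I have
in mind for debian/jenkins-slave.upstart (the 10/90 values are made up
and will need tuning):

    # respawn if the slave crashes...
    respawn
    # ...but give up if it dies more than 10 times within 90 seconds,
    # so a slave that cannot actually start (e.g. a corrupted jar)
    # doesn't respawn in a tight loop forever.
    respawn limit 10 90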
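And for the jar validity check, something along these lines could go in
a pre-start stanza. This is only a sketch: $JENKINS_MASTER_URL is a
hypothetical variable, and I'm assuming the defaults file and the
master serving slave.jar under /jnlpJars/ (both to be double-checked):

    pre-start script
        # Assuming the same defaults file the main stanza relies on
        # for $JENKINS_RUN
        if [ -r /etc/default/jenkins-slave ]; then
            . /etc/default/jenkins-slave
        fi
        # 'unzip -tq' tests the archive integrity without extracting it
        if ! unzip -tq $JENKINS_RUN/slave.jar >/dev/null 2>&1; then
            # Corrupted jar: fetch a fresh copy from the master
            rm -f $JENKINS_RUN/slave.jar
            wget -q -O $JENKINS_RUN/slave.jar \
                $JENKINS_MASTER_URL/jnlpJars/slave.jar
        fi
    end script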
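As for the 'execution of a command on the master' method, what the
master runs boils down to something like this (host name and path
hypothetical):

    ssh jenkins@some-slave java -jar /var/run/jenkins/slave.jar

The master owns that process, so when the connection drops it just runs
the command again; no upstart job would be needed on the slave at all.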
Vincent

[1]: That needs to be tested first, of course.

