Gavin Panella wrote:
> I finished being build engineer last week. Here's a summary of the
> things I did:
>
> * Fixed bug #422433 (Race condition when running two ec2test instances
>   very close together). This needed to be fixed before working on
>   making the test suite run in parallel across several machines.
>
> * Investigated a couple of bugs: #419421 (Buildbot: over time memory
>   usage of the buildbot master process gets unreasonable) and #
>
> * Got the jscheck builder running more frequently, and, after some
>   cajoling, got it to work. Michael Hudson did the ground work for
>   this. Fixing this problem taught me a lot about how buildbot works,
>   how to configure it, and meant I got to look at a *lot* of source
>   code.
Yay, another victim!

> * With the help of the LOSAs, got another of mwhudson's lpbuildbot
>   branches, use-update-sourcecode, merged and rolled out. This removed
>   quite a lot of code from lpbuildbot and replaced it with a single
>   call to utilities/update-sourcecode.
>
> * Landed a lpbuildbot branch, avoid-deadlock, to fix a potential
>   problem in kill-test-pids where it could hang indefinitely. I can't
>   tell if this has ever affected us, but it was worth a small fix to
>   prevent it.
>
> * Prepared a lpbuildbot branch to fix bug #455737 (PYTHONPATH should
>   not be set when calling test_on_merge). This has been reviewed but
>   not merged.
>
> * Prepared a possible fix for bug #419408 (Buildbot: over time,
>   buildbot creates zombie processes) and bug #419408 (Buildbot: over
>   time, buildbot creates zombie processes). I think these two are
>   related; see comment 3 in bug 419408 for an explanation of the
>   possible culprit.

Um, you think #419408 and #419408 are related? :) I think your
explanation on the bug makes sense (just commented).

>   There are actually two branches related to this, the fix itself, and
>   a port to staging which rolls in the fix and other changes to the
>   production configs. Neither have been merged, but the fix has been
>   reviewed.
>
> * Investigated bug 433657 (tests regularly fail on buildbot with "no
>   space left on device"). Landed Launchpad branch log-statement-none
>   to disable PostgreSQL statement logging (which was set to 'all') to
>   see if that might help... but I haven't kept track of failures, so I
>   don't actually know. It should be possible to go back through the
>   build logs and figure it out.
>
>   I also documented how to put statement logging back for those who
>   want it: https://dev.launchpad.net/Debugging
>
>   RT #36179 has been filed to request disk space monitoring on the
>   slaves. It would be especially useful to get something like the disk
>   space usage report that baobab does when a disk fills up.
>
> * Branch ec2-buildout moves lib/devscripts to a separate place in the
>   tree, so that it's another develop egg. The biggest driver for doing
>   this was so that it could run with a different Python version. As of
>   next Monday that will cease to be an issue, but I think it's still
>   useful to treat it as a separate project. There's no need to
>   separate it from the Launchpad tree right now, but doing so would be
>   quite easy.
>
>   This branch is unfortunately not quite finished; hooking in the
>   tests to run was proving a hassle, but I think there's a way around
>   that (using subunit, yay). Just got to do it :-/
>
>   I'm CHR next week so maybe I'll help the community by finishing this
>   ;)
>
> * My pet project was trying to get the test suite to split itself up
>   and run on several machines in parallel, to reduce run time.
>
>   I didn't make much tangible progress on this until the last couple
>   of weeks of my stint - only a bus load of reading code and docs -
>   but, with a lot of help from jml including a 2-day sprint in London,
>   something good has come out of it. There's a branch in review -
>   lp:~allenap/launchpad/ec2-parry - that both jml and I worked on, and
>   jml has an alternative approach at lp:~jml/launchpad/dirty-parry.

Ah yes, this is lurking like a menacing thing in my inbox currently... I
will get to it, I promise!

>   See the cover letter in the ec2-parry merge proposal for an idea of
>   how it works. There are two outstanding issues to resolve before
>   it'll be generally useful: security around the RPC mechanism needs
>   tightening up, and there's a problem where workers are not running
>   all the tests. I'll be dogfooding this myself to try and figure
>   these out, but if there are any other masochists out there maybe we
>   can squash these issues quicker than I can on my own.
>
> Comments on being Build Engineer:
>
> * Getting started was daunting.
>   Suddenly having to actually know about PQM, buildbot, AWS/EC2,
>   unittest, zope.testing, and so on, was a learning cliff-face, but a
>   few things got me through. Figuring out the jscheck issue helped me
>   understand buildbot and be, frankly, less scared of it. But most of
>   all, having mwhudson and jml to talk to was probably the most
>   reassuring thing.
>
> * For a lot of the BE stint I was fighting little fires (with my water
>   pistol of limited knowledge). I got an idea of what I imagine the
>   LOSAs feel like every day :-/ (Not the water pistol bit; the
>   fighting fires bit).
>
>   I felt like I spent a lot of my time task-switching, and the lack of
>   tangible output was a bit of a downer. Coming after mwhudson, who
>   did a lot of build-related goodness, I put myself under a lot of
>   pressure to make a mark.

Maybe I'm just used to the feeling of fighting lots of little fires :-)
(I certainly seem to spend enough time doing it when I'm not BE).

>   I guess it's worth reminding future Build Engineers that it's also
>   about learning. The BuildEngineer wiki page even states as an
>   advantage that "Knowledge about the build system is spread around
>   the team."

Yes, I think this is definitely part of the goal, so if you know more
about the system now that's a success, even if you'd achieved nothing
(not that it sounds like this was the case).

> * I definitely think the BE role is worth it. It's a break from the
>   routine. I've learnt a ton that I can bring back to my normal role
>   in Bugs. I think I've made improvements to the build side of
>   Launchpad (though I wish I could have made more).
>
> * Michael said in his report that "It's hard to get things done on the
>   infrastructure in week 4!". It was difficult to get things done in
>   the last *three* weeks of the 3.1.10 cycle because there was new
>   hardware, U1, Karmic, and a Launchpad release. Oof.
I don't envy you that :)

>   Especially when it comes to buildbot and PQM, much of the BE's role
>   is LOSA intensive, and, as I've already fed back to Gary, I didn't
>   feel like I had the right to push for attention from them for BE
>   fixes (excepting show-stoppers).

As spm said, I think it's always worth asking, though for most of the
last cycle the answer would have been "no".

> * Michael also said "not being able to land branches in week 4 is a
>   pain, even more than normal", and "... the build engineer's work is
>   sort of sideways to the main thrust of launchpad development".
>
>   It might be beneficial if the BE role was 2 weeks out of sync with
>   the normal development cycle.

Hm, that's an attractive idea. The downside that I see is that it might
torpedo two cycles of the BE's "normal" development rather than just
one.

> * I did do some Bugs work during my stint. It just had to be done, but
>   it was probably <5% of my time.

Yeah, I think this is inevitable.

> * I wish I was as concise as mwhudson.

I don't think my terseness is a uniformly good thing :-)

> Have a good stint, stub!

Indeed! Gavin and I are here to share your pain :)

Cheers,
mwh

_______________________________________________
Mailing list: https://launchpad.net/~launchpad-dev
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~launchpad-dev
More help   : https://help.launchpad.net/ListHelp

