= Attendance =

The gang's all here: bac, benji, frankban, gmb, gary_poster
= Project plan =

- We have two 24-core machines coming, one each for devel and dbdevel: yay. We have not tested regularly with that many cores: boo. We need to start testing regularly with 32-core EC2 machines when we have them. This will increase our bug count.
- We encountered some road bumps this week that really slowed us down and introduced some serious problems (bug 1004088, bug 1003206). We won't have the statistics we hoped for by our checkpoint next week.

= New tricks =

Nobody mentioned any, but people might be interested in benji's new terminal tricks: https://dev.launchpad.net/yellow/Termbeamer . It's nicely packaged and ready for beta testers. Ever thought about sharing a terminal session over Jabber/GTalk? :-)

= Any nice successes? =

- gary_poster: Apache analysis for bug 1004088 by gmb and bac was our big success this week. bac & gary_poster: another win for collaboration.
- frankban: the lxc-start-ephemeral fix, including lxc-ip, is almost ready. [Note that, since this meeting, I've discovered that lxc-ip will no longer be used, in favor of some bash code for now and an improved lxc-info later. That's a disappointment, but it should be a good end state for lxc.]

= Problems =

- We had a hard time diagnosing bug 1004088. (Take a glance at it if you are curious: https://bugs.launchpad.net/launchpad/+bug/1004088)
  benji: something to remember is that, if you are dealing with concurrent processes, linear divide-and-conquer approaches fail.
  frankban: since we are using lxc-ip, when we log in the only thing that has definitely started up is the network stack, not Apache. Expecting the machine to be fully initialized when we started our work was one proximate cause of the problem, and of why it was difficult to diagnose.
  benji: we are frequently surprised when environments are not as ready as we expect (juju etc.). When we write code that waits for something, we should remember to ask "what is our definition of ready? Does it match the definition of what we are waiting for?"
- gmb: zope.testing changes for bug 996729 broke devel because we broke the subunit stream, so ec2 test couldn't work properly (bug 1003696). This is also causing us to have to do rework. When we make fundamental changes to infrastructure, what do we need to do to keep this from happening? Try it first? That seems like a platitude, and we did try it; we just were not careful enough. It is good that we currently have a test runner that does not use subunit, because it caught the problem: it could have been much worse, with buildbot/pqm accepting broken branches into stable. When we switch to parallel testing, we will be relying even more heavily on subunit. What can we do to provide a catch for this kind of problem in the future?
  benji: Maybe we could have a minimum number of tests that must pass in buildbot?
  gary_poster: Maybe have a maximum negative diff between landings? We could do something like this in buildbot and maybe ec2 also.
  ACTION ITEM! File a bug that parallel testing *and* ec2 should have a minimum number of tests to expect, or a maximum negative diff (i.e., a given run should not have fewer tests than [number of tests in the last run] - 100). If we want ec2 to have a maximum negative diff, ec2 needs some way of getting the last test run's test count, such as from buildbot's webservice.
- benji: This is the second time we broke stuff by changing zope.testing. For the previous failure, we incorrectly cleaned up someone else's mess. We tried to prevent that mess in the future by announcing the problem on the mailing list, but that's probably not a real solution.
  benji: instead, we could have a comment at the top of the versions file about the process to follow if you are using a custom-built version of code, and point to a wiki page. We like this. ACTION ITEM!
- bac: We decided not to support our buildbot juju charm, after discussion with Robert and Francis, so it will leave the charm repo and die, because the ~charmers reasonably don't think there is enough need for it (given the preference for jenkins). Does this mean that we made a mistake in creating the juju charm? We agree that it was a net positive, given our increased juju experience, the feedback we were able to give to the juju team, the Python charm helpers that Clint intends to package, and the python-shelltoolbox that Clint intends to package and sponsor in Ubuntu. Moreover, we brought value along the way, and, in the lean philosophy, it is fine to later discard steps that were productive at the time. But we didn't question this, because it was a directive of the project to use buildbot. Perhaps in future we should question directives? Questioning all the requirements given to us is annoying and counterproductive for our clients. However, sometimes it is a good idea. Perhaps we should question requirements among ourselves first. Before we consider bringing a question to the customer, and before we spend much time on analysis, we should make a rough plan and estimate for answering it. If getting the answer is relatively cheap, and/or if the answer is potentially important, go ahead and raise the question with the customer, including the rough plan we've assembled to answer it.
  ACTION ITEM?? Should we make a checklist for starting a project?
- gary_poster: ACTION ITEM: I should mention to Francis that LP should maybe maintain the charms for as long as we use buildbot. It sure is nice to be able to quickly fire off a buildbot environment to test changes in.
- gmb: “Having root makes you stupid.” We have all ignored the issues with make clean and /var/tmp/bazaar.launchpad.dev in the past because on our own machines we could (bug 1004088). Whenever you are about to use a big hammer on a problem, stop and think whether you can use a little hammer.
  gary_poster: If you encounter an annoyance and investigating it now doesn’t make sense, add a slack task idea card. That might help you remember to investigate later when you have a moment, and still let you get your active card done now.
  ACTION ITEM?? Should we have a checklist of what to do when we encounter an annoyance? Can we do something else to turn this into a process?
- gary_poster: Our juju charm tests have bitrotted. Why didn't we know sooner? We had automated tests that were supposed to be run by the charm repository, but the regular runs are not ready yet.
- bac: Similarly, why did we not see the problem sooner for bug 1004088? The buildbot change that triggered the bug landed Friday.
  gary_poster: Because we have no real automatic testing (Gary is the automatic testing system), and because we had multiple big issues at once (also bugs 1003696 and 1003206 as fallout from 996729). We tried to set up the automatic testing earlier but broke out of the timebox. :-/
- bac: Getting experimental changes through our environment is really hard: the lpsetup ppa is hard, and it is tied to LP code changes.
  benji: the complexity of things interconnecting is a common source of these annoyances.
  gary: Rich Hickey/Clojure calls that complecting.
  francesco: lpsetup could have a configuration file.
  benji: call chain visualization would be nice, but we are talking about projects interconnecting, not code interconnecting internally. How do we identify these sorts of problems early?
  benji: have a requisite boxes-and-lines diagram? Maybe too much.
  gary_poster: Another direction: if we wait on the computer for more than a minute to do something (for example), raise that as a problem at the weekly call. As an example, we could have done a lot better if we had sped up our juju startup time at the beginning. The whole parallel project is an acknowledgement of the importance of faster turnarounds. Long wait times are arguably an indication of problems, in addition to being a problem themselves.
  ACTION ITEM: add this question to the weekly retrospective call's problem identification checklist.
- gary_poster: We are not delivering value incrementally. Can we be?
  benji: We are fixing some bugs, so that is incremental value.
  bac: Maybe we should actually try to fix the big critical thing we encountered with Apache (the "real" bug for bug 1004088).

_______________________________________________
Mailing list: https://launchpad.net/~launchpad-dev