I've been chatting with my counterparts in other web teams at Canonical for a while now about how we all gather failure information on our services. What I've found is that our approach to this problem is pretty fragmented: different reimplementations of OOPSes, and forks of the code base. Both Launchpad and other teams have scaling and latency issues surrounding OOPSes.
So, in preparation for fixing *our* issues around OOPSes, I've now written up a LEP describing what I see as our requirements and constraints, as well as some of the requirements and constraints from Ubuntu One. I expect to get similar things from other web teams over the next week or so. (If I don't, I'll go nag :P)

I'd really appreciate feedback and critique of the LEP: are there hidden assumptions I should call out that will influence our results? Have I missed a crucial problem? Is there a canned solution we can just grab and use?

The LEP: https://dev.launchpad.net/LEP/OopsDisplay

This will be a new codebase for several reasons:

 * the only thing in common with the existing code base will be the SQL statement normalising code
 * the project, like other components of Launchpad, will be AGPL3
 * transitioning the existing code base will run into the very friction that makes it hard to improve on at the moment (we've had several engineers founder trying to do nontrivial changes to it).

At this point, I think that we should do concentrated work on this sometime after doing the merge machinery and parallel testing work, but it may be that some interested folk want to do patches during idle cycles. I'll see about bootstrapping a minimal environment once all the constraints are in and the LEP has been reviewed by jml.

-Rob

