On Wed, Aug 3, 2011 at 1:53 AM, Gary Poster <gary.pos...@canonical.com> wrote:
> Past apps I've worked on have regarded hot cache bugs as critical, and cold
> cache bugs as something to cope with, one way or another. What LP has now is
> a higher standard, which is nice, except that we haven't managed to meet the
> lower one yet.
>
> This might just be an observation that I've shared, and we all nod our heads
> and move on. That's fine. I'm also fine with considering changing our
> policies. Options would include the following:
>
> * cold cache bugs are a lower priority, or even Won't Fix.
> * cold cache bugs are grouped together in a single critical bug which is
> about keeping our caches hot (I'm not sure what, if anything, can be
> improved here, to be clear; I'm speaking in the abstract). That kind of
> change wouldn't make the problem go away, though; it would just make it
> less frequent.

I see a couple of factors in considering a change here.

Firstly, the reasons for oopses-are-critical. There are two:

* An OOPS usually means a user being unable to use the system.
* An OOPS is something we need to investigate.

So any stream of unimportant OOPSes sucks up our maintenance squads' time: we
need to fix things so we don't see them, so that when we get an important
OOPS, we can leap on it and fix it. We want a good signal-to-noise ratio, and
we want the system to work for users.

LP's database is 300GB, more or less. That's a -lot- to fit into memory, and
that's just after a complete pack-and-optimise due to our rebuilding
everything. URLs that are rarely used are more likely to show cold cache
behaviour, and so more likely to be slow and time out.

So I think we need to design with cold cache in mind, at least in our current
environment.
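To make the cost concrete, here is a back-of-envelope sketch in Python. The
latency figures are illustrative assumptions, not measured LP numbers; the
point is only the order-of-magnitude gap between a buffer-cache hit and a
random disk read.

```python
# Rough model of request cost when some DB pages must come from disk.
# Both latency constants are assumed, illustrative values.
DISK_READ_MS = 8.0    # one random read from disk (assumption)
CACHE_READ_MS = 0.01  # one read served from the buffer cache (assumption)

def request_time_ms(pages_touched, cold_fraction):
    """Approximate cost of a request touching `pages_touched` DB pages,
    of which `cold_fraction` are not in cache and must come from disk."""
    cold = pages_touched * cold_fraction
    hot = pages_touched - cold
    return cold * DISK_READ_MS + hot * CACHE_READ_MS

# A query touching 2000 pages, fully cached vs. 25% cold:
hot_ms = request_time_ms(2000, 0.0)    # ~20 ms: comfortably fast
cold_ms = request_time_ms(2000, 0.25)  # ~4000 ms: a likely timeout
```

Even a quarter-cold cache turns a 20 ms page into a multi-second one, which
is why rarely-visited URLs dominate the timeout report.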
Designing with cold cache in mind has multiple benefits:

- it makes it *safe* for us to run with less memory than the DB, so it's
  cheaper to run LP as we continue to grow
- it will help with hot cache operations, because we'll be doing less work
  for them as well
- it helps when we have to reboot a DB server, if the clients can tolerate
  it not having the whole DB in memory right after startup

When I joined as TA a year and a bit ago, the whole team cared about
performance but was having trouble executing in a systematic way; we have
come a tremendous distance since, and learnt a great deal about what makes
the system perform well or poorly. It's true that we have not yet fixed every
slow page that was already slow a year ago, but then we knew we had a lot of
performance debt to address.

I think fixing things to be tolerant of *some* cold cache situations fits
under the existing approach just fine. We will need some schema refactorings,
some of the time; other times it's just query tuning so that we don't read
cold rows. We probably cannot remove -all- cold cache effects -all- the time.
I suggest we be guided by the numbers: if a page is in the timeout report,
then it was genuinely too slow.

HTH
-Rob

_______________________________________________
Mailing list: https://launchpad.net/~launchpad-dev
Post to     : launchpad-dev@lists.launchpad.net
Unsubscribe : https://launchpad.net/~launchpad-dev
More help   : https://help.launchpad.net/ListHelp