On Wed, Aug 3, 2011 at 1:53 AM, Gary Poster <gary.pos...@canonical.com> wrote:

> Past apps I've worked on have regarded hot cache bugs as critical, and cold 
> cache bugs as something to cope with, one way or another.  What LP has now is 
> a higher standard, which is nice, except that we haven't managed to meet the 
> lower one yet.
>
> This might just be an observation that I've shared, and we all nod our heads 
> and move on.  That's fine.  I'm also fine with considering changing our 
> policies.  Options would include the following:
>
>  * cold cache bugs are a lower priority, or even Won't Fix.
>  * cold cache bugs are grouped together in a single critical bug which is 
> about keeping our caches hot (I'm not sure what, if anything, can be improved 
> here, to be clear; I'm speaking in the abstract).  That kind of change 
> wouldn't make the problem go away, though; it would just make it less 
> frequent.

I see a couple of factors in considering a change here.

Firstly, there are two reasons for our oopses-are-critical policy:
 * An OOPS usually means a user being unable to use the system
 * An OOPS is something we need to investigate

So any stream of unimportant OOPSes saps our maintenance squads' time:
we need to fix things so we don't see them, so that when we get an
important OOPS we can leap on it and fix it. We want a good
signal-to-noise ratio.

And we want the system to work for users.

LP's database is 300GB, more or less. That's a -lot- to fit into
memory, and that's just after a complete pack-and-optimise due to our
rebuilding everything.

URLs that are rarely used are more likely to hit cold cache
behaviour, and so more likely to be slow and time out.

So, I think we need to design with cold cache in mind, at least with
our current environment. Designing with cold cache in mind has
multiple benefits:
 - it makes it *safe* for us to run with less memory than the DB's
size - so it's cheaper to run LP as we continue to grow
 - it will help with hot cache operations because we'll be doing less
work for them as well
 - it helps when we have to reboot a db server, if the clients can
tolerate it not having the whole DB in memory right after startup.

When I joined as TA a year and a bit ago, the whole team cared about
performance, but was having trouble executing in a systematic way; we
have come a tremendous distance since, and learnt a great deal about
what makes the system perform well or poorly.

It's true that we have not yet fixed every page that was already
slow a year ago, but then we knew we had a lot of performance debt to
address.

I think fixing things to be tolerant of *some* cold cache situations
fits under the existing approach just fine - we will need some schema
refactorings, some of the time; other times it's just query tuning so
that we don't read cold rows.
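To make the query-tuning point concrete, here is a toy sketch (not
Launchpad code - the table, columns, and index name are invented, and
sqlite3 stands in for PostgreSQL): without an index the lookup scans
every row, faulting cold pages into cache, while an indexed lookup
touches only the rows it needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bugtask (id INTEGER PRIMARY KEY, status TEXT, title TEXT)")
# Populate a hypothetical table where only 1 row in 100 is 'critical'.
conn.executemany(
    "INSERT INTO bugtask (status, title) VALUES (?, ?)",
    [("critical" if i % 100 == 0 else "triaged", "bug %d" % i)
     for i in range(10000)])

query = "SELECT id, title FROM bugtask WHERE status = 'critical'"

# Before tuning: the planner has no choice but a full table scan,
# reading every (possibly cold) row.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# After tuning: with an index on status, only matching rows are read.
conn.execute("CREATE INDEX bugtask__status__idx ON bugtask (status)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(plan_before)  # plan detail mentions a scan of bugtask
print(plan_after)   # plan detail mentions the bugtask__status__idx index
```

The same reasoning applies whether the fix is an index, a schema
refactoring, or rewriting the query - the goal is the same: shrink the
set of rows the database has to pull off disk.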

We probably cannot remove -all- cold cache effects -all- the time. I
suggest we be guided by the numbers: if a page is in the timeout
report, then it was genuinely too slow.

HTH
-Rob

_______________________________________________
Mailing list: https://launchpad.net/~launchpad-dev
Post to     : launchpad-dev@lists.launchpad.net
Unsubscribe : https://launchpad.net/~launchpad-dev
More help   : https://help.launchpad.net/ListHelp
