Yes, I'm still seeing problems. We've been averaging 4 seconds per
request for the past six hours, roughly correlated with the elevated
latency that you say is within tolerance and is averaging 250 ms per
request. Some of that is due to a recently discovered performance
issue: one of our common handlers was flagged as a "high CPU" handler
and sidelined to make room for more performant apps' handlers. We
haven't been able to fix that yet, but it's not just that handler:
it's every dynamic handler.

This hasn't been an issue only since last night, though; it's been on
and off for two months. We're still seeing widely variable response
times on almost every request. In the admin console, for example, most
requests take a normal second or so, but roughly a fifth of them take
7-20 seconds to respond, for actions as trivial as expanding one log
entry with no log messages in it. I know that's not in my code. Is it
instance startup cost? I think so. From my logs, I'll load something
simple like a FAQ page and it's fine, but whenever a new instance has
to start and import Django and some views, that's 7-14 seconds. Have
we changed anything in the past two months? A couple of very small
things, but mostly our development has been on another branch. It
flips between "working just fine on every request" and "several
seconds to spawn an instance running the same code as usual" from day
to day and hour to hour, and it's been getting more frequent and worse
as time goes on.
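For reference, the way I'm attributing latency to startup is by timing
the module-level imports and reporting that from a handler. A minimal
sketch (the handler name and the lightweight stand-in import are
hypothetical, not our actual code):

```python
import time

_start = time.time()
# Heavy framework imports would go here (e.g. Django and our models);
# json is just a lightweight stand-in so this sketch is self-contained.
import json
IMPORT_COST_MS = (time.time() - _start) * 1000.0

def handler():
    # Module-level import cost is paid once per fresh instance, so a
    # handler that reports it distinguishes cold starts from warm requests.
    return "import cost: %.1f ms" % IMPORT_COST_MS
```

On a warm instance the reported cost is whatever the first request
paid, which is exactly why only some requests see the 7-14 second hit.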

The simplest handler we have (it imports our models, reads one entity
at random from the datastore, and returns it, with no web framework)
normally responds in 50-250 ms and uses 50-80 ms of CPU, but sometimes
(on a new instance?) takes 1300-2000 ms and 130-300 ms of CPU. So I
made an even simpler handler as a test, with no imports, that just
prints a random number. Starting a new instance for that costs only a
70-250 ms response time. And if I make it import our models on every
request, nothing changes. It's almost as if the cost of starting a new
instance were the cost of the initialization code multiplied by
whether App Engine is feeling cheerful that hour.
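For what it's worth, that test handler is about as minimal as a
handler gets. A sketch along these lines (the function name is
hypothetical; the real thing is an old-style CGI script, and `random`
is the only module it touches):

```python
import random  # deliberately the only import: no Django, no models

def main():
    # Emit a bare text response containing a random number, so any
    # latency measured is instance spin-up, not application code.
    body = str(random.randint(0, 999999))
    print("Content-Type: text/plain\n")
    print(body)
    return body
```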

It could be our code; I could see that. If there's a reason the admin
console would be slow at the same times and in the same way as the
rest of our site, maybe that would shed some light. If we're doing
something subtly wrong when we import Django (0.96) and our models, I
could see that too, but only if there's a reason we'd see it at some
times and not others, and why it would have been getting worse without
any code changes.
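For completeness, one experiment we could run on our side is deferring
the heavy imports into the handler body, so a fresh instance serves
its first request before paying the full framework import. Purely a
sketch, with a stand-in module since I can't paste our real code here:

```python
def handler(request_path):
    # Deferred import: a fresh instance doesn't pay this cost until a
    # request actually needs the framework. json stands in for
    # "import django" plus our models (hypothetical names throughout).
    import json as framework
    return framework.dumps({"path": request_path})
```

That would only move the cost around rather than explain it, but it
might at least tell us whether the slow step is the import itself.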

It would also help a lot to know exactly how this works (from the
Quotas page):
"Applications that are heavily cpu-bound, on the other hand, may incur
some additional latency in long-running requests in order to make room
for other apps sharing the same servers."
Marzia has said that this is per-request, so I'd expect only certain
handlers to get sidelined, but this is happening site-wide;
otherwise, it would make perfect sense. How is a handler identified as
high-CPU? How long does that classification last?

Thanks,
--Nick (app id: skrit)


On Mar 3, 1:25 pm, Brett Slatkin <[email protected]> wrote:
> Hi Nick,
>
> On Tue, Mar 3, 2009 at 8:55 AM, Nick Winter <[email protected]> wrote:
>
> >http://code.google.com/status/appengine/detail/serving/2009/03/03#ae-...
>
> > Just about every day for the past several weeks, there's been elevated
> > latency like this, usually at similar times of day. It was unfortunate
> > and frustrating before, but since last night our development is
> > stalled because every part of App Engine is too slow to do any testing
> > or data manipulation right now. 5 seconds per request?
>
> > I'm confident that the App Engine team will get a handle on the
> > performance and everything will be shiny once more, but it'd be nice
> > to hear some word as to what's going on. Are the servers just
> > overloaded? Did something go wrong with the maintenance last night? Is
> > anomaly-yellow serving to be expected?
>
> We had some unexpected issues during the maintenance last night which
> caused elevated latencies and errors for all applications. We resolved
> the issue around 8:45pm last night and things have returned to normal
> since. Please let me know if you're still seeing any problems.
>
> As for the elevated latency for the dynamic request metric (that you
> linked to), this is primarily a product of alert tolerances. We're
> still tuning our status site metrics to match real-world expectations
> of App Engine performance. You'll notice today that we've raised some
> of these tolerances by a little bit, causing many of the lines to go
> back to a blue color (i.e., everything OK).
>
> -Brett