[google-appengine] Re: Latency spikes and aborted requests in the last 24 hours

Jason C Thu, 23 Sep 2010 10:06:03 -0700

We get lots of bursts of loading-request failures. Most recently: 9:45 AM
to 9:47 AM log time (2010-09-23). App ID: steprep


We also get a lot of ongoing 10-second timeouts. According to the logs,
our user-facing requests are below the 1000ms threshold, so I don't
know why we get so many of these timeouts
(i.e., the "Request was aborted after waiting too long to attempt to
service your request. This may happen sporadically when the App Engine
serving cluster is under unexpectedly high or uneven load. If you see
this message frequently, please contact the App Engine team." error).

Our QPS varies throughout the day from 1 to around 12, though we have
jobs that introduce short spikes of 50-60 QPS.

j

On Sep 23, 10:39 am, Eli Jones <[email protected]> wrote:
> My app is extremely lightweight.
>
> The average CPU usage for a cold start falls in these ranges:
>
> CPU_MS:    1030 - 1230
> API_CPU_MS:  408
>
> The rough outline of the code that runs is:
>
> 1.  Pull credentials from memcache and decrypt.
> 2.  Connect to gdata service, and get some data.
> 3.  Create 4 new entities (of same Model type) containing this fetched data
> and append to a list.
> 4.  Stick those entities in memcache as protobufs.
> 5.  Create new entity (for a different Model) and append to the list used in
> Step 3.
> 6.  db.put(list) to the datastore. (or keep trying until success)
> 7. If necessary, add task to run at next 5 minute mark.
>
> So.. as long as memcache is working and the encrypted credentials are
> there.. there are not any Gets from the datastore.  There is just the Put
> that is putting 4 entities of one model type and 1 entity of another model
> type all at once.
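
Steps 6 and 7 in the outline above can be sketched roughly as follows.
This is a minimal, self-contained illustration, not the actual code from
the app: put_with_retry and seconds_until_next_mark are hypothetical
names, the put argument stands in for db.put, and the sketch caps the
number of attempts with a backoff instead of retrying forever, which is
an assumption that differs from "keep trying until success".

```python
import datetime
import time


def seconds_until_next_mark(now, interval_minutes=5):
    """Seconds from `now` to the next multiple of interval_minutes.

    Assumes interval_minutes divides 60 evenly and ignores microseconds.
    A time that is exactly on a mark yields a full interval.
    """
    interval = interval_minutes * 60
    elapsed = (now.minute * 60 + now.second) % interval  # seconds past last mark
    return interval - elapsed


def put_with_retry(put, entities, max_attempts=5, base_delay=0.5,
                   sleep=time.sleep):
    """Call put(entities) until it succeeds or max_attempts is reached.

    `put` is any callable (a stand-in for db.put in this sketch).
    The delay doubles after each failed attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return put(entities)
        except Exception:
            if attempt == max_attempts:
                raise  # give up and let the task queue retry the request
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    delay = seconds_until_next_mark(datetime.datetime.now())
    print("next run in %d seconds" % delay)
```

The computed delay could then be passed as the countdown when the next
task in the chain is added.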
>
> timezone from pytz is used for correct timezone calculations at the start of
> the day.. to make sure that the initial task starts at the correct time.
>  (At this point I could move the "from pytz import timezone" down to just
> the function that calculates the initial start delay.. but I avoid making
> changes to this code unless it's really necessary. timezone was used in more
> than one place in the past.. so it was imported at the top of the file.)
>
> On Thu, Sep 23, 2010 at 12:20 PM, Ikai Lan (Google) <[email protected]> wrote:
> > It sounds like these are impacting Python developers - I haven't heard any
> > Java app developers chime in yet.
>
> > What's the ballpark of the QPS you guys are doing? <1qps, 10qps, 100qps,
> > etc?
>
> > --
> > Ikai Lan
> > Developer Programs Engineer, Google App Engine
> > Blogger:http://googleappengine.blogspot.com
> > Reddit:http://www.reddit.com/r/appengine
> > Twitter:http://twitter.com/app_engine
>
> > On Thu, Sep 23, 2010 at 12:15 PM, Eli Jones <[email protected]> wrote:
>
> >> Also, a cold start for this task can take as little as 960 ms (where it
> >> shows the log detail: "This request caused a new process to be started for
> >> your application...").
>
> >> And, the task is doing a URLFetch, so.. I'm sure that is taking up a chunk
> >> of the running time.
>
> >> One of my imports is pytz (no django or anything like that though).. so
> >> that could also be of interest.
>
> >> On Thu, Sep 23, 2010 at 12:10 PM, Eli Jones <[email protected]> wrote:
>
> >>> My app is in Python.  The only thing it does between 6:30 AM PST and 1:00
> >>> PM PST is run a single chained task.  The task just runs once every 5
> >>> minutes.  (There are no other backend processes or users).
>
> >>> Yesterday at 11:00 AM PST and 11:02 AM PST, the task experienced the 10
> >>> second, 500 error (with 0 CPU_MS used).. then the task successfully
> >>> retried at 11:02:55 AM PST.
>
> >>> There was another one (not mentioned in your list of times for yesterday)
> >>> at 7:45 AM PST.  Failed after 10 seconds with a 500 error and 0 CPU_MS
> >>> used.  The task successfully retried at 7:47 AM PST (for some reason, the
> >>> tasks in my app wait 120 seconds for the first retry).
>
> >>> For some reason, the log doesn't contain the message that usually goes
> >>> with this error (there is no detailed log message):
>
> >>>  "Request was aborted after waiting too long to attempt to service your
> >>> request. This may happen sporadically when the App Engine serving cluster
> >>> is under unexpectedly high or uneven load. If you see this message
> >>> frequently, please contact the App Engine team."
>
> >>> But, it has the exact same fingerprint. (It just seems that the logs are
> >>> trimming the detailed messages faster than usual.)
>
> >>> This chained task runs 79 times each day (this would be 82 requests
> >>> counting these 3 errors).. so these errors accounted for 3.66 % of the
> >>> requests during the day yesterday.
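
The percentage above can be checked with a few lines of arithmetic. The
variable names are only illustrative; the counts (79 scheduled runs and
3 failed requests) are the ones quoted in the message.

```python
scheduled_runs = 79       # successful chained-task runs per day
failed_requests = 3       # 10-second, 500-error requests yesterday
total_requests = scheduled_runs + failed_requests

# Share of all requests that ended in the 10-second abort.
error_rate_pct = 100.0 * failed_requests / total_requests
print("%d of %d requests failed (%.2f%%)"
      % (failed_requests, total_requests, error_rate_pct))
```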
>
> >>> On Thu, Sep 23, 2010 at 11:19 AM, Ikai Lan (Google) <[email protected]> wrote:
>
> >>>> Hey guys,
>
> >>>> We've been tracking latency spikes and intermittent batches of aborted
> >>>> requests over the last 24 hour period. There were at least 3 periods of
> >>>> spikes yesterday:
>
> >>>> ~1100 Pacific (moderate, brief period of high)
> >>>> ~1300 Pacific (small)
> >>>> ~1530 Pacific (small)
>
> >>>> Community reports seem to indicate that these are related to loading
> >>>> requests. If you are being affected, can you report your environment,
> >>>> language and if these times correlate with your experiences in this
> >>>> thread? Are there periods that this particular report did not cover?
> >>>> Please provide measurements, times and application IDs.
>
> >>>> --
> >>>> Ikai Lan
> >>>> Developer Programs Engineer, Google App Engine
> >>>> Blogger:http://googleappengine.blogspot.com
> >>>> Reddit:http://www.reddit.com/r/appengine
> >>>> Twitter:http://twitter.com/app_engine
>
> >>>>  --
> >>>> You received this message because you are subscribed to the Google
> >>>> Groups "Google App Engine" group.
> >>>> To post to this group, send email to [email protected].
> >>>> To unsubscribe from this group, send email to
> >>>> [email protected]<google-appengine%2Bunsubscrib
> >>>>  [email protected]>
> >>>> .
> >>>> For more options, visit this group at
> >>>>http://groups.google.com/group/google-appengine?hl=en.

