Just an update on this matter (I'm working with Alexis):
- the application we are talking about is fp-dreamland.
- it's not a minor issue for us, as it directly impacts our users.
- we have been observing this for almost two weeks now.
- it's not a small application either: we are talking about more than
a hundred instances being killed at once.
- we are a paying customer, and growing fast; it would be really nice
to have some feedback from Google.

Thanks,

Olivier


On Aug 3, 4:45 am, Alexis <[email protected]> wrote:
> Some more details from searching the logs:
>
> The latest burst (we have about one every 1-2 hours) shows this:
> We had about 90 instances up.
> We got DeadlineExceeded errors for two minutes. During the first
> 30 seconds, 50 successive failing requests had no pending_ms. Then
> during the next 1m30s they nearly all had pending_ms, and one third
> were loading requests.
>
> So my guess is that it's not linked to the scheduler:
> something goes wrong and requests kill the instances during the
> first 30 seconds, then kill the remaining ones with increased
> latency until whatever is wrong gets resolved.
>
> On Aug 3, 10:20, Alexis <[email protected]> wrote:
>
> > We are using Python, and we are not using backends or the task
> > queue. But most of our requests fetch entities of the same kind, so
> > "kind locking" could be relevant.
> > I'll set up timeouts on datastore operations to see if this is what
> > is going wrong (it seems like a good idea to set these anyway; see
> > the sketch below).
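> >
> > A minimal sketch of those timeouts, assuming the Python db API's
> > create_rpc (the 5-second deadline and some_key are just
> > placeholders of mine):
> >
> >     from google.appengine.ext import db
> >
> >     # An RPC with an explicit deadline (in seconds) makes a slow
> >     # datastore call fail fast instead of consuming the whole
> >     # request deadline.
> >     rpc = db.create_rpc(deadline=5)
> >     entity = db.get(some_key, rpc=rpc)  # some_key: one of our keys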
>
> > It seems logical to have pending_ms values if the app has no
> > instances available instead of the hundred that used to serve the
> > traffic... however, requests should be able to complete in less
> > than 20 seconds. Nearly all the requests failing with this error
> > while the instances get killed off have this pending_ms value, and
> > don't have loading_request=1.
> > So I'm not sure whether these DeadlineExceeded errors come first or
> > as a consequence of the instances being killed: they are not
> > loading requests, and the warning in the logs says the error may
> > kill the instance, but we have these pending_ms values showing that
> > we already lack instances.
>
> > The traffic is very steady.
>
> > On Aug 3, 05:49, Robert Kluin <[email protected]> wrote:
>
> > > Interesting.  I've been seeing exactly the same strange behavior
> > > across several apps as well.  Suddenly instances will get killed
> > > and restarted in large batches.  This happens even with low
> > > request latency, small memory usage (similar to yours, < 50 MB),
> > > low error rates, and steady traffic.  I'm pretty convinced this
> > > is tied to the scheduler changes they've been making over the
> > > past few weeks.
>
> > > As a side note, the pending_ms value (9321) indicates that the
> > > request sat there waiting to be serviced for quite a long time.
> > > That doesn't leave much time to respond to the request.  Do you
> > > always see bursts of those when your instances get killed off?
> > > Are you getting big spikes in traffic when this happens, or is it
> > > steady?
>
> > > Robert
>
> > > On Tue, Aug 2, 2011 at 05:24, Alexis <[email protected]> wrote:
> > > > Hi,
>
> > > > I've got a similar issue: lots of DeadlineExceeded errors for a
> > > > few weeks now. I'm on the master-slave datastore too, but what I'm reporting
> > > > happened again one hour ago.
>
> > > > These errors happen in bursts, and I recently realized that they
> > > > were in fact shutting down ALL instances of the application.
> > > > (In the logs, I also have this warning: A serious problem was
> > > > encountered with the process that handled this request, causing it to
> > > > exit. This is likely to cause a new process to be used for the next
> > > > request to your application. If you see this message frequently, you
> > > > may be throwing exceptions during the initialization of your
> > > > application. (Error code 104))
> > > > This does not happen when an instance is spinning up, but after
> > > > several hours of uptime.
>
> > > > The trace I get along with the DeadlineExceeded errors shows
> > > > that it happens in the second phase: while the app is trying to
> > > > fall back gracefully after another error (one that does not
> > > > appear in the logs).
> > > > The reported request processing time can look like this: ms=100878
> > > > cpu_ms=385 api_cpu_ms=58 cpm_usd=0.010945 pending_ms=9321
>
> > > > Here is a screenshot of the admin page, showing that all
> > > > instances were shut down about 7 minutes ago, even resident ones:
> > > > http://dl.dropbox.com/u/497622/spinDown.png
>
> > > > The app does work in batches (although not always small ones),
> > > > but request processing time is usually good enough (see the
> > > > average latency on the screenshot).
> > > > I'm trying things on my test applications to see what could be
> > > > wrong, but it's still not clear to me and I'm running short of
> > > > ideas...
>
> > > > Any suggestions?
>
> > > > On Aug 2, 06:21, Robert Kluin <[email protected]> wrote:
> > > >> Hi Will,
> > > >>   I assume this is on the master-slave datastore?  I think there were
> > > >> a number of large latency spikes in both the datastore and serving
> > > >> last week.
>
> > > >>   Some things to try:
> > > >>     - do work in smaller batches.
> > > >>     - if you're doing work serially, do it in batches.
> > > >>     - use the async interfaces to run those batches in parallel
> > > >>       (see the sketch below).
>
> > > >>      http://code.google.com/appengine/docs/python/datastore/async.html
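> > > >>
> > > >> A minimal sketch of that pattern, assuming the db async calls
> > > >> from that page (key_batches is just a placeholder for your
> > > >> batched keys):
> > > >>
> > > >>     from google.appengine.ext import db
> > > >>
> > > >>     # Kick off one async get per batch; the RPCs run in parallel.
> > > >>     rpcs = [db.get_async(keys) for keys in key_batches]
> > > >>     # Each get_result() blocks only until its own RPC completes.
> > > >>     results = [rpc.get_result() for rpc in rpcs]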
>
> > > >> Robert
>
> > > >> On Fri, Jul 29, 2011 at 18:35, Will Reiher <[email protected]> wrote:
> > > >> > I'm trying to debug this issue but I keep hitting a wall.
> > > >> > I keep trying new things on one of my deployments to see if I
> > > >> > can get the number of errors down, but nothing seems to help.
> > > >> > It all started in the last week or so. I also have some
> > > >> > existing deployments that I have not changed that are seeing
> > > >> > these same errors, even though the code was never changed and
> > > >> > had been stable.
> > > >> > 1. This is happening on existing code that has not changed
> > > >> > recently.
> > > >> > 2. The DeadlineExceededErrors are coming up randomly and at
> > > >> > different points in the code.
> > > >> > 3. Latency is pretty high and App Engine seems to be spawning
> > > >> > a lot of new instances beyond my 3 included ones.
>
>

