Just an update on this matter (I'm working with Alexis):
- The application we are talking about is fp-dreamland.
- It's not a minor issue for us, as it directly impacts our users.
- We have been observing this for almost two weeks now.
- It's not a small application either: we are talking about more than a hundred instances being killed at once.
- We are paying customers, and growing fast; it would be really nice to have some feedback from Google people.
Thanks,
Olivier

On Aug 3, 4:45 am, Alexis <[email protected]> wrote:
> Some more details while searching the logs:
>
> The latest burst (we have about one every 1-2 hours) shows this:
> We had about 90 instances up.
> We got DeadlineExceeded errors for two minutes. During the first
> 30 seconds, 50 successive failing requests have no pending_ms. Then
> during the next 1m30s nearly all of them have pending_ms, and one
> third are loading requests.
>
> So my guess is that it's not linked to the scheduler:
> something goes wrong and requests are killing the instances during the
> first 30 seconds, then killing the remaining ones with increased
> latency until whatever went wrong is resolved.
>
> On Aug 3, 10:20, Alexis <[email protected]> wrote:
> > We are using Python, and we are not using backends or the taskqueue.
> > But most of our requests fetch entities of the same kind, so
> > "kind-locking" could be relevant.
> > I'll set up timeouts on datastore operations to see if this is what
> > is going wrong (it seems like a good idea to set these anyway).
> >
> > It seems logical to have pending_ms values if the app has no
> > instances available instead of the hundred that used to serve the
> > traffic... however, requests should be able to complete in less
> > than 20 sec. Nearly all the requests failing with this error, while
> > the instances get killed off, have this pending_ms value and don't
> > have loading_request=1.
> > So I'm not sure whether these DeadlineExceeded errors come first or
> > are a consequence of the instances being killed: they are not
> > loading requests, and the warning in the logs says it may kill the
> > instance, but these pending_ms values show that we already lack
> > instances.
> >
> > The traffic is very steady.
> >
> > On Aug 3, 05:49, Robert Kluin <[email protected]> wrote:
> > > Interesting. I've been seeing exactly the same strange behavior
> > > across several apps as well.
> > > Suddenly instances will get killed and restarted in large
> > > batches. This happens even with low request latency, small memory
> > > usage (similar to yours, < 50 MB), low error rates, and steady
> > > traffic. I'm pretty convinced this is tied to the scheduler
> > > changes they've been making over the past few weeks.
> > >
> > > As a side note, the pending_ms value (9321) indicates that the
> > > request sat there waiting to be serviced for quite a long time.
> > > That doesn't leave as much time to respond to the request. Do you
> > > always see bursts of those when your instances get killed off?
> > > Are you getting big spikes in traffic when this happens, or is it
> > > steady?
> > >
> > > Robert
> > >
> > > On Tue, Aug 2, 2011 at 05:24, Alexis <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > I've got a similar issue: lots of DeadlineExceeded errors for a
> > > > few weeks now. I'm on the master-slave datastore too, but what
> > > > I'm reporting happened again one hour ago.
> > > >
> > > > These errors happen in bursts, and I recently realized that they
> > > > were in fact shutting down ALL instances of the application.
> > > > (In the logs, I also have this warning: "A serious problem was
> > > > encountered with the process that handled this request, causing
> > > > it to exit. This is likely to cause a new process to be used for
> > > > the next request to your application. If you see this message
> > > > frequently, you may be throwing exceptions during the
> > > > initialization of your application. (Error code 104)")
> > > > This does not happen when an instance is spinning up, but after
> > > > several hours.
> > > >
> > > > The trace I get along with the DeadlineExceeded errors shows
> > > > that it happens in the second phase: while the app is trying to
> > > > fall back gracefully because of another error (one that does not
> > > > appear in the logs).
> > > > Reported request processing time can look like this: ms=100878
> > > > cpu_ms=385 api_cpu_ms=58 cpm_usd=0.010945 pending_ms=9321
> > > >
> > > > Here is a screenshot of the admin page, showing that all
> > > > instances were shut down about 7 minutes ago, even resident
> > > > ones:
> > > > http://dl.dropbox.com/u/497622/spinDown.png
> > > >
> > > > The app does work in batches (although not always small ones),
> > > > but request processing time is usually good enough (see the
> > > > average latency on the screenshot).
> > > > I'm trying things on my test applications to see what could be
> > > > wrong, but it's still not clear to me and I'm running short of
> > > > ideas...
> > > >
> > > > Any suggestions?
> > > >
> > > > On Aug 2, 06:21, Robert Kluin <[email protected]> wrote:
> > > > > Hi Will,
> > > > > I assume this is on the master-slave datastore? I think there
> > > > > were a number of large latency spikes in both the datastore
> > > > > and serving last week.
> > > > >
> > > > > Some things to try:
> > > > > - do work in smaller batches.
> > > > > - if you're doing work serially, do it in batches.
> > > > > - use the async interfaces to do work in batches, but in
> > > > > parallel.
> > > > >
> > > > > http://code.google.com/appengine/docs/python/datastore/async.html
> > > > >
> > > > > Robert
> > > > >
> > > > > On Fri, Jul 29, 2011 at 18:35, Will Reiher <[email protected]> wrote:
> > > > > > I'm trying to debug this issue but I keep hitting a wall.
> > > > > > I keep trying new things on one of my deployments to see if
> > > > > > I can get the number of errors down, but nothing seems to
> > > > > > help. It all started in the last week or so. I also have
> > > > > > some existing deployments that I have not changed which are
> > > > > > seeing these same errors, even though the code was never
> > > > > > changed and had been stable.
> > > > > > 1. This is happening on existing code that has not changed
> > > > > > recently.
> > > > > > 2.
> > > > > > The DeadlineExceededErrors are coming up randomly and at
> > > > > > different points in the code.
> > > > > > 3. Latency is pretty high, and App Engine seems to be
> > > > > > spawning a lot of new instances beyond my 3 included ones.
> > > > > >
> > > > > > --
> > > > > > You received this message because you are subscribed to the
> > > > > > Google Groups "Google App Engine" group.
> > > > > > To view this discussion on the web visit
> > > > > > https://groups.google.com/d/msg/google-appengine/-/g_C4iPzPeo4J.
> > > > > > To post to this group, send email to
> > > > > > [email protected].
> > > > > > To unsubscribe from this group, send email to
> > > > > > [email protected].
> > > > > > For more options, visit this group at
> > > > > > http://groups.google.com/group/google-appengine?hl=en.
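For reference, the timing fields quoted in the thread above (ms=100878 cpu_ms=385 api_cpu_ms=58 cpm_usd=0.010945 pending_ms=9321) can be pulled apart to see how long a request sat queued before an instance picked it up. A minimal sketch, assuming plain key=value log lines like the one above; `parse_request_stats` is our own helper for illustration, not an App Engine API:

```python
import re

# A request-stats line as it appears in the App Engine request logs.
LOG_LINE = ("ms=100878 cpu_ms=385 api_cpu_ms=58 "
            "cpm_usd=0.010945 pending_ms=9321")

def parse_request_stats(line):
    """Extract the key=value numeric fields from a request log line."""
    return {k: float(v) for k, v in re.findall(r"(\w+)=([\d.]+)", line)}

stats = parse_request_stats(LOG_LINE)

# ms is wall-clock time for the whole request; pending_ms is how long it
# waited in the pending queue before being dispatched to an instance.
served_ms = stats["ms"] - stats["pending_ms"]
```

With the numbers above, the request spent about 9.3 seconds queued before any handler code ran, leaving roughly 91.5 seconds of actual processing time, which is why a long pending_ms eats directly into the request deadline.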
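Robert's "smaller batches, in parallel" suggestion can also be sketched in plain Python. This is only an illustration of the pattern using a thread pool as a stand-in for the async datastore calls linked above; the names `process_all` and `do_batch` are ours, not from any SDK:

```python
from concurrent.futures import ThreadPoolExecutor

def chunks(items, batch_size):
    """Split a list of work items into fixed-size batches."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def process_all(items, do_batch, batch_size=20, workers=4):
    """Run do_batch over each batch, keeping several batches in flight.

    On App Engine the same shape is achieved with the async datastore
    interface: fire several RPCs for small batches, then collect their
    results, instead of one big serial operation.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(do_batch, batch)
                   for batch in chunks(items, batch_size)]
        return [f.result() for f in futures]

# Example: process 100 work items in batches of 20, 4 batches in flight.
results = process_all(list(range(100)), lambda batch: sum(batch))
```

Keeping each batch small bounds how much any single operation can contribute to the request deadline, while the parallelism keeps overall latency close to that of the slowest batch rather than the sum of all of them.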
