I just made a change to your app's configuration. Could you retry your repro? Did this help?
Alternatively, is there anything I could do to trigger the repro?

On Sat, Sep 10, 2011 at 11:02 PM, Pol <[email protected]> wrote:
> Still happening right now. I can repro pretty easily: schedule ~10
> image processing tasks in a queue, and 1 to 3 often fail due to that
> re-entrancy bug. They then retry automatically and usually go through
> after a few attempts.
>
> 5 instances idle / running, no load on the app: only 1 user currently.
>
> It really looks as if the scheduler, when getting the callback from the
> EC2 box, is trying to serve it from the same instance from which
> urlfetch() is running: there must be a bug where the scheduler thinks
> the instance is not busy or something.
>
> If you look at the distribution of requests, something's clearly wrong,
> as 1 instance gets the majority of the traffic.
>
> QPS*   Latency*   Requests   Errors   Age       Memory         Availability
> 0.000  0.0 ms     7          0        0:03:30   17.7 MBytes    Dynamic
> 0.000  0.0 ms     152        1        0:26:56   68.6 MBytes    Dynamic
> 0.000  0.0 ms     6          0        0:32:22   23.9 MBytes    Dynamic
> 0.000  0.0 ms     4          0        0:32:12   24.6 MBytes    Dynamic
> 0.000  0.0 ms     3          0        0:32:09   14.5 MBytes    Dynamic
>
> On Sep 9, 1:43 pm, Jon McAlister <[email protected]> wrote:
>> Ok, let me know if it recurs.
>>
>> On Wed, Sep 7, 2011 at 3:07 PM, Pol <[email protected]> wrote:
>> > Hi Jon,
>> >
>> > Great!
>> >
>> > everpix-alpha
>> >
>> > I still see what appears to be the bug now and then, but I don't have
>> > a reproducible case just waiting in a task queue anymore.
>> >
>> > On Sep 7, 1:19 pm, Jon McAlister <[email protected]> wrote:
>> >> Hi Pol,
>> >>
>> >> I think I have a change that will fix this, which should be available
>> >> once 1.5.4 is pushed out. What is the app-id here?
>> >>
>> >> On Mon, Sep 5, 2011 at 10:41 AM, Pol <[email protected]> wrote:
>> >> > Hi,
>> >> >
>> >> > Our app settings are as follows:
>> >> > - Python + HRD
>> >> > - Max Idle Instances: ( 2 )
>> >> > - Min Pending Latency: ( 100ms )
>> >> > As of right now, there are 3 instances alive.
>> >> >
>> >> > Without going into too much detail, we have GAE integrated with EC2,
>> >> > on which we run remote image processing tools. The tools are called
>> >> > directly using HTTP GETs from GAE and they return their results as
>> >> > JSON (with gzip content encoding).
>> >> >
>> >> > There are currently 3 tasks in the processing queue on GAE
>> >> > continuously failing: the urlfetch() calls to the EC2 tool reach the
>> >> > 10-second timeout and bail. What doesn't make sense is that calling
>> >> > the EC2 tool directly using curl from random machines succeeds in
>> >> > less than 1 second.
>> >> >
>> >> > But here's the trick: under certain circumstances, the EC2 tool will
>> >> > call back to GAE (a HEAD request that does a single db.get()) to
>> >> > check if the image has already been processed, and this happens for
>> >> > these 3 stuck tasks.
>> >> >
>> >> > When calling the EC2 tool from the command line with curl, we get
>> >> > the normal behavior:
>> >> > - EC2 tool retrieves image from arbitrary URL and computes a hash
>> >> > - EC2 tool does a HEAD call to GAE passing this hash to see if the
>> >> >   image has already been processed
>> >> > - If yes, return empty JSON
>> >> > - If no, process image and return full JSON
>> >> > This takes about 1 second.
>> >> > The exact same call done from GAE produces this behavior:
>> >> > - EC2 tool retrieves image from arbitrary URL and computes a hash
>> >> > - EC2 tool does a HEAD call to GAE passing this hash to see if the
>> >> >   image has already been processed
>> >> > -> HEAD call hangs  <--- RE-ENTRANCY / DEADLOCK BUG in GAE
>> >> > -> urlfetch() from GAE to EC2 reaches its 10-second deadline and
>> >> >    aborts  <-- BREAKS DEADLOCK
>> >> > -> HEAD call finally returns
>> >> > - EC2 tool completes normally
>> >> >
>> >> > GAE logs confirm the bug:
>> >> >
>> >> > HEAD call from EC2 / curl origin:
>> >> > 2011-09-05 10:19:52.502 /api/has_backing?
>> >> > bid=90e794f348ac76520076f5d02bc67c87c8a9185b8d36affe8377e73fe4820703
>> >> > 200 368ms 48cpu_ms 8api_cpu_ms 0kb Everpix-Processor
>> >> >
>> >> > HEAD call from EC2 / GAE origin:
>> >> > 2011-09-05 10:20:44.670 /api/has_backing?
>> >> > bid=90e794f348ac76520076f5d02bc67c87c8a9185b8d36affe8377e73fe4820703
>> >> > 200 9712ms 48cpu_ms 8api_cpu_ms 0kb Everpix-Processor
>> >> > 2011-09-05 10:20:44.547 /task/import_photo 500 10348ms 28cpu_ms
>> >> > 8api_cpu_ms 0kb AppEngine-Google; (+http://code.google.com/appengine)
>> >> > (see how the HEAD /api/has_backing call hangs for almost 10 seconds
>> >> > and only returns *after* /task/import_photo and its urlfetch() call to
>> >> > EC2 have aborted)
>> >> >
>> >> > And finally, AppStats confirms that it's not the head() Python
>> >> > execution itself that's hanging:
>> >> >
>> >> > (1) 2011-09-05 09:16:06.843 "HEAD /api/has_backing?
>> >> > bid=3bc4aeb08e01d3ba4bfab3282d2a198984a4fc1fab2ad9d1e8a39ee3cddd14da"
>> >> > 200 real=227ms cpu=24ms api=8ms overhead=0ms (1 RPC)
>> >> > (2) 2011-09-05 09:15:56.422 "POST /task/import_photo" 500 real=10002ms
>> >> > cpu=33ms api=8ms overhead=0ms (3 RPCs)
>> >> > (3) 2011-09-05 09:15:49.328 "HEAD /api/has_backing?
>> >> > bid=90e794f348ac76520076f5d02bc67c87c8a9185b8d36affe8377e73fe4820703"
>> >> > 200 real=297ms cpu=21ms api=8ms overhead=0ms (1 RPC)
>> >> >
>> >> > This issue is currently 100% reproducible.
>> >> >
>> >> > - Pol
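
For anyone trying to reproduce the shape of this deadlock, here is a minimal sketch of the two endpoints described above, written against the Python webapp framework. Every name in it (the Backing model, the handler classes, the EC2 URL) is an assumption made up for illustration; the actual Everpix code is not shown in this thread.

import urllib

from google.appengine.api import urlfetch
from google.appengine.ext import db, webapp


class Backing(db.Model):
    """Placeholder entity keyed by the image hash (bid)."""
    pass


class HasBackingHandler(webapp.RequestHandler):
    # HEAD /api/has_backing?bid=... -- the callback issued by the EC2 tool.
    # It performs a single db.get() (one RPC, a few ms of CPU in the
    # AppStats traces above).
    def head(self):
        bid = self.request.get('bid')
        entity = db.get(db.Key.from_path('Backing', bid)) if bid else None
        self.response.set_status(200 if entity else 404)


class ImportPhotoHandler(webapp.RequestHandler):
    # POST /task/import_photo -- the task queue task that calls the EC2 tool.
    # While this urlfetch() is pending, the EC2 tool issues the HEAD request
    # above; per the report, the scheduler routes that HEAD back to the same
    # instance that is blocked here, and only the 10-second urlfetch deadline
    # breaks the deadlock.
    def post(self):
        # The EC2 endpoint is an assumption for this sketch.
        url = ('http://ec2-tool.example.com/process?src=' +
               urllib.quote(self.request.get('src'), safe=''))
        result = urlfetch.fetch(url, method=urlfetch.GET, deadline=10)
        if result.status_code != 200:
            self.error(500)  # a non-200 response makes the task queue retry


app = webapp.WSGIApplication([
    ('/api/has_backing', HasBackingHandler),
    ('/task/import_photo', ImportPhotoHandler),
])

Under this sketch, the failure mode in the logs corresponds to ImportPhotoHandler.post() blocking in urlfetch.fetch() while HasBackingHandler.head() sits queued behind it on the same instance.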

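Similarly, a hedged sketch of the repro described at the top of the thread (scheduling ~10 image processing tasks in a queue); the queue name and photo URLs are made-up assumptions:

from google.appengine.api import taskqueue

# Enqueue ~10 image processing tasks at once; with the bug present, 1 to 3
# of them tend to hit the 10-second urlfetch deadline and get retried.
photo_urls = ['http://example.com/photos/%d.jpg' % i for i in range(10)]
for src in photo_urls:
    taskqueue.add(
        queue_name='processing',    # assumed queue name
        url='/task/import_photo',   # the task handler sketched above
        params={'src': src})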