Hi,

Our app settings are as follow:
- Python + HRD
- Max Idle Instances: ( 2 )
- Min Pending Latency: ( 100ms )
As of right now, there are 3 instances alive.

Without going too much into details, we have GAE integrated with EC2
on which we run remote image processing tools. The tools are called
directly using HTTP GETs from GAE and they returned their results as
JSON (with gzip content encoding).

There are currently 3 tasks in the processing queue on GAE
continuously failing: the urlfetch() calls to the EC2 tool reach the
10 seconds timeout and bail. What doesn't make sense is that calling
the EC2 tool directly using curl from random machines succeeds in less
than 1 second.

But here's the trick: under certain circumstances, the EC2 tool will
call back to GAE (HEAD request that does a single db.get()) to check
if the image has already been processed and this happens for these 3
stuck tasks.

If calling the EC2 tool from the command line and curl, we have the
normal behavior:
- EC2 tool retrieves image from arbitrary URL and computes a hash
- EC2 tool does a HEAD call to GAE passing this hash to see if image
has been already processed
  - If yes, return empty JSON
  - If no, process image and return full JSON
This takes about 1 second.

The exact same call done from GAE produces this behavior:
- EC2 tool retrieves image from arbitrary URL and computes a hash
- EC2 tool does a HEAD call to GAE passing this hash to see if image
has been already processed
  -> HEAD call hangs  <--- RE-ENTRANCY / DEADLOCK BUG in GAE
  -> urlfetch() from GAE to EC2 reaches 10 seconds deadline and
aborts  <-- BREAKS DEADLOCK
  -> HEAD call finally returns
- EC2 tool completes normally

GAE logs confirm the bug:

HEAD call from EC2 / curl origina
2011-09-05 10:19:52.502 /api/has_backing?
bid=90e794f348ac76520076f5d02bc67c87c8a9185b8d36affe8377e73fe4820703
200 368ms 48cpu_ms 8api_cpu_ms 0kb Everpix-Processor

HEAD call from EC2 / GAE origin
2011-09-05 10:20:44.670 /api/has_backing?
bid=90e794f348ac76520076f5d02bc67c87c8a9185b8d36affe8377e73fe4820703
200 9712ms 48cpu_ms 8api_cpu_ms 0kb Everpix-Processor
2011-09-05 10:20:44.547 /task/import_photo 500 10348ms 28cpu_ms
8api_cpu_ms 0kb AppEngine-Google; (+http://code.google.com/appengine)
(see how the HEAD /api/has_backing call hangs for almost 10 seconds
and only returns *after* /task/import_photo and its urlfetch() call to
EC2 has aborted)

And finally, AppStats confirms that it's not the head() Python
execution itself that's hanging:

(1) 2011-09-05 09:16:06.843 "HEAD /api/has_backing?
bid=3bc4aeb08e01d3ba4bfab3282d2a198984a4fc1fab2ad9d1e8a39ee3cddd14da"
200 real=227ms cpu=24ms api=8ms overhead=0ms (1 RPC)
(2) 2011-09-05 09:15:56.422 "POST /task/import_photo" 500 real=10002ms
cpu=33ms api=8ms overhead=0ms (3 RPCs)
(3) 2011-09-05 09:15:49.328 "HEAD /api/has_backing?
bid=90e794f348ac76520076f5d02bc67c87c8a9185b8d36affe8377e73fe4820703"
200 real=297ms cpu=21ms api=8ms overhead=0ms (1 RPC)


This issue is currently 100% reproducible.

- Pol

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.

Reply via email to