I just made a change to your app's configuration. Could you retry your repro? Did this help?
Alternatively, is there anything I could do to trigger the repro?

On Sat, Sep 10, 2011 at 11:02 PM, Pol <[email protected]> wrote:
> Still happening right now. I can repro pretty easily: schedule ~10
> image processing tasks in a queue, and 1 to 3 often fail due to that
> re-entrancy bug. They then retry automatically and usually go through
> after a few attempts.
>
> 5 instances idle / running, no load on the app: only 1 user currently.
>
> It really looks as if the scheduler, when getting the callback from the
> EC2 box, is trying to serve it from the same instance from which
> urlfetch() is running: there must be a bug where the scheduler thinks
> the instance is not busy or something.
>
> If you look at the distribution of requests, something's clearly wrong,
> as 1 instance gets the majority of the traffic.
>
> QPS*   Latency*   Requests   Errors   Age       Memory         Availability
> 0.000  0.0 ms     7          0        0:03:30   17.7 MBytes    Dynamic
> 0.000  0.0 ms     152        1        0:26:56   68.6 MBytes    Dynamic
> 0.000  0.0 ms     6          0        0:32:22   23.9 MBytes    Dynamic
> 0.000  0.0 ms     4          0        0:32:12   24.6 MBytes    Dynamic
> 0.000  0.0 ms     3          0        0:32:09   14.5 MBytes    Dynamic
>
> On Sep 9, 1:43 pm, Jon McAlister <[email protected]> wrote:
>> Ok, let me know if it recurs.
>>
>> On Wed, Sep 7, 2011 at 3:07 PM, Pol <[email protected]> wrote:
>> > Hi Jon,
>> >
>> > Great!
>> >
>> > everpix-alpha
>> >
>> > I still see what appears to be the bug now and then, but I don't have
>> > a reproducible case just waiting in a task queue anymore.
>> >
>> > On Sep 7, 1:19 pm, Jon McAlister <[email protected]> wrote:
>> >> Hi Pol,
>> >>
>> >> I think I have a change that will fix this, which should be available
>> >> once 1.5.4 is pushed out. What is the app-id here?
>> >>
>> >> On Mon, Sep 5, 2011 at 10:41 AM, Pol <[email protected]> wrote:
>> >> > Hi,
>> >> >
>> >> > Our app settings are as follows:
>> >> > - Python + HRD
>> >> > - Max Idle Instances: ( 2 )
>> >> > - Min Pending Latency: ( 100ms )
>> >> > As of right now, there are 3 instances alive.
>> >> >
>> >> > Without going into too much detail, we have GAE integrated with EC2,
>> >> > on which we run remote image processing tools. The tools are called
>> >> > directly using HTTP GETs from GAE and they return their results as
>> >> > JSON (with gzip content encoding).
>> >> >
>> >> > There are currently 3 tasks in the processing queue on GAE
>> >> > continuously failing: the urlfetch() calls to the EC2 tool reach the
>> >> > 10-second timeout and bail. What doesn't make sense is that calling
>> >> > the EC2 tool directly using curl from random machines succeeds in
>> >> > less than 1 second.
>> >> >
>> >> > But here's the trick: under certain circumstances, the EC2 tool will
>> >> > call back to GAE (a HEAD request that does a single db.get()) to
>> >> > check if the image has already been processed, and this happens for
>> >> > these 3 stuck tasks.
>> >> >
>> >> > When calling the EC2 tool from the command line with curl, we get
>> >> > the normal behavior:
>> >> > - EC2 tool retrieves image from arbitrary URL and computes a hash
>> >> > - EC2 tool does a HEAD call to GAE passing this hash to see if the
>> >> >   image has already been processed
>> >> > - If yes, return empty JSON
>> >> > - If no, process image and return full JSON
>> >> > This takes about 1 second.
>> >> > The exact same call done from GAE produces this behavior:
>> >> > - EC2 tool retrieves image from arbitrary URL and computes a hash
>> >> > - EC2 tool does a HEAD call to GAE passing this hash to see if the
>> >> >   image has already been processed
>> >> > -> HEAD call hangs  <--- RE-ENTRANCY / DEADLOCK BUG in GAE
>> >> > -> urlfetch() from GAE to EC2 reaches its 10-second deadline and
>> >> >    aborts  <-- BREAKS DEADLOCK
>> >> > -> HEAD call finally returns
>> >> > - EC2 tool completes normally
>> >> >
>> >> > GAE logs confirm the bug:
>> >> >
>> >> > HEAD call from EC2 / curl origin:
>> >> > 2011-09-05 10:19:52.502 /api/has_backing?
>> >> > bid=90e794f348ac76520076f5d02bc67c87c8a9185b8d36affe8377e73fe4820703
>> >> > 200 368ms 48cpu_ms 8api_cpu_ms 0kb Everpix-Processor
>> >> >
>> >> > HEAD call from EC2 / GAE origin:
>> >> > 2011-09-05 10:20:44.670 /api/has_backing?
>> >> > bid=90e794f348ac76520076f5d02bc67c87c8a9185b8d36affe8377e73fe4820703
>> >> > 200 9712ms 48cpu_ms 8api_cpu_ms 0kb Everpix-Processor
>> >> > 2011-09-05 10:20:44.547 /task/import_photo 500 10348ms 28cpu_ms
>> >> > 8api_cpu_ms 0kb AppEngine-Google; (+http://code.google.com/appengine)
>> >> > (see how the HEAD /api/has_backing call hangs for almost 10 seconds
>> >> > and only returns *after* /task/import_photo and its urlfetch() call to
>> >> > EC2 have aborted)
>> >> >
>> >> > And finally, AppStats confirms that it's not the head() Python
>> >> > execution itself that's hanging:
>> >> >
>> >> > (1) 2011-09-05 09:16:06.843 "HEAD /api/has_backing?
>> >> > bid=3bc4aeb08e01d3ba4bfab3282d2a198984a4fc1fab2ad9d1e8a39ee3cddd14da"
>> >> > 200 real=227ms cpu=24ms api=8ms overhead=0ms (1 RPC)
>> >> > (2) 2011-09-05 09:15:56.422 "POST /task/import_photo" 500 real=10002ms
>> >> > cpu=33ms api=8ms overhead=0ms (3 RPCs)
>> >> > (3) 2011-09-05 09:15:49.328 "HEAD /api/has_backing?
>> >> > bid=90e794f348ac76520076f5d02bc67c87c8a9185b8d36affe8377e73fe4820703"
>> >> > 200 real=297ms cpu=21ms api=8ms overhead=0ms (1 RPC)
>> >> >
>> >> > This issue is currently 100% reproducible.
>> >> >
>> >> > - Pol
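
For anyone trying to reproduce the shape of this deadlock, here is a minimal sketch of the two endpoints described above, written against the Python webapp framework. Every name in it (the Backing model, the handler classes, the EC2 URL) is an assumption made up for illustration; the actual Everpix code is not shown in this thread.

import urllib

from google.appengine.api import urlfetch
from google.appengine.ext import db, webapp


class Backing(db.Model):
    """Placeholder entity keyed by the image hash (bid)."""
    pass


class HasBackingHandler(webapp.RequestHandler):
    # HEAD /api/has_backing?bid=... -- the callback issued by the EC2 tool.
    # It performs a single db.get() (one RPC, a few ms of CPU in the
    # AppStats traces above).
    def head(self):
        bid = self.request.get('bid')
        entity = db.get(db.Key.from_path('Backing', bid)) if bid else None
        self.response.set_status(200 if entity else 404)


class ImportPhotoHandler(webapp.RequestHandler):
    # POST /task/import_photo -- the task queue task that calls the EC2 tool.
    # While this urlfetch() is pending, the EC2 tool issues the HEAD request
    # above; per the report, the scheduler routes that HEAD back to the same
    # instance that is blocked here, and only the 10-second urlfetch deadline
    # breaks the deadlock.
    def post(self):
        # The EC2 endpoint is an assumption for this sketch.
        url = ('http://ec2-tool.example.com/process?src=' +
               urllib.quote(self.request.get('src'), safe=''))
        result = urlfetch.fetch(url, method=urlfetch.GET, deadline=10)
        if result.status_code != 200:
            self.error(500)  # a non-200 response makes the task queue retry


app = webapp.WSGIApplication([
    ('/api/has_backing', HasBackingHandler),
    ('/task/import_photo', ImportPhotoHandler),
])

Under this sketch, the failure mode in the logs corresponds to ImportPhotoHandler.post() blocking in urlfetch.fetch() while HasBackingHandler.head() sits queued behind it on the same instance.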

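Similarly, a hedged sketch of the repro described at the top of the thread (scheduling ~10 image processing tasks in a queue); the queue name and photo URLs are made-up assumptions:

from google.appengine.api import taskqueue

# Enqueue ~10 image processing tasks at once; with the bug present, 1 to 3
# of them tend to hit the 10-second urlfetch deadline and get retried.
photo_urls = ['http://example.com/photos/%d.jpg' % i for i in range(10)]
for src in photo_urls:
    taskqueue.add(
        queue_name='processing',    # assumed queue name
        url='/task/import_photo',   # the task handler sketched above
        params={'src': src})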