Most people know how the GAE scheduler is supposed to work. The problem is, 
it does not work as advertised under a number of conditions. 
Now I am absolutely certain that there is at least one configuration out 
there where the GAE scheduler performs exactly as advertised, and that is 
in Google's test lab. 

Making users wait for an instance to start before their request can be 
served while idle instances are standing by is a failure condition. Sure, 
there might be some large scale cases where this behavior is both rare and 
desirable among hundreds of instances serving millions of requests. 
Unfortunately nobody starting out on GAE would ever get to that point when 
every 5th pageload takes 15+ seconds to complete. The combination of 
aggressive recycling of instances and serving user facing requests with 
cold instance startups is pathological when it takes as long as Java does 
to start. For simple websites with one request per page-load, this is much 
less of an issue. For interactive sites with multiple moving parts, it can 
be a nightmare.

As for gathering evidence:
Logs are tagged with what instance is serving them, so you can look at the 
request that takes 15s and see if it belonged to a resident instance.
You can also log when an instance is initializing, making identification of 
cold starts obvious.
You can also tell by the spinning loading indicator when you access your 
site that your request is waiting on a Java instance to start.
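
To make the "log when an instance is initializing" idea concrete, here is a rough sketch in Python (our stub language). It is not our production code: the ColdStartTracker name and the log message are made up for illustration, and it assumes module-level code runs once per instance. The same pattern applies in Java with a static initializer or a servlet init() method.

```python
import logging
import threading
import time


class ColdStartTracker:
    """Flags the first request handled by this process (hypothetical helper)."""

    def __init__(self):
        # Time this tracker was created, which is roughly instance load time
        # if the tracker is built at module import.
        self._loaded_at = time.time()
        self._lock = threading.Lock()
        self._served_first = False

    def note_request(self):
        """Return True only for the first request this instance serves."""
        with self._lock:
            if self._served_first:
                return False
            self._served_first = True
        # Easy to search for in the logs next to the instance ID.
        logging.warning(
            "COLD START: first request on this instance, %.1fs after load",
            time.time() - self._loaded_at,
        )
        return True


# Created once at import time, i.e. once per instance.
tracker = ColdStartTracker()
```

Call tracker.note_request() at the top of each request handler. Any 15s request whose log entry carries the COLD START line was a user-facing cold start, and you can check the instance ID on that entry against your resident instances.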

This is no longer an issue for our site. This bug does not exist on Go 
because Go instances cold start in under 100ms. We moved the complicated 
stuff off GAE to AWS rather than port it. We ported the simple stuff to Go 
from Java and now our app performs great. We have a Python stub to serve 
requests to the Full Text Search API and Cloud SQL, since those two 
services are not available on Go yet. Hopefully things improve with Go 1.1.

You have to roll with the punches.

On Friday, April 5, 2013 9:23:17 AM UTC-7, Vinny P wrote:
>
>
> On Friday, April 5, 2013 10:24:26 AM UTC-5, Jeff Schnitzer wrote:
>
>> I think what people are looking for is:  What combination of settings 
>> will prevent users from seeing cold starts, or at least decrease the 
>> probability down to 4 or 5 sigma?  There doesn't appear to be an 
>> answer, not even "run ten thousand resident instances". 
>>
>>
> Believe me, I sympathize with you, Aswath, Cesium, etc. I'm a big Java on 
> GAE guy (and so are my clients) and I frequently run head first into this 
> issue. And of course, Java makes it particularly hard due to its overhead. 
> The only thing we can do is to run continuous testing, repeated inspection 
> of logs, A/B test adjusting the latency slider and resident instances, etc. 
> To be fair, I work for a large company alongside a ton of very bright 
> people; we can afford to spend manpower to continually optimize our 
> applications. Not everybody can do the same.
>
> There's just no magic bullet here. As the famous Spolsky quote goes, "all 
> abstractions are leaky". Is there room to criticize GAE? Yes, obviously, 
> and I have a huge list of issues I would love fixed (Google, please give us 
> incoming email/XMPP on custom domains, thanks). But I also have clients 
> that complain to me about Heroku (especially after they were discovered 
> lying to users about how routing worked, that day was just nonstop 
> complaining), and other PaaS providers. If I had the solution to everything, I'd be 
> selling it to clients. The only thing that works is continuous monitoring.
>
>
