Thank you so much for this! We were also bitten hard by this problem, 
seemingly out of nowhere, over the last few weeks. The same request's 
latency varying from 200ms to over 10,000ms in some cases, with no obvious 
cause beyond mysterious gaps in the traces. It's a hard one to debug - there's 
no obvious way to tell that the instances are being CPU-starved.
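
The only thing that made the stalls visible for us was timing the pure 
"processing" phase of the handler ourselves and logging when it blows out. 
A rough sketch (the wrapper and threshold are our own invention, not an App 
Engine API):

    import logging
    import time

    SLOW_PROCESSING_MS = 500  # arbitrary; tune to your normal processing time

    def timed_processing(render_fn, *args, **kwargs):
        """Wrap the RPC-free 'processing' phase and log when it stalls.

        A big gap between wall-clock time here and the time Appstats
        attributes to RPCs was the only symptom of CPU starvation we
        could actually see.
        """
        start = time.time()
        result = render_fn(*args, **kwargs)
        elapsed_ms = (time.time() - start) * 1000.0
        if elapsed_ms > SLOW_PROCESSING_MS:
            logging.warning('processing took %.0fms with no RPCs in this span',
                            elapsed_ms)
        return result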

On Thursday, 30 June 2016 08:45:10 UTC+1, Trevor wrote:
>
> Hello ladies and gentlemen, I am here to hopefully draw on some collective 
> knowledge about App Engine and its intricacies. 
>
> For the last two weeks our (my company's) site has been experiencing very 
> odd latency issues, and having now tried about 7 different methods of 
> solving it, we are left exactly where we began: rising costs, with 
> performance much lower than before. 
>
>
> <https://lh3.googleusercontent.com/-THNMrlceFvM/V3TMvgEHvKI/AAAAAAAAQVk/FSZj42sZiCcRHRHmQuvSwM_mGYnbmsrYACLcB/s1600/search-console-latency.png>
>
>
> Essentially what happens is that roughly 50-60% of our requests are served 
> normally, while the remainder have these extremely long "pauses" in the 
> middle of the trace, basically during the "processing" phase of the 
> backend handling (after the datastore & memcache data has been retrieved). 
> Here is an example of a single page that, in the space of an hour, had 
> wildly different loading times for users. The vast majority of these loads 
> do the same thing: grab 3 things from memcache and spit out the html 
> retrieved from memcache. That's it... 
>
>
> <https://lh3.googleusercontent.com/-MAFuZARJRh4/V3TG1qjbBbI/AAAAAAAAQT0/xTVp4xN-VAkEpV4d12xzxf9q3UeUFjsQwCLcB/s1600/game-all-latencies.png>
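>
> For the curious, that hot path is essentially the following (a simplified 
> sketch; the real key names and fallback logic differ):
>
>     from google.appengine.api import memcache
>     import webapp2
>
>     class GamePage(webapp2.RequestHandler):
>         def get(self, game_id):
>             # Three memcache reads in one batched RPC, then write the
>             # cached HTML straight out. No datastore, no templating.
>             keys = ['game:%s:html' % game_id,
>                     'game:%s:meta' % game_id,
>                     'site:nav']
>             cached = memcache.get_multi(keys)
>             html = cached.get(keys[0])
>             if html is None:
>                 self.abort(404)  # real code falls back to the datastore
>             self.response.write(html)
>
> The long pauses show up after the memcache RPC has finished and before the 
> response is written, i.e. in plain Python code.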
>
> And here are some individual traces showing what is happening:
>
>
> <https://lh3.googleusercontent.com/-HA1u3SS8Y24/V3THA0kDf9I/AAAAAAAAQT8/eI6G77L6uOEU90Ahc0h2DTWVAiFO6zgtgCLcB/s1600/game-trace-1.png>
>
>
> <https://lh3.googleusercontent.com/-qUoJUEVI8Fk/V3THDM4PbZI/AAAAAAAAQUE/vidZSsIRjvAXjPzicg1CKH8s9RQ03g0XwCLcB/s1600/game-trace-2.png>
>
>
> <https://lh3.googleusercontent.com/-87zfyzosAU8/V3THJTYTUPI/AAAAAAAAQUU/DgjHOalOmcUipa_pGkY20F8eVwYa89m0QCLcB/s1600/game-trace-4.png>
>
>
> <https://lh3.googleusercontent.com/-8qUv1v0IJ-U/V3THGAOd_iI/AAAAAAAAQUM/mwtbr1Ona_o0wWX1k-abL4TJzxiiGC8HgCLcB/s1600/game-trace-3.png>
>
>
>
>
> So, essentially, these are the troubleshooting steps we took to figure out 
> what was going wrong: 
>
>    - Checked all deployed code over the week preceding and following the 
>    latency spike to ensure we hadn't let some truly horrendous, heavy code 
>    slip through the review process. Everything deployed around that period 
>    was rather light, backend/cms-based updates, hardly anything touching 
>    customer-facing requests. 
>    - Appstats, obviously. On the development server (and even unloaded 
>    test versions on the production server) such behavior is not seen. Didn't 
>    help. 
>    - Reducing unnecessary requests (figure 1) - We noticed some of our 
>    ajax-loaded content was creating 2-3 additional, separate requests per 
>    user-page-load, and so refactored the code to only call those things 
>    when absolutely necessary, and eliminated one altogether. For the most 
>    part, a page load now equals one request. This had no effect on the 
>    latency spikes. 
>    - Created a separate system that cut our backend task-based processing 
>    down by 90%, and thus reduced the average instance load significantly. 
>    This had the opposite effect and average latency actually climbed, I 
>    suspect because of the extensive memcache use with large chunks of data 
>    tracking what things should be updated by the backend tasks (roughly 
>    sketched below this list). 
>    - Separated the front end and tasks-back-end into modules/services so 
>    that frontend requests could have 100% instance attention. This had a 
>    small effect, but the spikes are still happening regularly (as seen in 
>    the above traces). 
>    - Played with max_idle_instances - This had a wonderful effect of 
>    *halving* our daily instance costs, with almost no effect on latency. 
>    When this is set to automatic, we get charged for a huge number of 
>    unused instances; it actually borders on ludicrous (figure 2). 
>    - Played with max_concurrent_requests (8->16->10), which only served to 
>    make the latency issues worse. The shape of the scaling block we have 
>    been tuning is sketched after figure 2 below. 
>    - Hours and hours poring over logs, traces, and dashboard graphs. 
>
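> To illustrate the memcache pattern I suspect is hurting us in the fourth 
> bullet: the tracking data lives in memcache as one large value, so every 
> update pays to read and rewrite the whole thing. Splitting it into small 
> shards is one fix we are considering (a rough sketch; all names are made 
> up, and real code would use memcache.Client().cas() to avoid the 
> read-modify-write race):
>
>     from google.appengine.api import memcache
>
>     SHARD_COUNT = 16  # spread the tracking set over many small values
>
>     def _shard_key(entity_key):
>         return 'dirty:%d' % (hash(entity_key) % SHARD_COUNT)
>
>     def mark_dirty(entity_key):
>         # Add to a small per-shard set instead of one huge blob, so each
>         # request (de)serializes only a fraction of the tracking data.
>         shard = _shard_key(entity_key)
>         dirty = memcache.get(shard) or set()
>         dirty.add(entity_key)
>         memcache.set(shard, dirty)
>
>     def drain_dirty():
>         # Backend task: collect and clear all shards in two batched RPCs.
>         keys = ['dirty:%d' % i for i in range(SHARD_COUNT)]
>         shards = memcache.get_multi(keys)
>         memcache.delete_multi(keys)
>         return set().union(*shards.values())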
>
> *Figure 1 (since the latency spike on June 5th, we have worked to reduce 
> meaningless requests through API calls or task queuing)*
>
>
> <https://lh3.googleusercontent.com/-IQpPZuv89HE/V3TLipqGFWI/AAAAAAAAQVI/Ud7Y3JURuUAZxkHIxrqTRO5fvwYfHkz7gCLcB/s1600/requests-trend.png>
>
> *Figure 2 (14:40 is when the auto-scaling setting was deployed)*
>
>
> <https://lh3.googleusercontent.com/-PXr6eiEz28E/V3TJ1cXXxcI/AAAAAAAAQUs/PRdTbdBb_uUYBH8AqQNGJE3xGvbw3J50ACLcB/s1600/instances-deploy-time.png>
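>
> For concreteness, this is the shape of the app.yaml scaling block we have 
> been tuning (illustrative values, not our exact settings):
>
>     # app.yaml (frontend service)
>     automatic_scaling:
>       max_idle_instances: 2       # capping this halved our daily instance bill
>       min_pending_latency: 100ms
>       max_pending_latency: 250ms
>       max_concurrent_requests: 8  # 16 and 10 both made latency worse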
>
> What I have noticed is that when the CPU cycles spike, *so does the 
> latency*. That would suggest our requests are being starved of CPU time; 
> however, now that we have deployed the instance auto-scaling (and are 
> paying for an average of around 8 instances vs 4-5 previously), the 
> latency has not improved, which confuses me considerably. 
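>
> One way to test the starvation theory directly (a rough sketch using the 
> Python Logs API; I haven't run this exact code, so treat the field 
> handling as an assumption) is to check how much of each slow request's 
> latency was spent waiting in the pending queue:
>
>     import time
>     from google.appengine.api import logservice
>
>     def starvation_sample(minutes=60, slow_seconds=1.0):
>         """Count slow requests where most of the latency was pending-queue
>         wait, i.e. the request sat waiting for a free instance."""
>         end = time.time()
>         start = end - minutes * 60
>         slow = starved = 0
>         for log in logservice.fetch(start_time=start, end_time=end):
>             if log.latency >= slow_seconds:
>                 slow += 1
>                 if log.pending_time > 0.5 * log.latency:
>                     starved += 1
>         return slow, starved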
>
> If it were all requests that had slowed down, our code would clearly need 
> optimization. If the rise in latency coincided with a change in our 
> frontend processing, it would make sense, but only very light backend 
> changes were deployed within +/- 2 days of the first latency spike 
> (figure 3). 
>
> *Figure 3 - Latency started rising on June 5th*
>
>
> <https://lh3.googleusercontent.com/-ndJeBJmfltI/V3TK6pWWA8I/AAAAAAAAQU8/DzLB6vH_EgwQG4FEvvejqw8w0uYDO4nVQCLcB/s1600/request-latency-spike-date.png>
>  
> Some other images that may assist in understanding the issue:
>
> CPU Cycles (today)
>
>
> <https://lh3.googleusercontent.com/-Tl8nvTKddjk/V3TMWg6v02I/AAAAAAAAQVQ/ZEDNMzpteYo5kh6X10Woo0BvqoZgrVk0ACLcB/s1600/cpu-cycles-deploy-time.png>
>
> CPU Cycles (2 month)
>
>
> <https://lh3.googleusercontent.com/-ynMp9wN2y0s/V3TMgrBeH3I/AAAAAAAAQVc/lLNMBrVvp_UUE2b_gT0txgG_tsnYoFpfACLcB/s1600/cycle-per-sec-no-module.png>
>
>
> Is there anyone out there who can proffer some advice on where to poke, 
> prod or peer next? I have only been using App Engine for 1.5 years now, but 
> this company has been on the platform for about 4 years without these kinds 
> of issues.
>
>
