Thanks Nick and troberti. Great suggestions indeed. In fact I had both: AppStats 
and Cloud Trace. AppStats records a bit more detail, but I found Cloud Trace a 
good compromise for the rough analysis. 

I investigated a bit more and found five major problems:
1 - The memcached layer, due to ndb's usage pattern, has a huge impact if it 
starts to fail or slow down (even when calls are batched). That should be 
expected, though. I need to detect when memcache is down or slow and probably 
disable its usage in the ndb context during that period.
2 - My app had a bug due to a Reader/Writer lock I wrote (I tested it with up 
to 50 threads, but it seems that testing is not always enough). I removed that 
part and the threads started to behave better (I will ask for suggestions on a 
usage pattern in a different thread).
3 - Even though I made heavy use of ndb multi/async calls and cached 
practically all the static data in memory (when the app starts), an F1 machine 
wasn't enough to sustain more than 3/4 concurrent threads while keeping the 
latency under 1/1.5 seconds. (Now it is time to understand why, because the 
handler code is quite simple, to be honest.)
4 - I was mixing some slow task queue work into the same instances, and because 
those tasks hold a thread for a while, they could slow down the other queued 
handlers a lot.
5 - I noticed that sometimes the datastore transactions and puts have big 
spikes (in terms of time). Normally they take 0.2/0.5 seconds, but sometimes 
they need 6 seconds to complete. I'm a bit worried about the entity ids: there 
are many entities (belonging to different ancestors) that use the same name, so 
maybe they end up on, or close to, the same datastore shard.
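For point 4, the fix I'm trying is to route the slow tasks to a dedicated 
module so they stop competing with the frontend handlers for threads. Something 
like this in queue.yaml (the queue and module names are just examples):

```yaml
queue:
- name: slow-work
  rate: 5/s
  target: worker   # run these tasks on a dedicated module, not the frontend
```

Then the slow work is enqueued with taskqueue.add(queue_name='slow-work', ...) 
and only the `worker` instances pay the latency cost.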
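For point 1, here is roughly the circuit breaker I have in mind — just a 
minimal sketch (the 50 ms threshold and 20-sample window are made-up numbers), 
with the actual ndb hook shown only as a comment:

```python
import collections
import threading

class MemcacheBreaker(object):
    """Tracks recent memcache call latencies and decides whether
    memcache should be used at all while it is slow or failing."""

    def __init__(self, threshold_s=0.05, window=20):
        self.threshold_s = threshold_s          # acceptable average latency
        self.samples = collections.deque(maxlen=window)
        self.lock = threading.Lock()

    def record(self, elapsed_s):
        """Record how long one memcache call took, in seconds."""
        with self.lock:
            self.samples.append(elapsed_s)

    def healthy(self):
        """True while the recent average latency is under the threshold."""
        with self.lock:
            if not self.samples:
                return True                     # no data yet: assume ok
            avg = sum(self.samples) / len(self.samples)
            return avg < self.threshold_s

# In a real app this would be wired into ndb, e.g.:
#   ndb.get_context().set_memcache_policy(lambda key: breaker.healthy())
```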
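For point 2, this is roughly the shape of lock I mean — a minimal sketch using 
threading.Condition, not my original buggy code. Note that it lets readers 
starve writers, which is exactly the kind of subtlety that bit me:

```python
import threading

class ReadWriteLock(object):
    """Simple readers-writer lock: many concurrent readers OR one writer.
    Caveat: writers can starve under a constant stream of readers."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:             # wait while a writer holds it
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:          # last reader wakes writers
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```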
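For point 5, if the identical key names really do land close together, one 
option I'm considering is prefixing each name with a short hash of itself so 
the keys spread across the key space. A sketch (the ndb key in the comment, 
with the `ScoreEntry` kind, is hypothetical):

```python
import hashlib

def spread_key_name(name):
    """Prefix a key name with a short hash of itself so entities that
    share the same logical name do not sort next to each other in the
    key space (which can concentrate writes on one shard)."""
    prefix = hashlib.md5(name.encode("utf-8")).hexdigest()[:4]
    return "%s-%s" % (prefix, name)

# With ndb this would be used when building the key, e.g.:
#   key = ndb.Key("Player", player_id, "ScoreEntry", spread_key_name(name))
```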


As usual, we learn by doing things :)

PS: troberti, are you from Firi Games? Then I have to thank you for both 
Phoenix HD and for your great BTree library! I use it a lot for my 
leaderboards ;)


> On 21 Jul 2015, at 17:51, Nick (Cloud Platform Support) <[email protected]> 
> wrote:
> 
> Great point, troberti. Very true!
> 
> On Tuesday, July 21, 2015 at 12:23:46 PM UTC+3, troberti wrote:
> Instead of using AppStats, you should just use Cloud Trace 
> <https://console.developers.google.com/project/_/clouddev/trace>. 
> Afaik, it does everything AppStats does, but without the overhead.
> 
> On Monday, July 20, 2015 at 10:16:35 PM UTC+2, Nick (Cloud Platform Support) 
> wrote:
> Hey Christian,
> 
> If you don't have it installed on your app it might be too late to diagnose 
> the past issue, but using appstats (java 
> <https://cloud.google.com/appengine/docs/java/tools/appstats> | python 
> <https://cloud.google.com/appengine/docs/python/tools/appstats>) you could 
> determine exactly where the latency occurred, whether it was the memcache 
> calls or the issue of instances (and instance class) themselves. Given that 
> the instance class increase seems to have solved the issue, it could have 
> even been a mix.
> 
> Best wishes,
> 
> Nick
> 
> On Friday, July 17, 2015 at 4:11:22 AM UTC-4, Cristian Marastoni wrote:
> Hi Nick,
> 
> thanks for the response.
> I know that the memcache response time is not covered by an SLA (at least for 
> best effort; dedicated probably has one), however yesterday it was very high 
> indeed. Because I'm using ndb with the default memcache policy, my servers 
> were very slow (at certain times it was terrible, for sure). 
> Honestly speaking, yesterday was the first day we had such a high load, 20/30 
> requests per second (about 10x compared to the days before). The front-end 
> tier was F1; probably those servers weren't able to cope with so many 
> concurrent requests (the module is configured to handle up to 10), and the 
> response times suffered. What really surprised me is that there were 10 
> servers up to handle that load (at most 2 concurrent requests per server).
> I'm still investigating; potentially there is something wrong in my code. 
> Today I changed the tier to F2 and that (obviously) is much better.
> 
> On Friday, July 17, 2015 at 1:24:20 AM UTC+2, Nick (Cloud Platform Support) 
> wrote:
> Hey Cristian,
> 
> This is understandable given that Memcache is shared using datacenter 
> resources across all apps. It's likely that apps in the same location as 
> yours also experienced the same latency for that period. You can read about 
> Memcache in the docs <https://cloud.google.com/appengine/docs/java/memcache/> 
> to find that there is not an SLA for response times. It's likely that the 
> response time was still significantly faster than Datastore, Cloud SQL, your 
> own MySQL instance, etc., so there's that to keep in mind.
> 
> Usually, if an issue is large enough to violate some SLA, or if many apps are 
> affected, a status alert will go out at status.cloud.google.com 
> <http://status.cloud.google.com/>, although in this case, the latency you saw 
> was not enough to trigger a detailed issue report. 
> 
> If you have any further questions about memcache, feel free to ask, and also 
> to consult the docs to learn more.
> 
> Regards,
> 
> Nick
> 
> On Thursday, July 16, 2015 at 3:40:31 AM UTC-4, Cristian Marastoni wrote:
> My app is experiencing latency problems (up to 5x the normal). I also noticed 
> many problems accessing and writing memcache.
> Are there any issues reported?

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/google-appengine/F9490EF4-3686-4111-87FE-4D51747FA646%40reludo.com.