Hi Gilles,

Thanks for digging up all these graphs. This is thorough work and truly excellent preparation, kudos!
I agree that we seem to be doing okay so far, indeed.

On Fri, May 02, 2014 at 11:38:29AM +0200, Gilles Dubuc wrote:
> Are these the right graphs to look at to see if these APIs aren't going
> nuts and won't take down the servers when we release to bigger wikis?
>
> On a related note, is this the right dashboard for API servers?
> http://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&m=cpu_report&s=by+name&c=API+application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4

Yes, these are the right graphs and the Ganglia cluster "API Application servers eqiad" is the one to monitor indeed. From that group, the most interesting metrics would be ap_rps (Apache Requests per Second) and ap_busy_workers:
http://ganglia.wikimedia.org/latest/stacked.php?m=ap_rps&c=API%20application%20servers%20eqiad&r=month&st=1399040178&host_regex=

The API is being served from the main Varnish clusters ("Text caches eqiad/esams/ulsfo"), so you wouldn't have a separate group to monitor there and the data will incorporate a lot of noise. The frontend.client_req and varnish.client_req metrics would be the ones to monitor there.

Also, considering the nature of the feature and the need for newly generated thumbs (AIUI), we should watch carefully:
a) Swift, in particular rps,
b) Imagescalers, in particular rps,
c) Front/back Upload Varnishes.

All these are at Ganglia's Media Storage view:
https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=Media+storage

Finally, this falls a bit outside of ops, but it ties closely to the discussion about cached API responses, as it involves the (lack of) CDN for these requests: we should assess the effect that the feature has on frontend metrics, NavigationTiming and such. Gdash has a dashboard with some high-level graphs for that, but I don't think they are going to be very useful. My understanding is that you were also doing some work in this area already, though?
I vaguely remember some NavTiming/EventLogging work from the Multimedia team, is this correct?

Thanks,
Faidon

_______________________________________________
Multimedia mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/multimedia
