First, were there any configuration changes shortly before the problem began? The first rule is "look for stupidity", as in a configuration error causing a self-DoS. Many of us have done that to ourselves, to our embarrassment. If not, go with Tim's suggestion and also look at Squid's logs. Are you getting requests, but no full sessions (SYN flood)?
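
One rough way to check for the "requests but no full session" pattern is to count half-open connections. What follows is only a minimal sketch, assuming a Linux host where `/proc/net/tcp` is readable; the function names `count_tcp_states` and `half_open_ratio` are made up for illustration, and it only looks at IPv4 (`/proc/net/tcp6` would need the same treatment).

```python
from collections import Counter
from pathlib import Path

# TCP state codes as they appear in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        # fields: sl, local_address, rem_address, st, ...
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


def half_open_ratio(counts):
    """Share of SYN_RECV among SYN_RECV + ESTABLISHED sockets (0.0 if none)."""
    syn_recv = counts.get("SYN_RECV", 0)
    total = syn_recv + counts.get("ESTABLISHED", 0)
    return syn_recv / total if total else 0.0


if __name__ == "__main__":
    proc_file = Path("/proc/net/tcp")  # assumption: Linux procfs is mounted
    if proc_file.exists():
        counts = count_tcp_states(proc_file.read_text())
        print(dict(counts))
        print("half-open share: %.2f" % half_open_ratio(counts))
```

A persistently large SYN_RECV count relative to ESTABLISHED during the slowdown would point toward a SYN flood; a small one suggests looking elsewhere. What counts as "large" depends on the site's normal traffic, so compare against a reading taken while things are healthy.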
I'm on your site periodically. It has normally run smoothly since you went with Linode, and the site is well behaved overall. However, it is one that could easily become the target of a script kiddie. So, do you have SYN cookies turned on?

I'm a sysadmin/netadmin, but my outlook is a bit colored by my information security experience. Hence, I always have to remind myself that stupidity is the most frequent cause of a problem and malicious intent the last. A large number of httpd daemons can be caused by PHP hits or by SYN flooding, in a non-Squid environment or even with a creatively crafted attack. The latter is beyond rare for anything that isn't very high profile (think Fortune 500 and government scale for that). But the most common cause is a burst of intra-cranial flatulence or a case of fat fingers.

So, look again at the logs and processes during the slug convention. Look from Tim's suggested perspective. If you can't find anything there, look closer at Squid and connection-based events.

When I worked for the US DoD, our most common DoS was self-inflicted, and that was in an environment where DDoS, general DoS and every other form of attack were incessantly being attempted. Two of those were inflicted by my own humble fat fingers. :/

On Apr 21, 2013, at 11:53 PM, Tim Starling wrote:

> On 21/04/13 05:29, David Gerard wrote:
>> So where would I start looking to work out what's going on?
>
> If there is any kind of site issue at WMF, I usually start with
> Ganglia. It does take some practise to be able to read it correctly,
> but it gives you information far more quickly than just about anything
> else. My notes on WMF incident response give some hints about how to
> use it, as well as discussing some other tools:
>
> https://wikitech.wikimedia.org/wiki/Incident_response
>
> If the problem seems to be downstream of MediaWiki, then profiling is
> usually the next thing to look at. Wikipedia has been using DIY
> profiling to diagnose site performance issues since it was on a single
> server.
>
>> * Sometimes it isn't, e.g. this afternoon when the site was running
>> like a slug and load average was 0.8 with nothing amiss in top.
>
> Processes in the "S" state do not contribute to the load average,
> whether or not users are waiting for them. For example, PHP may be
> waiting for Lucene. Try the section in the incident response notes
> under "slow backend service".
>
> -- Tim Starling
>
>
> _______________________________________________
> MediaWiki-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
