First, have there been any configuration changes shortly before the problem 
began? 
The first rule is "look for stupidity", as in a configuration error causing 
a self-inflicted DoS. Many of us have done that to ourselves, to our 
embarrassment.
If not, go with Tim's suggestion and also look at squid's logs. Are you getting 
connection attempts but no complete sessions (a SYN flood)?
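
One rough way to check for half-open connections is to count sockets per TCP 
state. This is only a sketch, assuming a Linux host that exposes /proc/net/tcp; 
the function name count_tcp_states is mine, not anything from squid or 
MediaWiki. A SYN_RECV count that stays far above normal would be worth a 
closer look.

```python
from collections import Counter

# TCP state codes as used in the "st" column of /proc/net/tcp (Linux).
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(lines):
    """Count sockets per TCP state from /proc/net/tcp-style lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip the header row and anything malformed.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        counts[TCP_STATES.get(fields[3].upper(), fields[3])] += 1
    return counts


if __name__ == "__main__":
    totals = Counter()
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as handle:
                totals += count_tcp_states(handle)
        except OSError:
            pass  # e.g. IPv6 disabled, or not a Linux host
    for state, number in totals.most_common():
        print(f"{state:12} {number}")
```

Run it a few times while the site is slow and compare against a quiet period; 
a single snapshot says little on its own.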

I'm on your site periodically. It normally runs smoothly since you went 
with Linode, and the site is well behaved overall.
However, it is one that could easily become the target of a script kiddie.
So, do you have SYN cookies turned on?
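
If you are not sure, the setting can be read straight from /proc. A minimal 
sketch, assuming a Linux kernel built with SYN cookie support; the helper name 
describe_syncookies is just for illustration, and the value 2 only exists on 
newer kernels.

```python
from pathlib import Path

# Standard location of the sysctl on Linux.
SYNCOOKIES_PATH = Path("/proc/sys/net/ipv4/tcp_syncookies")

MEANINGS = {
    "0": "disabled",
    "1": "enabled (used when the SYN backlog overflows)",
    "2": "enabled unconditionally",
}


def describe_syncookies(raw_value):
    """Translate the raw sysctl value into a human-readable description."""
    value = raw_value.strip()
    return MEANINGS.get(value, f"unrecognised value {value!r}")


if __name__ == "__main__":
    try:
        print("tcp_syncookies:", describe_syncookies(SYNCOOKIES_PATH.read_text()))
    except OSError as error:
        print("could not read", SYNCOOKIES_PATH, "-", error)
```

If it reports "disabled", running "sysctl -w net.ipv4.tcp_syncookies=1" as root 
turns it on until the next reboot.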

I'm a sysadmin/netadmin, but I'm a bit biased by my information security 
experience. Hence, I always have to remind myself that stupidity is the most 
frequent cause of a problem and malicious intent the least frequent. 

The large number of httpd daemons can come from PHP hits or from SYN flooding, 
the latter in a non-squid environment or, even behind squid, with a creatively 
crafted attack. Such an attack is beyond rare for anything short of very high 
profile (think Fortune 500 and government scale).
But the most common cause is a burst of intra-cranial flatulence or a case of 
fat fingers.
So, look again at the logs and processes during the slug convention. Look from 
Tim's suggested perspective. If you can't find anything there, look closer at 
squid and connection-based events. 
When I worked for the US DoD, our most common DoS was self-inflicted, and that 
was in an environment where we were incessantly facing DDoS, general DoS and 
every other form of attack.
Two of those were inflicted by my own humble fat fingers.  :/
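
For the "look at the processes" step, something like the following can show 
what the httpd children are actually doing when load average looks fine. It is 
a sketch only, assuming Linux /proc; parse_stat and count_states are my own 
names. Lots of httpd processes in state S (sleeping) would fit Tim's point 
about PHP waiting on a backend rather than burning CPU.

```python
import os
from collections import Counter


def parse_stat(stat_text):
    """Return (command, state) from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split on the last ')' rather than on whitespace.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    command = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return command, state


def count_states(proc_root="/proc"):
    """Count running processes grouped by (command, state)."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                counts[parse_stat(handle.read())] += 1
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-read, or unexpected format
    return counts


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        for (command, state), number in count_states().most_common(15):
            print(f"{number:5} {state} {command}")
```

On Linux only the R (running) and D (uninterruptible sleep) states feed the 
load average, so a pile of S-state httpd processes will not show up there.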

On Apr 21, 2013, at 11:53 PM, Tim Starling wrote:

> On 21/04/13 05:29, David Gerard wrote:
>> So where would I start looking to work out what's going on?
> 
> If there is any kind of site issue at WMF, I usually start with
> Ganglia. It does take some practise to be able to read it correctly,
> but it gives you information far more quickly than just about anything
> else. My notes on WMF incident response give some hints about how to
> use it, as well as discussing some other tools:
> 
> https://wikitech.wikimedia.org/wiki/Incident_response
> 
> If the problem seems to be downstream of MediaWiki, then profiling is
> usually the next thing to look at. Wikipedia has been using DIY
> profiling to diagnose site performance issues since it was on a single
> server.
> 
>> * Sometimes it isn't, e.g. this afternoon when the site was running
>> like a slug and load average was 0.8 with nothing amiss in top.
> 
> Processes in the "S" state do not contribute to the load average,
> whether or not users are waiting for them. For example, PHP may be
> waiting for Lucene. Try the section in the incident response notes
> under "slow backend service".
> 
> -- Tim Starling
> 
> 
> _______________________________________________
> MediaWiki-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l


