Re: web crawling problem

Alan Cox Wed, 16 Jan 2008 06:40:23 -0800

On Wed, 16 Jan 2008 08:24:29 -0500
"Levy, Alan" <[EMAIL PROTECTED]> wrote:

> I have a server that gets about 1M hits per day. Over the past week,
> this has exploded and the server is using about 80% of the cpu. We
> figure that someone is using a webcrawler since when we analyze the
> tomcat logs, there are thousands of hits from one ip address (every day
> it's a different ip address).

Firstly, if the web crawler isn't malicious but just misguided, you should
have a set of logged User-Agent strings for it that tell you what it is.
If the crawler is polite it will also fetch robots.txt now and then, which
you can use to regulate crawling rates and to say which parts of the site
may be crawled by robots.
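
As a rough illustration of the robots.txt side, here is a minimal Python
sketch that checks what a set of rules would tell a polite crawler. The
paths, the 10 second delay and the "ExampleBot" name are all made up for
the example. Crawl-delay is a non-standard extension that some crawlers
honour and others ignore, so treat the rate limit as a request rather than
a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the paths and the delay are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /search/
Disallow: /reports/
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text the way a well-behaved crawler would."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# "ExampleBot" is a placeholder user agent, not a real crawler.
print(rules.can_fetch("ExampleBot", "/search/foo"))  # disallowed path
print(rules.can_fetch("ExampleBot", "/index.html"))  # not covered by a rule
print(rules.crawl_delay("ExampleBot"))               # requested delay in seconds
```

Running the rules through a parser like this before deploying them is a
cheap way to catch a typo that would otherwise block or allow more than
you meant.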

That may be all that is needed and, with luck, will also keep any future
indexing problems under control.
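
To pull the user agent strings out of the logs in the first place,
something like the following Python sketch would do. It assumes the access
log is written in the "combined" format (client address first, quoted
user agent last); the "common" format does not record the user agent at
all. The sample lines, addresses and agent names are invented.

```python
import re
from collections import Counter

# Assumes combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def top_clients(lines, limit=5):
    """Count hits per (client address, user agent), busiest first."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # silently skip lines in some other format
            counts[(match["ip"], match["agent"])] += 1
    return counts.most_common(limit)


# Made-up sample lines using documentation-only addresses.
SAMPLE = [
    '203.0.113.7 - - [16/Jan/2008:08:00:01 -0500] "GET /a HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.7 - - [16/Jan/2008:08:00:02 -0500] "GET /b HTTP/1.1" 200 734 "-" "ExampleBot/1.0"',
    '192.0.2.44 - - [16/Jan/2008:08:00:03 -0500] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

for (ip, agent), hits in top_clients(SAMPLE):
    print(hits, ip, agent)
```

For a real log, pass an open file object to top_clients() instead of
SAMPLE. Since the busy address changes every day, the user agent column
is the more useful thing to look at.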

----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: INFO LINUX-390 or visit
http://www.marist.edu/htbin/wlvindex?LINUX-390
