On Wed, 16 Jan 2008 08:24:29 -0500 "Levy, Alan" <[EMAIL PROTECTED]> wrote:
> I have a server that gets about 1M hits per day. Over the past week,
> this has exploded and the server is using about 80% of the cpu. We
> figure that someone is using a webcrawler since when we analyze the
> tomcat logs, there are thousands of hits from one ip address (every day
> it's a different ip address).

Firstly, if the webcrawler isn't malicious but just misguided, you
should have a set of logged user agent strings for it that tell you
what it is.

If the crawler is polite it will also fetch robots.txt now and then.
You can use that file to regulate the crawling rate and to say which
parts of the site robots may crawl. With luck that is all that is
needed, and it will also head off future indexing problems.

----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: INFO LINUX-390 or visit
http://www.marist.edu/htbin/wlvindex?LINUX-390
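
As a rough illustration of the robots.txt suggestion above, here is a
minimal Python sketch of how a polite crawler would read such a file.
The rules, the paths (/search/, /reports/), the 10 second delay, the
"ExampleBot" name and the www.example.com host are all made up for the
example and are not taken from the original poster's site. Note also
that Crawl-delay is a widely used extension rather than part of the
core robots.txt rules, so some crawlers ignore it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; paths and delay are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /search/
Disallow: /reports/
"""


def load_rules(text):
    """Parse robots.txt text the way a well-behaved crawler would."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# A polite crawler asks before fetching each URL.
print(rules.can_fetch("ExampleBot", "http://www.example.com/search/?q=x"))
print(rules.can_fetch("ExampleBot", "http://www.example.com/index.html"))

# Requested pause between requests, in seconds (None if not specified).
print(rules.crawl_delay("ExampleBot"))
```

This only helps against crawlers that honour robots.txt. One that
ignores the file has to be dealt with some other way.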
