On Mon, Jul 16, 2012 at 03:48:52PM -0700, Keith Lofstrom wrote:
> My websites have gigabytes of content, but very little of it is
> updated frequently.  Some search engine spiders, like baidu.com,
> (the big Chinese one) seem to crawl my sites continuously,
> bringing apache to a standstill.  Just before I restarted apache,
> netstat said baidu had 70 (out of 224) ports open, some talking
> to gigabyte files.  Some claim that baidu crawls the web every
> 15 minutes or so.

I believe this is the best reference: http://www.robotstxt.org/
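For example, a robots.txt at the site root that throttles Baidu's crawler
might look like this (Baiduspider is Baidu's published user-agent token;
Crawl-delay is a nonstandard extension that not every crawler honors, so
treat this as a polite request rather than a guarantee):

User-agent: Baiduspider
Crawl-delay: 60

If you want it gone entirely, "Disallow: /" under the same User-agent
line asks it not to fetch anything.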
If Baidu doesn't honor your robots.txt, you can also do some trickery with
mod_rewrite rules in your .htaccess, keyed on the User-Agent string:

RewriteEngine On
# If the user agent contains "Baidu" (case-insensitive)...
RewriteCond %{HTTP_USER_AGENT} Baidu [NC]
# ...and the request is not already for the homepage...
RewriteCond %{REQUEST_URI} !^/$
# ...then send it there instead.
RewriteRule .* / [L]
# Alternate: return 403 Forbidden.  Use this *instead of* the rule
# above -- RewriteCond lines apply only to the single rule that follows.
# RewriteRule .* - [F]

(Note that Apache config doesn't allow comments at the end of a line;
they have to be on lines of their own, as above.)

Note: This is mostly from memory and not tested.  See
http://httpd.apache.org/docs/current/mod/mod_rewrite.html
_______________________________________________
PLUG mailing list
[email protected]
http://lists.pdxlinux.org/mailman/listinfo/plug