On Mon, Jul 16, 2012 at 03:48:52PM -0700, Keith Lofstrom wrote:
> My websites have gigabytes of content, but very little of it is
> updated frequently. Some search engine spiders, like baidu.com
> (the big Chinese one), seem to crawl my sites continuously,
> bringing apache to a standstill. Just before I restarted apache,
> netstat said baidu had 70 (out of 224) ports open, some talking
> to gigabyte files. Some claim that baidu crawls the web every
> 15 minutes or so.
I believe this is the best reference: http://www.robotstxt.org/
(a minimal robots.txt sketch follows below).

If Baidu doesn't follow your robots.txt, you can also do some
trickery with mod_rewrite rules in your .htaccess, based on the
user-agent string. Note that Apache doesn't accept comments on
the same line as a directive, so each comment goes on its own line:

RewriteEngine On

# User agent is Baidu?
RewriteCond %{HTTP_USER_AGENT} Baidu [NC]
# Not on the homepage?
RewriteCond %{REQUEST_URI} !^/$
# Then go there!
RewriteRule .* / [L]

# Alternate: return 403 Forbidden
# RewriteRule .* - [F]

Note: This is mostly from memory and not tested. See
http://httpd.apache.org/docs/current/mod/mod_rewrite.html
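For the robots.txt route, here is a minimal sketch. It assumes
Baidu's crawler identifies itself as "Baiduspider" (the user-agent
name Baidu publishes) and actually honors the file; Crawl-delay is
a nonstandard extension that only some crawlers respect:

# /robots.txt at the site root -- keep Baidu's crawler out entirely
User-agent: Baiduspider
Disallow: /

# Or throttle instead of blocking (nonstandard; seconds between fetches):
# User-agent: Baiduspider
# Crawl-delay: 60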

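To sanity-check the rewrite rules, one approach is to impersonate
the crawler with curl; the host and path below are placeholders,
and any user-agent string containing "Baidu" should trigger the rule:

# -A sets the User-Agent header, -I fetches headers only
curl -I -A "Baiduspider/2.0" http://www.example.com/some/deep/page.html

With the internal rewrite above you should get the homepage back
(a 200 for /); with the [F] variant you'd see 403 Forbidden instead.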