My websites have gigabytes of content, but very little of it is updated frequently. Some search engine spiders, like baidu.com's (the big Chinese one), seem to crawl my sites continuously, bringing Apache to a standstill. Just before I restarted Apache, netstat showed Baidu holding 70 of my 224 open connections, some of them fetching gigabyte-sized files. Some claim that Baidu re-crawls the web every 15 minutes or so.
robots.txt is simpleminded - it appears that I can only allow or disallow Baidu's spider, rather than throttle it back to one scan per week or so. Am I missing something? Is there a clever but not-too-complicated way to tell spiders to calm down a bit?

Alternately, are there any tools or standards forthcoming that provide a "recent changes" file for spiders? It would save everyone a lot of bandwidth if spiders could check such a file often, but otherwise not re-crawl the site for the same old stuff.

I've disallowed Baidu in robots.txt for now, and may disallow a few other spiders until my site responsiveness improves. It is a virtual server without much horsepower. While I would like my occasional Chinese users to search it with Baidu, that will do them no good if the site itself is too sluggish to use.

Keith

-- 
Keith Lofstrom          [email protected]         Voice (503)-520-1993
KLIC --- Keith Lofstrom Integrated Circuits --- "Your Ideas in Silicon"
Design Contracting in Bipolar and CMOS - Analog, Digital, and Scan ICs
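
P.S. For the record, the disallow I put in robots.txt is roughly the sketch below. I've also seen a non-standard "Crawl-delay" directive suggested (a minimum number of seconds between requests); some spiders reportedly honor it, but I don't know whether Baiduspider does, so treat that part as a guess rather than a fix:

    # Block Baidu's spider entirely (its user-agent string is "Baiduspider")
    User-agent: Baiduspider
    Disallow: /

    # Everyone else may crawl everything, but spiders that honor the
    # non-standard Crawl-delay extension should wait 30 seconds between hits
    User-agent: *
    Disallow:
    Crawl-delay: 30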
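
P.P.S. The closest thing I've found to a "recent changes" file is the Sitemaps protocol (sitemaps.org): you list your URLs in a sitemap.xml with a <lastmod> date for each, and point spiders at it with a "Sitemap:" line in robots.txt. Whether Baidu's spider actually pays attention to <lastmod> I can't say. A minimal file looks something like this (the URLs and dates here are made-up placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- one <url> entry per page; <lastmod> says when it last changed,
           and <changefreq> is an advisory hint about how often it changes -->
      <url>
        <loc>http://www.example.com/</loc>
        <lastmod>2009-01-15</lastmod>
        <changefreq>weekly</changefreq>
      </url>
      <url>
        <loc>http://www.example.com/big-archive/</loc>
        <lastmod>2008-06-01</lastmod>
        <changefreq>yearly</changefreq>
      </url>
    </urlset>

Then a line like "Sitemap: http://www.example.com/sitemap.xml" in robots.txt tells crawlers where to find it.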
