My websites have gigabytes of content, but very little of it is
updated frequently.  Some search engine spiders, like baidu.com
(the big Chinese one), seem to crawl my sites continuously,
bringing Apache to a standstill.  Just before I restarted Apache,
netstat said baidu had 70 (out of 224) ports open, some of them
transferring gigabyte-sized files.  Some claim that baidu crawls
the web every
15 minutes or so.

robots.txt is simpleminded - it appears that I can only allow or
disallow baidu's spider, rather than throttle it back to one scan
per week or so.  Am I missing something?  Is there a clever but
not-too-complicated way to tell spiders to calm down a bit?
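
The closest thing I've found so far is the non-standard Crawl-delay
directive; some crawlers honor it, but I haven't confirmed that
Baiduspider does, so treat this as a sketch rather than a known fix:

    # Ask Baidu's spider to wait 60 seconds between requests.
    # Crawl-delay is a non-standard extension; support varies by
    # crawler, and I don't know whether Baiduspider honors it.
    User-agent: Baiduspider
    Crawl-delay: 60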

Alternatively, are there any tools or standards forthcoming to
provide a "recent changes" file to the spiders?  It would save
everyone a lot of bandwidth if spiders could check such a file
often, but otherwise not crawl the site for the same old stuff.
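
The Sitemaps protocol may already be close to this - its <lastmod>
field is supposed to tell crawlers when a page last changed, though
whether a given spider actually uses it to skip unchanged URLs is up
to that spider.  A minimal entry (placeholder URL and date) looks
roughly like:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <!-- lastmod is the page's last-modified date; crawlers may
             use it to avoid re-fetching unchanged content -->
        <loc>http://www.example.com/big-archive/index.html</loc>
        <lastmod>2008-06-01</lastmod>
      </url>
    </urlset>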

I've disallowed baidu in robots.txt for now, and may disallow
a few other spiders until my site's responsiveness improves.  It
is a virtual server without much horsepower.  While I would like
my occasional Chinese users to search it using baidu, it will
do them no good if the site itself is too sluggish to use.
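
For the record, the blocking stanza amounts to something like this
(Baiduspider is, as far as I know, the user-agent string Baidu's
crawler reports):

    # Block Baidu's crawler entirely, for now
    User-agent: Baiduspider
    Disallow: /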

Keith

-- 
Keith Lofstrom          [email protected]         Voice (503)-520-1993
KLIC --- Keith Lofstrom Integrated Circuits --- "Your Ideas in Silicon"
Design Contracting in Bipolar and CMOS - Analog, Digital, and Scan ICs
