On Wed, Feb 8, 2012 at 10:46 AM, Paul Hartman <paul.hartman+gen...@gmail.com> wrote:
> On Wed, Feb 8, 2012 at 2:55 AM, Pandu Poluan <pa...@poluan.info> wrote:
>>
>> On Jan 27, 2012 11:18 PM, "Paul Hartman" <paul.hartman+gen...@gmail.com>
>> wrote:
>>>
>>
>> ---- >8 snippage
>>
>>>
>>> BTW, the Baidu spider hits my site more than all of the others combined...
>>>
>>
>> Somewhat anecdotal, and definitely veering way off-topic, but Baidu was the
>> reason why my company decided to change our webhosting company: its
>> spidering brought our previous webhost to its knees...
>>
>> Rgds,
>
> I wonder if the Baidu crawler honors the Crawl-delay directive in robots.txt?
>
> Or I wonder if Baidu crawler IPs need to be covered by firewall tarpit rules.
> ;)
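[For reference: Crawl-delay is a non-standard but widely recognized robots.txt extension. A minimal sketch of how it looks and how a well-behaved client reads it, using Python's stdlib robotparser — the "Baiduspider" user-agent string, the 30-second delay, and the paths here are just illustrative, not anything Baidu documents:]

```python
from urllib import robotparser

# Hypothetical robots.txt asking one crawler to pause between requests.
ROBOTS_TXT = """\
User-agent: Baiduspider
Crawl-delay: 30
Disallow: /w/

User-agent: *
Disallow: /w/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant crawler would read the delay and wait that long per request.
print(rp.crawl_delay("Baiduspider"))                 # -> 30
print(rp.can_fetch("Baiduspider", "/w/index.php"))   # -> False
print(rp.can_fetch("Baiduspider", "/wiki/Main_Page")) # -> True
```

Of course, the directive only helps if the crawler actually honors it — which is exactly the open question.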
I don't remember if it respects Crawl-delay, but it respects forbidden
paths, etc. I've never been DDoS'd by Baidu's crawlers, but I did get
DDoS'd by Yahoo's a number of times. The solution turned out to be
disallowing access to expensive-to-render pages. If you're using
MediaWiki with prettified URLs, this works great:

User-agent: *
Allow: /mw/images/
Allow: /mw/skins/
Allow: /mw/title.png
Disallow: /w/
Disallow: /mw/
Disallow: /wiki/Special:

-- 
:wq
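[On the firewall-tarpit idea upthread: a sketch of what that could look like with iptables. The 203.0.113.0/24 netblock is a documentation placeholder, not a real crawler range, and the thresholds are arbitrary — this is an illustration, not a recommended ruleset:]

```shell
# Rate-limit HTTP from a hypothetical crawler netblock: allow up to
# 10 new connections per minute per source IP, drop the rest.
iptables -A INPUT -p tcp --dport 80 -s 203.0.113.0/24 \
    -m conntrack --ctstate NEW \
    -m hashlimit --hashlimit-name crawlers \
    --hashlimit-above 10/minute --hashlimit-mode srcip \
    -j DROP

# A true tarpit (holding connections open to waste the crawler's
# resources) needs the TARPIT target from xtables-addons:
# iptables -A INPUT -p tcp --dport 80 -s 203.0.113.0/24 -j TARPIT
```

Dropping is usually kinder than tarpitting, and unlike robots.txt it doesn't rely on the crawler's cooperation at all.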