On Wed, Feb 8, 2012 at 10:46 AM, Paul Hartman <[email protected]> wrote:
> On Wed, Feb 8, 2012 at 2:55 AM, Pandu Poluan <[email protected]> wrote:
>>
>> On Jan 27, 2012 11:18 PM, "Paul Hartman" <[email protected]>
>> wrote:
>>>
>>
>> ---- >8 snippage
>>
>>> BTW, the Baidu spider hits my site more than all of the others
>>> combined...
>>
>> Somewhat anecdotal, and definitely veering way off-topic, but Baidu
>> was the reason why my company decided to change our webhosting
>> company: its spidering brought our previous webhosting to its knees...
>>
>> Rgds,
>
> I wonder if the Baidu crawler honors the Crawl-delay directive in
> robots.txt?
>
> Or I wonder if the Baidu crawler IPs need to be covered by firewall
> tarpit rules. ;)
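(For reference: Crawl-delay is a non-standard robots.txt extension, so
honoring it is entirely up to the crawler -- which is exactly the open
question above. A fragment targeting Baidu's crawler would look like
this; "Baiduspider" is Baidu's documented user-agent token, and the
10-second delay is an arbitrary example value:)

```
# Hypothetical fragment: throttle one crawler without blocking it.
User-agent: Baiduspider
Crawl-delay: 10
```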
I don't remember if it respects Crawl-delay, but it does respect
forbidden paths, etc. I've never been DDoSed by Baidu's crawlers, but I
was DDoSed by Yahoo's a number of times. It turned out the solution was
to disallow access to expensive-to-render pages. If you're using
MediaWiki with prettified URLs, this works great:

User-agent: *
Allow: /mw/images/
Allow: /mw/skins/
Allow: /mw/title.png
Disallow: /w/
Disallow: /mw/
Disallow: /wiki/Special:

-- 
:wq
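(A quick way to sanity-check rules like these before deploying them is
Python's stdlib urllib.robotparser, which also exposes Crawl-delay
values via crawl_delay() on Python 3.6+. The sketch below uses a
trimmed-down version of the MediaWiki-style rules above, with a
Crawl-delay line added purely for illustration:)

```python
from urllib import robotparser

# Parse a robots.txt body directly instead of fetching it over HTTP.
# Rules are matched in order, so the Allow line must precede the
# broader Disallow: /mw/ rule to carve out an exception.
rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Crawl-delay: 10
Allow: /mw/images/
Disallow: /w/
Disallow: /mw/
Disallow: /wiki/Special:
""".splitlines())

# Expensive Special: pages are blocked for every user-agent...
print(rp.can_fetch("Baiduspider", "/wiki/Special:RecentChanges"))  # False
# ...while the carved-out images path stays fetchable.
print(rp.can_fetch("Baiduspider", "/mw/images/title.png"))         # True
# Crawl-delay for a crawler covered by the "*" group:
print(rp.crawl_delay("Baiduspider"))                               # 10
```

(Whether a given crawler actually obeys any of this is, of course, a
separate question -- robotparser only tells you what the file says.)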

