On Wed, Feb 8, 2012 at 10:46 AM, Paul Hartman
<paul.hartman+gen...@gmail.com> wrote:
> On Wed, Feb 8, 2012 at 2:55 AM, Pandu Poluan <pa...@poluan.info> wrote:
>>
>> On Jan 27, 2012 11:18 PM, "Paul Hartman" <paul.hartman+gen...@gmail.com>
>> wrote:
>>>
>>
>> ---- >8 snippage
>>
>>>
>>> BTW, the Baidu spider hits my site more than all of the others combined...
>>>
>>
>> Somewhat anecdotal, and definitely veering way off-topic, but Baidu was the
>> reason why my company decided to change our webhosting company: Its
>> spidering brought our previous webhosting to its knees...
>>
>> Rgds,
>
> I wonder if Baidu crawler honors the Crawl-delay directive in robots.txt?
>
> Or I wonder if Baidu crawler IPs need to be covered by firewall tarpit rules.
> ;)

I don't remember whether it respects Crawl-delay, but it does respect
disallowed paths, etc. I've never been DDoS'd by Baidu's crawlers, but I
was DDoS'd by Yahoo's a number of times. The solution turned out to be
disallowing access to expensive-to-render pages. If you're using
MediaWiki with prettified URLs, this works great:

User-agent: *
Allow: /mw/images/
Allow: /mw/skins/
Allow: /mw/title.png
Disallow: /w/
Disallow: /mw/
Disallow: /wiki/Special:
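
If you want to sanity-check how a well-behaved crawler should interpret
rules like these, Python's stdlib urllib.robotparser can parse them
directly (the rules below are a trimmed version of the block above, plus
a hypothetical Crawl-delay of 10 seconds for illustration):

```python
from urllib.robotparser import RobotFileParser

# A trimmed copy of the rules above, with an illustrative Crawl-delay added
rules = """\
User-agent: *
Crawl-delay: 10
Allow: /mw/images/
Disallow: /mw/
Disallow: /wiki/Special:
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Allow: /mw/images/ is more specific, so image files stay fetchable
print(rp.can_fetch("Baiduspider", "/mw/images/logo.png"))  # True
# Everything else under /mw/ is off-limits to crawlers
print(rp.can_fetch("Baiduspider", "/mw/index.php"))        # False
# crawl_delay() (Python 3.6+) reports the requested delay, if any
print(rp.crawl_delay("Baiduspider"))                       # 10
```

Of course, this only tells you what a crawler *should* do; whether
Baidu's actually honors Crawl-delay is exactly the open question.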

-- 
:wq
