On Wed, Feb 8, 2012 at 12:17 PM, Pandu Poluan <[email protected]> wrote:
>
> On Feb 8, 2012 10:57 PM, "Michael Mol" <[email protected]> wrote:
>>
>> On Wed, Feb 8, 2012 at 10:46 AM, Paul Hartman
>> <[email protected]> wrote:
>> > On Wed, Feb 8, 2012 at 2:55 AM, Pandu Poluan <[email protected]> wrote:
>> >>
>> >> On Jan 27, 2012 11:18 PM, "Paul Hartman"
>> >> <[email protected]>
>> >> wrote:
>> >>>
>> >>
>> >> ---- >8 snippage
>> >>
>> >>>
>> >>> BTW, the Baidu spider hits my site more than all of the others
>> >>> combined...
>> >>>
>> >>
>> >> Somewhat anecdotal, and definitely veering way off-topic, but Baidu was
>> >> the
>> >> reason why my company decided to change our webhosting company: Its
>> >> spidering brought our previous webhosting to its knees...
>> >>
>> >> Rgds,
>> >
>> > I wonder if Baidu crawler honors the Crawl-delay directive in
>> > robots.txt?
>> >
>> > Or I wonder if Baidu crawler IPs need to be covered by firewall tarpit
>> > rules. ;)
>>
>> I don't remember if it respects Crawl-Delay, but it respects forbidden
>> paths, etc. I've never been DDOS'd by Baidu crawlers, but I did get
>> DDOS'd by Yahoo a number of times. Turned out the solution was to
>> disallow access to expensive-to-render pages. If you're using
>> MediaWiki with prettified URLs, this works great:
>>
>> User-agent: *
>> Allow: /mw/images/
>> Allow: /mw/skins/
>> Allow: /mw/title.png
>> Disallow: /w/
>> Disallow: /mw/
>> Disallow: /wiki/Special:
>>
>
> *slaps forehead*
>
> Now why didn't I think of that before?!
>
> Thanks for reminding me!
I didn't think of it until I watched the logs live and saw it crawling
through page histories during one of the events.

MediaWiki stores page histories as a series of diffs from the current
version, so it has to assemble old versions by reverse-applying the
diffs of all the edits made to the page between the current version
and the version you're asking for. If you have a bot retrieve ten
versions of a page that has ten revisions, that's 210 reverse diff
operations. Grabbing all versions of a page with 20 revisions would
result in over 1500 reverse diffs. My 'hello world' page has over five
hundred revisions.

So the page history crawling was pretty quickly obvious...
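For anyone curious what that reconstruction looks like, here's a rough
Python sketch of the scheme as I understand it. It is not MediaWiki's
actual storage code, and the Page / make_reverse_diff /
apply_reverse_diff names are made up for illustration, but it shows why
serving a revision far from the current one means replaying every
intervening diff:

#!/usr/bin/env python
# Toy model of serving old wiki revisions from reverse diffs.  Not
# MediaWiki's real storage layer -- just an illustration of why a
# crawler walking a long page history is expensive: every request for
# an old revision replays the diffs between it and the current text.

from difflib import SequenceMatcher


def make_reverse_diff(newer, older):
    """Record the edits needed to turn `newer` (list of lines) back into `older`."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, newer, older).get_opcodes():
        if tag != "equal":
            ops.append((i1, i2, older[j1:j2]))  # replace newer[i1:i2] with these lines
    return ops


def apply_reverse_diff(lines, ops):
    """Apply one recorded reverse diff, yielding the previous revision's lines."""
    result, cursor = [], 0
    for i1, i2, replacement in ops:
        result.extend(lines[cursor:i1])
        result.extend(replacement)
        cursor = i2
    result.extend(lines[cursor:])
    return result


class Page(object):
    def __init__(self, text):
        self.current = text.splitlines()  # only the latest text is kept in full
        self.reverse_diffs = []           # reverse_diffs[-1] undoes the latest edit

    def edit(self, new_text):
        new_lines = new_text.splitlines()
        self.reverse_diffs.append(make_reverse_diff(new_lines, self.current))
        self.current = new_lines

    def revision(self, steps_back):
        """Rebuild the text as it was `steps_back` edits ago.  The further
        back you go, the more diffs get replayed -- one per intervening edit."""
        lines = self.current
        for ops in reversed(self.reverse_diffs[len(self.reverse_diffs) - steps_back:]):
            lines = apply_reverse_diff(lines, ops)
        return "\n".join(lines)


page = Page("Hello, world!")
page.edit("Hello, world!\nNow with a second line.")
page.edit("Hello, world!\nNow with a second line.\nAnd a third.")
print(page.revision(2))  # replays two reverse diffs -> "Hello, world!"

Each revision() call walks backward through every diff between the
requested version and the current one, so a bot pulling the full
history of a page with hundreds of revisions multiplies that work very
quickly compared to a bot that only fetches current pages.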
--
:wq
