On Wed, Feb 8, 2012 at 12:17 PM, Pandu Poluan <[email protected]> wrote:
>
> On Feb 8, 2012 10:57 PM, "Michael Mol" <[email protected]> wrote:
>>
>> On Wed, Feb 8, 2012 at 10:46 AM, Paul Hartman
>> <[email protected]> wrote:
>> > On Wed, Feb 8, 2012 at 2:55 AM, Pandu Poluan <[email protected]> wrote:
>> >>
>> >> On Jan 27, 2012 11:18 PM, "Paul Hartman"
>> >> <[email protected]>
>> >> wrote:
>> >>>
>> >>
>> >> ---- >8 snippage
>> >>
>> >>>
>> >>> BTW, the Baidu spider hits my site more than all of the others
>> >>> combined...
>> >>>
>> >>
>> >> Somewhat anecdotal, and definitely veering way off-topic, but Baidu was
>> >> the
>> >> reason why my company decided to change our webhosting company: Its
>> >> spidering brought our previous webhosting to its knees...
>> >>
>> >> Rgds,
>> >
>> > I wonder if Baidu crawler honors the Crawl-delay directive in
>> > robots.txt?
>> >
>> > Or I wonder if Baidu crawler IPs need to be covered by firewall tarpit
>> > rules. ;)
>>
>> I don't remember if it respects Crawl-Delay, but it respects forbidden
>> paths, etc. I've never been DDOS'd by Baidu crawlers, but I did get
>> DDOS'd by Yahoo a number of times. Turned out the solution was to
>> disallow access to expensive-to-render pages. If you're using
>> MediaWiki with prettified URLs, this works great:
>>
>> User-agent: *
>> Allow: /mw/images/
>> Allow: /mw/skins/
>> Allow: /mw/title.png
>> Disallow: /w/
>> Disallow: /mw/
>> Disallow: /wiki/Special:
>>
>
> *slaps forehead*
>
> Now why didn't I think of that before?!
>
> Thanks for reminding me!
I didn't think of it until I watched the logs live and saw it crawling
through page histories during one of the events.

MediaWiki stores page histories as a series of diffs from the current
version, so it has to assemble old versions by reverse-applying the
diffs of all the edits made to the page between the current version
and the version you're asking for. If you have a bot retrieve ten
versions of a page that has ten revisions, that's 210 reverse diff
operations. Grabbing all versions of a page with 20 revisions would
result in over 1500 reverse diffs. My 'hello world' page has over five
hundred revisions.

So the page history crawling was pretty quickly obvious...
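For anyone curious what that reconstruction looks like, here's a rough
Python sketch of the scheme as I understand it. It is not MediaWiki's
actual storage code, and the Page / make_reverse_diff /
apply_reverse_diff names are made up for illustration, but it shows why
serving a revision far from the current one means replaying every
intervening diff:

#!/usr/bin/env python
# Toy model of serving old wiki revisions from reverse diffs.  Not
# MediaWiki's real storage layer -- just an illustration of why a
# crawler walking a long page history is expensive: every request for
# an old revision replays the diffs between it and the current text.

from difflib import SequenceMatcher


def make_reverse_diff(newer, older):
    """Record the edits needed to turn `newer` (list of lines) back into `older`."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, newer, older).get_opcodes():
        if tag != "equal":
            ops.append((i1, i2, older[j1:j2]))  # replace newer[i1:i2] with these lines
    return ops


def apply_reverse_diff(lines, ops):
    """Apply one recorded reverse diff, yielding the previous revision's lines."""
    result, cursor = [], 0
    for i1, i2, replacement in ops:
        result.extend(lines[cursor:i1])
        result.extend(replacement)
        cursor = i2
    result.extend(lines[cursor:])
    return result


class Page(object):
    def __init__(self, text):
        self.current = text.splitlines()  # only the latest text is kept in full
        self.reverse_diffs = []           # reverse_diffs[-1] undoes the latest edit

    def edit(self, new_text):
        new_lines = new_text.splitlines()
        self.reverse_diffs.append(make_reverse_diff(new_lines, self.current))
        self.current = new_lines

    def revision(self, steps_back):
        """Rebuild the text as it was `steps_back` edits ago.  The further
        back you go, the more diffs get replayed -- one per intervening edit."""
        lines = self.current
        for ops in reversed(self.reverse_diffs[len(self.reverse_diffs) - steps_back:]):
            lines = apply_reverse_diff(lines, ops)
        return "\n".join(lines)


page = Page("Hello, world!")
page.edit("Hello, world!\nNow with a second line.")
page.edit("Hello, world!\nNow with a second line.\nAnd a third.")
print(page.revision(2))  # replays two reverse diffs -> "Hello, world!"

Each revision() call walks backward through every diff between the
requested version and the current one, so a bot pulling the full
history of a page with hundreds of revisions multiplies that work very
quickly compared to a bot that only fetches current pages.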
--
:wq
