Hi, that does sound like the cause, but I just tested it again: with
server.delay = 1s I still don't get 1 page/s; the speed is almost the
same. Confused :(
I really didn't mean to hammer Wikipedia, I just wanted a site with
enough pages to test against.
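
(For reference, the override I'd expect in conf/nutch-site.xml looks
roughly like this - note that nutch-default.xml in 0.9 seems to name
the property "fetcher.server.delay" in full, so that may be worth
double-checking:)

  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>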

So with Wikipedia's more than 12M pages, I guess it is almost impossible
to crawl Wikipedia online.
How does Google do this?


Martin Kuen wrote:
> 
> hi there,
> 
> the property "server.delay" is the per-site delay (e.g. for
> wikipedia). So if you set a delay of 0.5, you'll fetch at most 2 pages
> per second from that site.
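> 
> Conceptually that rule is just a per-host politeness gate. A minimal
> sketch of the idea (illustrative only - NOT the actual Fetcher code)
> could look like this:
> 
>     import java.util.HashMap;
>     import java.util.Map;
> 
>     // Blocks callers until "delaySeconds" have passed since the last
>     // request to the same host - roughly what server.delay enforces.
>     public class HostDelayGate {
>       private final long delayMs;
>       private final Map<String, Long> nextAllowed =
>           new HashMap<String, Long>();
> 
>       public HostDelayGate(double delaySeconds) {
>         this.delayMs = (long) (delaySeconds * 1000);
>       }
> 
>       public synchronized void acquire(String host)
>           throws InterruptedException {
>         while (true) {
>           long now = System.currentTimeMillis();
>           Long next = nextAllowed.get(host);
>           if (next == null || next <= now) {
>             nextAllowed.put(host, now + delayMs);  // reserve next slot
>             return;
>           }
>           wait(next - now);  // releases the lock while waiting
>         }
>       }
>     }
> 
> With 30 threads on a single host they all end up queueing on that
> gate, so the extra threads don't buy you anything for a one-site
> crawl.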
> 
> In my opinion there is something in the fetcher's code that doesn't
> make it obey this rule at the very beginning . . . probably at
> start-up all 30 threads start immediately without caring about this
> setting, which could cause a high pages/sec value at first . . . then
> the rule is applied correctly and the averaged pages/sec value gets
> corrected step by step - however, I have no evidence for this
> assumption.
> 
> If you look around the Fetcher's code (or maybe the http-plugin - I
> don't remember) you'll find a config property called
> "protocol.plugin.check.blocking". If you set it to false you'll
> override the "server.delay" property. The result of this is that
> you'll start "hammering" the wikipedia site.
> I tried to achieve the same by setting "server.delay" to 0 . . .
> however . . . things didn't work well (I didn't investigate too much -
> I had found the "check.blocking" property, which worked?!).
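> 
> For reference, that override also goes into conf/nutch-site.xml;
> something like this (double-check the exact property name against your
> version's nutch-default.xml or the lib-http plugin before relying on
> it):
> 
>     <property>
>       <name>protocol.plugin.check.blocking</name>
>       <value>false</value>
>     </property>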
> 
> Btw., I suggest you not start (large) crawls against the wikipedia
> sites. The wiki guys don't like it. If you're just running a test and
> fetching a few pages . . . ok . . . but a crawl of 8 hours . . . hmm
> . . . that's not just a few pages, right?
> Furthermore, a "server.delay" of 0.5 doesn't really seem polite to me
> . . .
> 
> Ok, so what? If you're interested in indexing the wikipedia articles,
> you can set up wikipedia on your local computer . . .
> http://en.wikipedia.org/wiki/Wikipedia:Database_download
> Then you can run your fetch on your local machine or in your intranet,
> and you'll just be limited by the speed of the machine powering the
> mediawiki application. I tried this with the German wikipedia dump and
> it took a little more than 33 hours (AMD Athlon 2600 dual-core, 2GB
> RAM, WinXP, Java 1.5, Nutch 0.9, ~614,000 articles, ~5.3 pages per
> second). I didn't really care about performance, so I think this could
> be faster.
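> 
> If you go that route, the crawl itself is the usual procedure; roughly
> like this, where the seed URL and depth are just placeholders for
> whatever your local mediawiki answers on:
> 
>     echo "http://localhost/wiki/index.php/Main_Page" > urls/seed.txt
>     bin/nutch crawl urls -dir crawl.local -depth 10 -threads 10
> 
> plus a matching pattern in conf/crawl-urlfilter.txt so the crawl stays
> on localhost.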
> 
> 
> cheers
> 
> 
> On 8/9/07, purpureleaf <[EMAIL PROTECTED]> wrote:
>>
>>
>> Hi, thanks for your reply
>>
>> Yes, I was fetching from wikipedia only; I did that just to test this
>> slowing-down effect. And not too aggressively, I think - 4 pages/s -
>> yet it still gets slower and slower, forever. So the fetcher is
>> supposed to run slower than 1 page/s (per site)?
>> I watched my bandwidth: it stayed under 20k/s, way too little for my
>> provider to mind.
>>
>>
>>
>> Dennis Kubes-2 wrote:
>> >
>> > If this is stalling on only a few fetching tasks, check the logs;
>> > more than likely it is fetching many pages from a single site (e.g.
>> > amazon, wikipedia, cnn) and the politeness settings (which you want
>> > to keep) are slowing it down.
>> >
>> > If it is stalling on many tasks on a single machine, check that
>> > machine's hardware. We have seen hard disk speed decrease
>> > dramatically right before a disk is going to die. On Linux do
>> > something like hdparm -tT /dev/hda, where hda is the device to
>> > check. Average speeds for SATA should be in the 75MBps range for
>> > disk reads and the 7000+ range for cached reads.
>> >
>> > Another possibility is that you are maxing out your bandwidth and
>> > your provider is throttling you.
>> >
>> > Dennis Kubes
>> >
>> > purpureleaf wrote:
>> >> Hi, I have worked with nutch for some time. One thing I have
>> >> always been curious about is that when crawling, the fetcher's
>> >> speed gets slower and slower, no matter what configuration I use.
>> >> My last test gave this (just one site, to keep the problem
>> >> simple):
>> >>
>> >> OS: WinXP
>> >> Java: 1.6.0.2
>> >> Nutch: 0.9
>> >> CPU: AMD 1800
>> >> Mem: 1GB
>> >> Network: 3M ADSL
>> >>
>> >> Site: wikipedia.org
>> >> Threads per site: 30
>> >> server.delay: 0.5
>> >>
>> >> It starts at about 6 pages/s but drops to 4 within a few minutes,
>> >> then gets slower and slower. I have run it for 8 hours; just
>> >> 2 pages/s were left, and it was still slowing down.
>> >> But if I stop it and start another one, it returns to full speed
>> >> (then slows down again). I am OK with 2 pages/s for one site, but
>> >> I do hope it would keep that speed.
>> >>
>> >> I found some guys on this list have had the same problem, but I
>> >> can't find an answer.
>> >> Is nutch designed to work this way?
>> >>
>> >> Thanks!
>> >
>> >
>>
>>
>>
> 
> 

