Hi there, the property "server.delay" is the delay between requests to one site (e.g. wikipedia). So, with a delay of 0.5 you'll fetch at most 2 pages per second from that site.
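(In my conf/nutch-default.xml the full name of this property is "fetcher.server.delay"; I'm quoting that name from memory for Nutch 0.9, so verify it against your own config. A minimal nutch-site.xml override would then look like this:)

  <!-- nutch-site.xml: per-host politeness delay. The fetcher waits this
       many seconds between two successive requests to the same host, so
       0.5 means at most 2 pages per second from one site. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>0.5</value>
  </property>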
In my opinion there is something in the fetcher's code that doesn't make it obey this rule at the very beginning: probably all 30 threads start immediately at start-up without caring about this setting, which could cause a high pages/sec value at first; after that the rule is applied correctly and the averaged pages/sec value corrects itself step by step. However, I have no evidence for this assumption.

If you look around the Fetcher's code (or maybe at the http plugin, I don't remember) you'll find a config property called "protocol.plugin.check.blocking". If you set it to false you'll override the "server.delay" property; the result is that you'll start "hammering" the wikipedia site (see the config sketch at the end of this mail). I tried to achieve the same by setting "server.delay" to 0, but that didn't work well. I didn't investigate much, because the check.blocking property worked.

Btw., I suggest you do not run (large) crawls against the wikipedia sites. The wiki guys don't like it. If you're just running a test and fetching a few pages, ok, but a crawl of 8 hours is not just a few pages, right? Furthermore, a "server.delay" of 0.5 doesn't really look polite to me.

So what can you do instead? If you're interested in indexing the wikipedia articles, you can set up wikipedia on your local computer:

http://en.wikipedia.org/wiki/Wikipedia:Database_download

Then you can run your fetch on your local machine or in your intranet, limited only by the speed of the machine powering the mediawiki application. I tried this with the German wikipedia dump and it took a little more than 33 hours (AMD Athlon 2600 dual-core, 2 GB RAM, WinXP, Java 1.5, Nutch 0.9, ~614,000 articles, ~5.3 pages per second). I didn't really care about performance, so I think this could be faster.

cheers

On 8/9/07, purpureleaf <[EMAIL PROTECTED]> wrote:
>
> Hi, thanks for your reply
>
> Yes, I was fetching from wikipedia only; I did this just to test the
> slowing-down effect. But it is not slowed down too much, I think: at
> 4 pages/s it still gets slower and slower, forever. So the fetcher is
> supposed to be slower than 1 page/s (per site)?
> I watched my bandwidth; it used less than 20k/s, way less than anything
> my provider would care about.
>
> Dennis Kubes-2 wrote:
> >
> > If this is stalling on only a few fetching tasks, check the logs; more
> > than likely it is fetching many pages from a single site (i.e. amazon,
> > wikipedia, cnn) and the politeness settings (which you want to keep)
> > are slowing it down.
> >
> > If it is stalling on many tasks on a single machine, check the hardware
> > of that machine. We have seen hard disk speed decrease dramatically
> > right before a disk is going to die. On linux do something like
> > hdparm -tT /dev/hda, where hda is the device to check. Average speeds
> > for SATA should be in the 75 MB/s range for disk reads and 7000+ for
> > cached reads.
> >
> > Another thing is you may be maxing out your bandwidth and your provider
> > is throttling you?
> >
> > Dennis Kubes
> >
> > purpureleaf wrote:
> >> Hi, I have worked with nutch for some time. One thing I have always
> >> been curious about is that when crawling, the fetcher's speed gets
> >> slower and slower, no matter what configuration I use.
> >> My last test got this (just one site, to keep the problem simple):
> >>
> >> OS : WinXP
> >> java : 1.6.0.2
> >> nutch : 0.9
> >> cpu : AMD 1800
> >> mem : 1G
> >> network : 3M ADSL
> >>
> >> site : wikipedia.org
> >> threads per site : 30
> >> server.delay : 0.5
> >>
> >> It starts at about 6 pages/s but drops to 4 within a few minutes, then
> >> gets slower and slower. I have run it for 8 hours; just 2 pages/s were
> >> left, and it was still slowing down.
> >> But if I stop it and start another one, it returns to full speed (and
> >> then slows down again). I am ok with 2 pages/s for one site, but I do
> >> hope it keeps that speed.
> >>
> >> I found some guys on this list who have the same problem, but I can't
> >> find an answer.
> >> Is nutch designed to work this way?
> >>
> >> Thanks!
>
>
> --
> View this message in context:
> http://www.nabble.com/Fetcher-get-slower-and-slower-in-one-run-of-crawling-tf4241580.html#a12073371
> Sent from the Nutch - User mailing list archive at Nabble.com.
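
PS: regarding the "protocol.plugin.check.blocking" property mentioned above, this is the kind of nutch-site.xml override I mean. Just a sketch from memory against Nutch 0.9; please double-check the property name against your conf/nutch-default.xml before relying on it:

  <!-- nutch-site.xml: disable the per-host blocking check in the http
       plugin. With this set to false the fetcher threads no longer wait
       "server.delay" seconds between requests to the same host, i.e.
       you start hammering the site. Only do this against servers you
       own (e.g. a local mediawiki mirror). -->
  <property>
    <name>protocol.plugin.check.blocking</name>
    <value>false</value>
  </property>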
