Just tested it again :( With 2 threads, both a 1 second delay and a 0.5s delay got
about 1.3 pages/s, but the rate was not dropping.
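For reference, this is roughly what that test looks like in my nutch-site.xml. The
property names are what I believe they are called in nutch-default.xml of 0.9
("server.delay" in this thread should be fetcher.server.delay, and the thread count
fetcher.threads.fetch) - please check them against your own nutch-default.xml before
copying anything:

  <property>
    <name>fetcher.threads.fetch</name>
    <!-- total number of fetcher threads used for the test run -->
    <value>2</value>
  </property>

  <property>
    <name>fetcher.server.delay</name>
    <!-- seconds the fetcher waits between successive requests to the same server -->
    <value>1.0</value>
  </property>

The values above mirror the "2 threads, 1 second delay" run; for the 0.5s run only
fetcher.server.delay changes.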
Maybe I will try the download. But now I really wonder how Google can index
wikipedia. Actually many wikipedia sites share the same IP, and Google has indexed
120M pages of it. It must have done something special.

Martin Kuen wrote:
>
> hi,
>
> hm :( . . . okay
> Can you see the pages/sec value still decreasing?
>
> Probably I am wrong . . . but if you start your crawl with a low number of
> threads (1 or 2), you should immediately see a value which is very close to
> what you'd expect to see - considering the "server.delay" property. If this
> is not true --> I am wrong
>
> Regarding wikipedia: The english wikipedia has somewhat more than 1,900,000
> articles now. This number doesn't take into account all the revisions that
> occurred to them. If I recall it correctly a full dump of the english
> wikipedia would be around 600 GB. However, the actual content (the most
> up-to-date articles) fits into a 2.5 GB download (bz2 compressed). This
> download excludes things like images, user discussions, revisions, and so
> on. But with this download you're ready to go to set up your own wikipedia
> mirror.
>
>
> Cheers
>
>
> On 8/9/07, purpureleaf <[EMAIL PROTECTED]> wrote:
>>
>> Hi, it sounds like that is the cause, but I just tested it again. Setting
>> server.delay = 1s doesn't result in 1 page/s; it is almost the same speed.
>> Confused :(
>> I really didn't try to hammer wikipedia, I just wanted a site with enough
>> pages to test against.
>>
>> So with more than 12M pages on wikipedia, I guess it is almost impossible
>> to crawl wikipedia online.
>> How does google do this?
>>
>>
>> Martin Kuen wrote:
>> >
>> > hi there,
>> >
>> > the property "server.delay" is the delay for one site (e.g. wikipedia).
>> > So, if you have a delay of 0.5 you'll fetch 2 pages per second.
>> >
>> > In my opinion there is something about the fetcher's code that doesn't
>> > make it obey this rule in the very beginning . . . probably at start-up
>> > 30 threads start immediately without caring about this setting, which
>> > could cause a high pages/sec value in the beginning . . . but then the
>> > rule is applied correctly and this averaged value (pages/sec) gets
>> > corrected in a step-by-step manner - however I have no evidence for
>> > this assumption.
>> >
>> > If you look around the Fetcher's code (or maybe at the http-plugin -
>> > don't remember) you'll find a config property called
>> > "protocol.plugin.check.blocking". If you set it to false you'll
>> > override the "server.delay" property. The result of this action is that
>> > you'll start "hammering" the wikipedia site.
>> > I tried to achieve the same by setting "server.delay" to 0 . . . however
>> > . . . things didn't work well (I didn't investigate too much - I found
>> > the "check.blocking" property, which worked?!).
>> >
>> > Btw. I propose that you should not start (large) crawls on the
>> > wikipedia sites. The wiki guys don't like it. If you're just running a
>> > test and fetch a few pages . . . ok . . . but a crawl of 8 hours . . .
>> > hmm . . . not just a few pages, right?
>> > Furthermore a "server.delay" of 0.5 doesn't really appear polite to me
>> > . . .
>> >
>> > Ok, so what? If you're interested in indexing the wikipedia articles,
>> > you can set up wikipedia on your local computer . . .
>> > http://en.wikipedia.org/wiki/Wikipedia:Database_download
>> > Then you can run your fetch on your local machine or in your intranet
>> > and you'll just be limited by the speed of the machine powering the
>> > mediawiki application. I tried this with the German wikipedia dump and
>> > it took a little bit more than 33 hours (AMD Athlon 2600 dualcore, 2GB
>> > RAM, WinXP, java 1.5, nutch 0.9, ~614,000 articles, ~5.3 pages per
>> > second). I didn't really care about performance, so I think this could
>> > be faster.
>> >
>> >
>> > cheers
>> >
>> >
>> > On 8/9/07, purpureleaf <[EMAIL PROTECTED]> wrote:
>> >>
>> >> Hi, thanks for your reply
>> >>
>> >> Yes, I was fetching from wikipedia only; I did this just to test this
>> >> slowing-down effect. But not too much I think, 4 pages/s, and it still
>> >> gets slower and slower, forever. So the fetcher is supposed to be
>> >> slower than 1 page/s (per site)?
>> >> I watched my bandwidth; it used less than 20k/s, way less than anything
>> >> my provider would worry about.
>> >>
>> >>
>> >> Dennis Kubes-2 wrote:
>> >> >
>> >> > If this is stalling on only a few fetching tasks check the logs, more
>> >> > than likely it is fetching many pages from a single site (i.e.
>> >> > amazon, wikipedia, cnn) and the politeness settings (which you want
>> >> > to keep) are slowing it down.
>> >> >
>> >> > If it is stalling on many tasks but a single machine, check the
>> >> > hardware for that machine. We have seen hard disk speed decrease
>> >> > dramatically right before they are going to die. On linux do
>> >> > something like hdparm -tT /dev/hda where hda is the device to check.
>> >> > Average speeds for Sata should be in the 75MBps range for disk reads
>> >> > and 7000+ range for cached reads.
>> >> >
>> >> > Another thing is you may be maxing your bandwidth and your provider
>> >> > is throttling you?
>> >> >
>> >> > Dennis Kubes
>> >> >
>> >> > purpureleaf wrote:
>> >> >> Hi, I have worked with nutch for some time. One thing I am always
>> >> >> curious about is that when crawling, the fetcher's speed gets slower
>> >> >> and slower, no matter what configuration I use.
>> >> >> My last test got this (just one site, to keep the problem simple):
>> >> >>
>> >> >> OS : winxp
>> >> >> java : 1.6.0.2
>> >> >> nutch : 0.9
>> >> >> cpu : AMD 1800
>> >> >> mem : 1G
>> >> >> network : 3m adsl
>> >> >>
>> >> >> site : wikipedia.org
>> >> >> threads per site : 30
>> >> >> server.delay : 0.5
>> >> >>
>> >> >> It starts at about 6 pages/s, but drops to 4 within a few minutes,
>> >> >> then gets slower and slower. I have run it for 8 hours, only
>> >> >> 2 pages/s were left, and it was still slowing down.
>> >> >> But if I stop it and start another one, it returns to full speed
>> >> >> (then slows down again). I am ok with 2 pages/s for one site, but I
>> >> >> do hope it will keep that speed.
>> >> >>
>> >> >> I found there are some people on this list with the same problem,
>> >> >> but I can't find an answer.
>> >> >> Is nutch designed to work this way?
>> >> >>
>> >> >> Thanks!
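P.S. Regarding the "protocol.plugin.check.blocking" property Martin mentions above:
I assume it is set like any other property in nutch-site.xml, so turning the blocking
check off would look something like the sketch below. I have not tried this myself,
and as Martin points out it effectively disables the per-server delay, so it is
nothing to point at wikipedia or any other public site:

  <property>
    <name>protocol.plugin.check.blocking</name>
    <!-- false = skip the per-host blocking check, i.e. the server.delay politeness
         delay is no longer enforced (not recommended against public sites) -->
    <value>false</value>
  </property>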
--
View this message in context:
http://www.nabble.com/Fetcher-get-slower-and-slower-in-one-run-of-crawling-tf4241580.html#a12083911
Sent from the Nutch - User mailing list archive at Nabble.com.
