> Are you fetching from the same host? If the URL list is concentrated in a
> few hosts then, because of the politeness setting, a lot of time will be
> spent waiting.
Yes, I am fetching from the same host. These are my nutch-site settings,
which should hopefully override the politeness settings:

fetcher.server.delay      0.01
fetcher.threads.fetch     10
fetcher.threads.per.host  50
fetcher.store.content     true
fetcher.parse             true
db.ignore.internal.links  false
db.ignore.external.links  true
db.max.outlinks.per.page  -1
file.content.limit        -1
http.content.limit        -1
http.useHttp11            true
http.redirect.max         5
http.timeout              10000

> > Hi Kevin,
> >
> > Thanks for your reply.
> >
> > I haven't checked the other crawl stages, as the bottleneck for me is
> > the nutch fetch (mine's configured to do parsing and storing content).
> > Also, I'm fetching from an F5 load-balanced pair of Domino Lotus Notes
> > intranet servers, so the network speed is not a factor.
> >
> > The actual time "fetching/parsing" urls seems quick, around 1000 a
> > minute with 20 threads, but then there's some processing after that,
> > which seems to increase exponentially as the list gets bigger. I timed
> > some of the fetches, and you can see here how much the time increases
> > after I get to 20000 urls:
> >
> > bin/nutch readseg -list -dir crawl/segments/
> >
> > NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED  ACTUAL TIME ELAPSED
> > 20080924121754  1          2008-09-24T12:18:04  2008-09-24T12:18:04  1        1
> > 20080924121816  58         2008-09-24T12:18:47  2008-09-24T12:18:53  58       25
> > 20080924121920  363        2008-09-24T12:19:40  2008-09-24T12:20:02  379      214
> > 20080924122034  1987       2008-09-24T12:21:14  2008-09-24T12:22:50  2085     1434
> > 20080924122344  4494       2008-09-24T12:24:25  2008-09-24T12:31:14  4598     4318
> > 20080924125007  6802       2008-09-24T12:51:02  2008-09-24T13:01:32  6874     6462    => 14:43.16 elapsed
> > 20080924131404  8170       2008-09-24T13:15:04  2008-09-24T13:22:29  8317     7802    => 13:26.59 elapsed
> > 20080924132912  21065      2008-09-24T13:30:28  2008-09-24T13:42:49  21081    19699   => 43:42.11 elapsed
> > 20080924141603  26205      2008-09-24T14:18:42  2008-09-24T14:39:26  26327    24649   => 1:50:29 elapsed
> > 20080924161725  10998      ?                    ?                    ?        ?
> >
> > Can anyone tell me what is going on in the fetch stage after the urls
> > have been fetched and parsed, please? Can this be sped up in any way?
> >
> > Thanks,
> >
> > Ed.
> >
> > > Date: Tue, 23 Sep 2008 13:57:14 -0700
> > > From: [EMAIL PROTECTED]
> > > To: [email protected]
> > > Subject: Re: benchmarking
> > >
> > > I am using individual commands called from Java. I simply used
> > > Crawl.java as a starting point. Because I'm not using Nutch for
> > > search, I have already eliminated quite a few things, such as building
> > > indexes and inverting links. All the fetching and subsequent lengthy
> > > operations are happening in Fetcher.fetch(segments, threads). My next
> > > step is to hack into that and see what's going on. When I crank up the
> > > logs I see a ton of map/reduce stuff happening.
> > >
> > > On Tue, Sep 23, 2008 at 12:54 PM, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> > >
> > > > On Tue, Sep 23, 2008 at 8:51 PM, Kevin MacDonald <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Some additional info. We are running Nutch on one of Amazon's EC2
> > > > > small instances, which has the equivalent CPU capacity of a
> > > > > 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor, with 1.7GB of
> > > > > RAM. To crawl 35,000 Urls to a depth of 1, here's a breakdown of
> > > > > processing times from the log file:
> > > > >
> > > > > - Injecting urls into the Crawl db: 1 min.
> > > > > - Fetching: 46 min.
> > > > > - Additional processing (unknown): 66 min.
> > > > >
> > > > > Fetching is happening at a rate of about 760 Urls/min, or 1.1
> > > > > million per day. The big block of additional processing happens
> > > > > after the last fetch. I don't really know what Nutch is doing
> > > > > during that time. Parsing perhaps? I would really like to know,
> > > > > because that is killing my performance.
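[Editor's note] The slowdown Ed describes is easier to see as a throughput figure. A quick back-of-envelope calculation over the four segments in his readseg listing that report an elapsed time (purely illustrative; "elapsed" here covers the whole segment run, fetch plus the post-fetch processing under discussion):

```python
# Rough pages-per-minute for the segments above that reported an
# elapsed time: (fetched count, elapsed time in minutes).
segments = [
    (6874,  14 + 43.16 / 60),     # 14:43.16 elapsed
    (8317,  13 + 26.59 / 60),     # 13:26.59 elapsed
    (21081, 43 + 42.11 / 60),     # 43:42.11 elapsed
    (26327, 60 + 50 + 29 / 60),   # 1:50:29 elapsed
]

rates = [fetched / minutes for fetched, minutes in segments]

for (fetched, _), rate in zip(segments, rates):
    print(f"{fetched:>6} pages: {rate:5.0f} pages/min")
```

The per-segment rate drops from roughly 470-620 pages/min to about 240 pages/min on the largest segment, which is consistent with Ed's observation that the post-fetch phase, not the fetching itself, dominates as the list grows.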
> > > > Are you using the "crawl" command? If you are serious about Nutch,
> > > > I would suggest that you use the individual commands
> > > > (inject/fetch/parse/etc). This should give you a better idea of
> > > > what is taking so long.
> > > >
> > > > > Kevin
> > > > >
> > > > > On Tue, Sep 23, 2008 at 10:14 AM, Kevin MacDonald <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > >> Edward,
> > > > >>
> > > > >> I have been doing Crawl operations, as opposed to the Fetch
> > > > >> operation you're doing below. I think I am a little unclear on
> > > > >> the difference. Since you're specifying a segment path when doing
> > > > >> a Fetch, does that mean you have already crawled? If we can break
> > > > >> out the operations each of us is doing, end to end, perhaps we
> > > > >> can get an apples-to-apples performance comparison. What I am
> > > > >> doing is crawling a list of perhaps 10,000 Urls to a depth of 1
> > > > >> only. Most are from different hosts. I am finding that there are
> > > > >> two main blocks of computation time when I crawl: there is the
> > > > >> fetching, which seems to happen quite fast, and that is followed
> > > > >> by a lengthy process where the CPU of the machine is at 100%, but
> > > > >> I'm not sure what it's doing. Perhaps it's parsing at that point?
> > > > >> Can you tell me what your operations are and what your
> > > > >> configuration is?
> > > > >>
> > > > >> Kevin
> > > > >>
> > > > >> On Tue, Sep 23, 2008 at 4:54 AM, Edward Quick <[EMAIL PROTECTED]> wrote:
> > > > >>
> > > > >>> Hi,
> > > > >>>
> > > > >>> Has anyone tried benchmarking Nutch? I just wondered how long I
> > > > >>> should expect the different stages of a nutch crawl to take.
> > > > >>>
> > > > >>> For example, I'm running Nutch on a RHEL4 machine with 4 Intel
> > > > >>> 2GHz CPUs and 4GB of RAM.
> > > > >>> This is my nutch fetch process:
> > > > >>>
> > > > >>> /usr/jdk1.5.0_10/bin/java -Xmx2000m \
> > > > >>>   -Dhadoop.log.dir=/nutch/search/logs \
> > > > >>>   -Dhadoop.log.file=hadoop.log \
> > > > >>>   -Djava.library.path=/nutch/search/lib/native/Linux-i386-32 \
> > > > >>>   -Dhadoop.tmp.dir=/nutch/tmp -Djava.io.tmpdir=/nutch/tmp \
> > > > >>>   -classpath /nutch/search:/nutch/search/conf:/usr/jdk1.5.0_10/lib/tools.jar:/nutch/search/build:/nutch/search/build/test/classes:/nutch/search/build/nutch-1.0-dev.job:/nutch/search/nutch-*.job:/nutch/search/lib/commons-cli-2.0-SNAPSHOT.jar:/nutch/search/lib/commons-codec-1.3.jar:/nutch/search/lib/commons-httpclient-3.0.1.jar:/nutch/search/lib/commons-lang-2.1.jar:/nutch/search/lib/commons-logging-1.0.4.jar:/nutch/search/lib/commons-logging-api-1.0.4.jar:/nutch/search/lib/hadoop-0.17.1-core.jar:/nutch/search/lib/icu4j-3_6.jar:/nutch/search/lib/jakarta-oro-2.0.7.jar:/nutch/search/lib/jets3t-0.5.0.jar:/nutch/search/lib/jetty-5.1.4.jar:/nutch/search/lib/junit-3.8.1.jar:/nutch/search/lib/log4j-1.2.13.jar:/nutch/search/lib/lucene-core-2.3.0.jar:/nutch/search/lib/lucene-misc-2.3.0.jar:/nutch/search/lib/servlet-api.jar:/nutch/search/lib/taglibs-i18n.jar:/nutch/search/lib/tika-0.1-incubating.jar:/nutch/search/lib/xerces-2_6_2-apis.jar:/nutch/search/lib/xerces-2_6_2.jar:/nutch/search/lib/jetty-ext/ant.jar:/nutch/search/lib/jetty-ext/commons-el.jar:/nutch/search/lib/jetty-ext/jasper-compiler.jar:/nutch/search/lib/jetty-ext/jasper-runtime.jar:/nutch/search/lib/jetty-ext/jsp-api.jar \
> > > > >>>   org.apache.nutch.fetcher.Fetcher crawl/segments/20080923105853
> > > > >>>
> > > > >>> and a fetch of about 100,000 pages (with 20 threads per host)
> > > > >>> takes around 1-2 hours. Does that seem reasonable or too slow?
> > > > >>>
> > > > >>> Thanks for any help.
> > > > >>>
> > > > >>> Ed.
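[Editor's note] Doğacan's suggestion of running the stages individually rather than via the all-in-one crawl command can be sketched roughly as below. The crawldb/segments paths and the -topN value are illustrative assumptions, not from the thread; the segment name is the one from Ed's Fetcher invocation. The commands are echoed rather than executed so the skeleton runs anywhere; inside a real Nutch checkout, replace the echo with `time bin/nutch` to time each stage separately:

```shell
#!/bin/sh
# Sketch of one crawl cycle run stage by stage, so the slow phase can
# be timed in isolation. Paths and -topN are illustrative assumptions.
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments
SEGMENT=$SEGMENTS/20080923105853   # segment name taken from the thread

run() {
    # Echo instead of executing; in a real checkout use:
    #   time bin/nutch "$@"
    echo "bin/nutch $*"
}

run inject   "$CRAWLDB" urls                # seed the crawl db
run generate "$CRAWLDB" "$SEGMENTS" -topN 100000   # build a fetch list
run fetch    "$SEGMENT" -noParsing          # fetch only, no inline parse
run parse    "$SEGMENT"                     # parse as a separate step
run updatedb "$CRAWLDB" "$SEGMENT"          # fold results into the db
```

With fetcher.parse set back to false and the stages split out like this, the time currently buried inside the fetch job (parsing and the segment output bookkeeping) would show up as its own timed stage, which is what both Kevin and Ed are trying to isolate.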
> > > > --
> > > > Doğacan Güney
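[Editor's note] As a sanity check on the politeness question that opened this thread: with the fetcher.server.delay of 0.01s from Ed's nutch-site settings, a simple per-host politeness ceiling sits far above the rates he observes. This assumes, simplistically, that the delay gates successive requests to a single host one at a time; it ignores fetcher.threads.per.host:

```python
# Back-of-envelope: politeness ceiling vs. observed throughput,
# assuming one request per server-delay interval per host.
server_delay_s = 0.01                       # fetcher.server.delay (from thread)
politeness_ceiling = 60 / server_delay_s    # max pages/min from one host

observed = 100_000 / 90                     # ~100k pages in ~1.5h (Ed's figures)

print(f"politeness ceiling: {politeness_ceiling:.0f} pages/min")
print(f"observed rate:      {observed:.0f} pages/min")
```

Under this assumption the ceiling is about 6,000 pages/min against an observed ~1,100, so the politeness delay alone should not be the bottleneck at these settings, supporting the thread's focus on the post-fetch processing instead.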
