Edward,

I have been running Crawl operations rather than the Fetch operation you describe below, and I am a little unclear on the difference. Since you specify a segment path when running Fetch, does that mean you have already crawled? If each of us breaks out the operations we are running end to end, perhaps we can get an apples-to-apples performance comparison.

What I am doing is crawling a list of about 10,000 URLs to a depth of 1 only. Most are on different hosts. I see two main blocks of computation time when I crawl: the fetching, which seems to finish quite quickly, followed by a lengthy phase in which the machine's CPU sits at 100%, but I'm not sure what it is doing. Perhaps it is parsing at that point?

Can you tell me what your operations are and what your configuration is?
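One rough way to put both runs on the same scale is to convert each into pages per second. The sketch below only does that arithmetic, using the figures from the original message quoted below (about 100,000 pages in 1 to 2 hours). The function name `pages_per_second` is made up for illustration and is not part of Nutch, and the result is an average that does not separate fetching from parsing.

```python
def pages_per_second(pages: int, hours: float) -> float:
    """Average throughput in pages per second for a whole run."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return pages / (hours * 3600.0)


# Figures taken from the quoted message: ~100,000 pages in 1 to 2 hours.
fast = pages_per_second(100_000, 1.0)  # lower bound on elapsed time
slow = pages_per_second(100_000, 2.0)  # upper bound on elapsed time

if __name__ == "__main__":
    print(f"between {slow:.1f} and {fast:.1f} pages/second")
```

Timing the fetch phase and the CPU-bound phase separately and running each through the same function would show which of the two dominates.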
Kevin

On Tue, Sep 23, 2008 at 4:54 AM, Edward Quick <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> Has anyone tried benchmarking nutch? I just wondered how long I should
> expect different stages of a nutch crawl to take.
>
> For example, I'm running Nutch on RHEL4 machine with 4 intel 2Ghz cpu's,
> and 4GB ram. This is my nutch fetch process:
>
> /usr/jdk1.5.0_10/bin/java -Xmx2000m -Dhadoop.log.dir=/nutch/search/logs
> -Dhadoop.log.file=hadoop.log
> -Djava.library.path=/nutch/search/lib/native/Linux-i386-32
> -Dhadoop.tmp.dir=/nutch/tmp -Djava.io.tmpdir=/nutch/tmp -classpath
> /nutch/search:/nutch/search/conf:/usr/jdk1.5.0_10/lib/tools.jar:/nutch/search/build:/nutch/search/build/test/classes:/nutch/search/build/nutch-1.0-dev.job:/nutch/search/nutch-*.job:/nutch/search/lib/commons-cli-2.0-SNAPSHOT.jar:/nutch/search/lib/commons-codec-1.3.jar:/nutch/search/lib/commons-httpclient-3.0.1.jar:/nutch/search/lib/commons-lang-2.1.jar:/nutch/search/lib/commons-logging-1.0.4.jar:/nutch/search/lib/commons-logging-api-1.0.4.jar:/nutch/search/lib/hadoop-0.17.1-core.jar:/nutch/search/lib/icu4j-3_6.jar:/nutch/search/lib/jakarta-oro-2.0.7.jar:/nutch/search/lib/jets3t-0.5.0.jar:/nutch/search/lib/jetty-5.1.4.jar:/nutch/search/lib/junit-3.8.1.jar:/nutch/search/lib/log4j-1.2.13.jar:/nutch/search/lib/lucene-core-2.3.0.jar:/nutch/search/lib/lucene-misc-2.3.0.jar:/nutch/search/lib/servlet-api.jar:/nutch/search/lib/taglibs-i18n.jar:/nutch/search/lib/tika-0.1-incubating.jar:/nutch/search/lib/xerces-2_6_2-apis.jar:/nutch/search/lib/xerces-2_6_2.jar:/nutch/search/lib/jetty-ext/ant.jar:/nutch/search/lib/jetty-ext/commons-el.jar:/nutch/search/lib/jetty-ext/jasper-compiler.jar:/nutch/search/lib/jetty-ext/jasper-runtime.jar:/nutch/search/lib/jetty-ext/jsp-api.jar
> org.apache.nutch.fetcher.Fetcher crawl/segments/20080923105853
>
> and a fetch of about 100,000 pages (with 20 threads per host) takes around
> 1-2 hours. Does that seem reasonable or too slow?
>
> Thanks for any help.
>
> Ed.
