Hi,

Has anyone tried benchmarking Nutch? I'm wondering how long I should expect 
the different stages of a Nutch crawl to take.

For example, I'm running Nutch on a RHEL4 machine with four 2 GHz Intel CPUs 
and 4 GB of RAM. This is my Nutch fetch process:

/usr/jdk1.5.0_10/bin/java -Xmx2000m -Dhadoop.log.dir=/nutch/search/logs 
-Dhadoop.log.file=hadoop.log 
-Djava.library.path=/nutch/search/lib/native/Linux-i386-32 
-Dhadoop.tmp.dir=/nutch/tmp -Djava.io.tmpdir=/nutch/tmp -classpath 
/nutch/search:/nutch/search/conf:/usr/jdk1.5.0_10/lib/tools.jar:/nutch/search/build:/nutch/search/build/test/classes:/nutch/search/build/nutch-1.0-dev.job:/nutch/search/nutch-*.job:/nutch/search/lib/commons-cli-2.0-SNAPSHOT.jar:/nutch/search/lib/commons-codec-1.3.jar:/nutch/search/lib/commons-httpclient-3.0.1.jar:/nutch/search/lib/commons-lang-2.1.jar:/nutch/search/lib/commons-logging-1.0.4.jar:/nutch/search/lib/commons-logging-api-1.0.4.jar:/nutch/search/lib/hadoop-0.17.1-core.jar:/nutch/search/lib/icu4j-3_6.jar:/nutch/search/lib/jakarta-oro-2.0.7.jar:/nutch/search/lib/jets3t-0.5.0.jar:/nutch/search/lib/jetty-5.1.4.jar:/nutch/search/lib/junit-3.8.1.jar:/nutch/search/lib/log4j-1.2.13.jar:/nutch/search/lib/lucene-core-2.3.0.jar:/nutch/search/lib/lucene-misc-2.3.0.jar:/nutch/search/lib/servlet-api.jar:/nutch/search/lib/taglibs-i18n.jar:/nutch/search/lib/tika-0.1-incubating.jar:/nutch/search/lib/xerces-2_6_2-apis.jar:/nutch/search/lib/xerces-2_6_2.jar:/nutch/search/lib/jetty-ext/ant.jar:/nutch/search/lib/jetty-ext/commons-el.jar:/nutch/search/lib/jetty-ext/jasper-compiler.jar:/nutch/search/lib/jetty-ext/jasper-runtime.jar:/nutch/search/lib/jetty-ext/jsp-api.jar
 org.apache.nutch.fetcher.Fetcher crawl/segments/20080923105853

A fetch of about 100,000 pages (with 20 threads per host) takes around 1-2 
hours. Does that seem reasonable, or is it too slow?
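
To put those figures in comparable terms, here is a rough back-of-the-envelope sketch that converts them into an average fetch rate. The function name pages_per_second is made up for illustration and is not part of Nutch; the only inputs are the approximate page count and the 1-2 hour range quoted above.

```python
def pages_per_second(pages: int, hours: float) -> float:
    """Average fetch rate for a crawl of `pages` pages lasting `hours` hours."""
    return pages / (hours * 3600)


PAGES = 100_000  # approximate segment size quoted above

# Evaluate both ends of the observed 1-2 hour range.
for hours in (1, 2):
    print(f"{hours} h -> {pages_per_second(PAGES, hours):.1f} pages/s")
# prints:
# 1 h -> 27.8 pages/s
# 2 h -> 13.9 pages/s
```

So the fetch is averaging somewhere around 14-28 pages per second on this one machine, which is probably the easiest number to compare against other people's setups.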

Thanks for any help.

Ed.




