Edward,
I have been running Crawl operations rather than the Fetch operation you
show below, and I am a little unclear on the difference. Since you specify
a segment path when running Fetch, does that mean you have already
crawled? If we each break out the operations we are running end to end,
perhaps we can get an apples-to-apples performance comparison. What I am
doing is crawling a list of roughly 10,000 URLs to a depth of 1 only. Most
are on different hosts. I see two main blocks of computation time when I
crawl: the fetching, which seems to happen quite fast, followed by a
lengthy process during which the machine's CPU is at 100%, but I'm not
sure what it is doing. Perhaps it is parsing at that point? Can you tell
me what your operations are and what your configuration is?
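
To make the comparison concrete, each of us could time our steps
separately and report pages per second. Below is a minimal sketch of that
idea in Python, not a tested benchmark: the function names time_stage and
pages_per_second are my own, and the stage commands are placeholders that
do nothing, to be replaced with whatever each of us actually runs. The
last two lines just convert your figure of 100,000 pages in 1-2 hours into
a rate.

```python
import subprocess
import sys
import time


def time_stage(name, command):
    """Run one command and return (name, elapsed_seconds, exit_code)."""
    start = time.perf_counter()
    result = subprocess.run(command)
    elapsed = time.perf_counter() - start
    return name, elapsed, result.returncode


def pages_per_second(pages, seconds):
    """Average throughput for a stage that handled `pages` documents."""
    return pages / seconds


if __name__ == "__main__":
    # Placeholder stages (assumption): each command is a no-op here.
    # Substitute the real command line for each step you want to measure.
    stages = [
        ("fetch", [sys.executable, "-c", "pass"]),
        ("parse", [sys.executable, "-c", "pass"]),
    ]
    for name, command in stages:
        stage, elapsed, code = time_stage(name, command)
        print(f"{stage}: {elapsed:.1f}s (exit {code})")

    # 100,000 pages in 1 to 2 hours, expressed as pages per second.
    print(round(pages_per_second(100_000, 3600), 1))  # 1 hour  -> 27.8
    print(round(pages_per_second(100_000, 7200), 1))  # 2 hours -> 13.9
```

So your fetch works out to somewhere between about 14 and 28 pages per
second, which gives us a number to compare against once I have timed mine
the same way.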

Kevin

On Tue, Sep 23, 2008 at 4:54 AM, Edward Quick <[EMAIL PROTECTED]> wrote:

>
> Hi,
>
> Has anyone tried benchmarking Nutch? I just wondered how long I should
> expect the different stages of a Nutch crawl to take.
>
> For example, I'm running Nutch on a RHEL4 machine with 4 Intel 2 GHz CPUs
> and 4 GB of RAM. This is my Nutch fetch process:
>
> /usr/jdk1.5.0_10/bin/java -Xmx2000m -Dhadoop.log.dir=/nutch/search/logs
> -Dhadoop.log.file=hadoop.log
> -Djava.library.path=/nutch/search/lib/native/Linux-i386-32
> -Dhadoop.tmp.dir=/nutch/tmp -Djava.io.tmpdir=/nutch/tmp -classpath
> /nutch/search:/nutch/search/conf:/usr/jdk1.5.0_10/lib/tools.jar:/nutch/search/build:/nutch/search/build/test/classes:/nutch/search/build/nutch-1.0-dev.job:/nutch/search/nutch-*.job:/nutch/search/lib/commons-cli-2.0-SNAPSHOT.jar:/nutch/search/lib/commons-codec-1.3.jar:/nutch/search/lib/commons-httpclient-3.0.1.jar:/nutch/search/lib/commons-lang-2.1.jar:/nutch/search/lib/commons-logging-1.0.4.jar:/nutch/search/lib/commons-logging-api-1.0.4.jar:/nutch/search/lib/hadoop-0.17.1-core.jar:/nutch/search/lib/icu4j-3_6.jar:/nutch/search/lib/jakarta-oro-2.0.7.jar:/nutch/search/lib/jets3t-0.5.0.jar:/nutch/search/lib/jetty-5.1.4.jar:/nutch/search/lib/junit-3.8.1.jar:/nutch/search/lib/log4j-1.2.13.jar:/nutch/search/lib/lucene-core-2.3.0.jar:/nutch/search/lib/lucene-misc-2.3.0.jar:/nutch/search/lib/servlet-api.jar:/nutch/search/lib/taglibs-i18n.jar:/nutch/search/lib/tika-0.1-incubating.jar:/nutch/search/lib/xerces-2_6_2-apis.jar:/nutch/search/lib/xerces-2_6_2.jar:/nutch/search/lib/jetty-ext/ant.jar:/nutch/search/lib/jetty-ext/commons-el.jar:/nutch/search/lib/jetty-ext/jasper-compiler.jar:/nutch/search/lib/jetty-ext/jasper-runtime.jar:/nutch/search/lib/jetty-ext/jsp-api.jar
> org.apache.nutch.fetcher.Fetcher crawl/segments/20080923105853
>
> and a fetch of about 100,000 pages (with 20 threads per host) takes around
> 1-2 hours. Does that seem reasonable or too slow?
>
> Thanks for any help.
>
> Ed.