Re: benchmarking

Doğacan Güney Tue, 23 Sep 2008 12:54:37 -0700

On Tue, Sep 23, 2008 at 8:51 PM, Kevin MacDonald <[EMAIL PROTECTED]> wrote:
> Some additional info. We are running Nutch on one of Amazon's EC2 small
> instances, which has the equivalent CPU capacity of a 1.0-1.2 GHz 2007
> Opteron or 2007 Xeon processor with 1.7GB of RAM.
> To crawl 35,000 URLs to a depth of 1, here's a breakdown of processing times
> from the log file:
>  - Injecting URLs into the crawl db: 1 min
>  - Fetching: 46 min
>  - Additional processing (unknown): 66 min
>
> Fetching is happening at a rate of about 760 URLs/min, or 1.1 million per
> day. The big block of additional processing happens after the last fetch. I
> don't really know what Nutch is doing during that time. Parsing perhaps? I
> would really like to know because that is killing my performance.
>
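
As a quick sanity check on the figures quoted above, here is a small Python
sketch of the arithmetic. The numbers come straight from the log breakdown;
the variable names are only illustrative.

```python
# Figures taken from the quoted log breakdown.
urls = 35_000
inject_min = 1
fetch_min = 46
other_min = 66   # the unexplained post-fetch processing

# Throughput while fetching only.
fetch_rate = urls / fetch_min          # URLs per minute
per_day = fetch_rate * 60 * 24         # URLs per day at that rate

# Throughput over the whole run, including the post-fetch block.
total_min = inject_min + fetch_min + other_min
overall_rate = urls / total_min
post_fetch_share = other_min / total_min

print(f"fetch rate:       {fetch_rate:.0f} URLs/min ({per_day / 1e6:.2f} million/day)")
print(f"end-to-end rate:  {overall_rate:.0f} URLs/min over {total_min} min")
print(f"post-fetch share: {post_fetch_share:.0%} of wall time")
```

The fetch-only rate matches the quoted figure of roughly 760 URLs/min, but
measured end to end the rate drops to about 310 URLs/min, because the
post-fetch block accounts for more than half of the wall time.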


Are you using the "crawl" command? If you are serious about Nutch, I would
suggest that you use the individual commands (inject/fetch/parse/etc.). This
should give you a better idea of what is taking so long.
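
One way to see where the time goes is to wrap each individual command in a
timer. The following is a minimal Python sketch, not a tested recipe: it
assumes it is run from the Nutch install directory, that inject and generate
have already been run, and that the crawl db lives at crawl/crawldb (a
hypothetical path). The segment path is the one from Edward's message. If your
fetcher is configured to parse while fetching (the fetcher.parse property),
the separate parse step would be redundant, so check your configuration first.

```python
import subprocess
import time

# Segment path as it appears earlier in this thread; the crawl db path is
# an assumption for illustration only.
SEGMENT = "crawl/segments/20080923105853"
CRAWLDB = "crawl/crawldb"

# Stages to time, in order. Adjust the arguments to match your own setup.
NUTCH_STAGES = [
    ("fetch", ["bin/nutch", "fetch", SEGMENT]),
    ("parse", ["bin/nutch", "parse", SEGMENT]),
    ("updatedb", ["bin/nutch", "updatedb", CRAWLDB, SEGMENT]),
]


def run_stages(stages):
    """Run each (name, argv) stage in order; return [(name, seconds), ...]."""
    timings = []
    for name, argv in stages:
        start = time.perf_counter()
        # check=True stops the run if a stage exits with a non-zero status.
        subprocess.run(argv, check=True)
        timings.append((name, time.perf_counter() - start))
    return timings


def report(timings):
    """Print each stage's wall time and its share of the total."""
    total = sum(seconds for _, seconds in timings)
    for name, seconds in timings:
        share = seconds / total if total > 0 else 0.0
        print(f"{name:10s} {seconds / 60:8.1f} min  {share:6.1%}")
```

Calling report(run_stages(NUTCH_STAGES)) would then print one line per stage,
which should make it clear whether parsing or the crawl db update is
responsible for the long block after the fetch.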

> Kevin
>
> On Tue, Sep 23, 2008 at 10:14 AM, Kevin MacDonald <[EMAIL PROTECTED]> wrote:
>
>> Edward,
>> I have been doing Crawl operations as opposed to the Fetch operation you're
>> doing below. I think I am a little unclear on the difference. Since you're
>> specifying a segment path when doing a Fetch does that mean you have already
>> crawled? If we can break out the operations each of us is doing end to end,
>> perhaps we can get an apples-to-apples performance comparison. What I am
>> doing is crawling a list of perhaps 10,000 URLs to a depth of 1 only. Most
>> are from different hosts. I am finding that there are two main blocks of
>> computation time when I crawl: there is the fetching which seems to happen
>> quite fast, and that is followed by a lengthy process where the CPU of the
>> machine is at 100%, but I'm not sure what it's doing. Perhaps it's parsing
>> at that point? Can you tell me what your operations are and what your
>> configuration is?
>>
>> Kevin
>>
>>
>> On Tue, Sep 23, 2008 at 4:54 AM, Edward Quick <[EMAIL PROTECTED]> wrote:
>>
>>>
>>> Hi,
>>>
>>> Has anyone tried benchmarking Nutch? I just wondered how long I should
>>> expect different stages of a Nutch crawl to take.
>>>
>>> For example, I'm running Nutch on a RHEL4 machine with four Intel 2 GHz CPUs
>>> and 4 GB of RAM. This is my Nutch fetch process:
>>>
>>> /usr/jdk1.5.0_10/bin/java -Xmx2000m -Dhadoop.log.dir=/nutch/search/logs
>>> -Dhadoop.log.file=hadoop.log
>>> -Djava.library.path=/nutch/search/lib/native/Linux-i386-32
>>> -Dhadoop.tmp.dir=/nutch/tmp -Djava.io.tmpdir=/nutch/tmp -classpath
>>> /nutch/search:/nutch/search/conf:/usr/jdk1.5.0_10/lib/tools.jar:/nutch/search/build:/nutch/search/build/test/classes:/nutch/search/build/nutch-1.0-dev.job:/nutch/search/nutch-*.job:/nutch/search/lib/commons-cli-2.0-SNAPSHOT.jar:/nutch/search/lib/commons-codec-1.3.jar:/nutch/search/lib/commons-httpclient-3.0.1.jar:/nutch/search/lib/commons-lang-2.1.jar:/nutch/search/lib/commons-logging-1.0.4.jar:/nutch/search/lib/commons-logging-api-1.0.4.jar:/nutch/search/lib/hadoop-0.17.1-core.jar:/nutch/search/lib/icu4j-3_6.jar:/nutch/search/lib/jakarta-oro-2.0.7.jar:/nutch/search/lib/jets3t-0.5.0.jar:/nutch/search/lib/jetty-5.1.4.jar:/nutch/search/lib/junit-3.8.1.jar:/nutch/search/lib/log4j-1.2.13.jar:/nutch/search/lib/lucene-core-2.3.0.jar:/nutch/search/lib/lucene-misc-2.3.0.jar:/nutch/search/lib/servlet-api.jar:/nutch/search/lib/taglibs-i18n.jar:/nutch/search/lib/tika-0.1-incubating.jar:/nutch/search/lib/xerces-2_6_2-apis.jar:/nutch/search/lib/xerces-2_6_2.jar:/nutch/search/lib/jetty-ext/ant.jar:/nutch/search/lib/jetty-ext/commons-el.jar:/nutch/search/lib/jetty-ext/jasper-compiler.jar:/nutch/search/lib/jetty-ext/jasper-runtime.jar:/nutch/search/lib/jetty-ext/jsp-api.jar
>>> org.apache.nutch.fetcher.Fetcher crawl/segments/20080923105853
>>>
>>> and a fetch of about 100,000 pages (with 20 threads per host) takes around
>>> 1-2 hours. Does that seem reasonable or too slow?
>>>
>>> Thanks for any help.
>>>
>>> Ed.
>>
>>
>>
>



-- 
Doğacan Güney
