Hi Kevin,

Thanks for your reply.

I haven't checked the other crawl stages, as the bottleneck for me is the Nutch 
fetch (mine is configured to parse and store content). Also, I'm fetching 
from an F5 load-balanced pair of Domino (Lotus Notes) intranet servers, so 
network speed is not a factor.

The actual "fetching/parsing" of URLs seems quick, around 1,000 a minute with 
20 threads, but there is some processing after that which seems to grow much 
faster than linearly as the list gets bigger. I timed some of the fetches, 
and you can see here how much the time increases after I get past 20,000 URLs:

bin/nutch readseg -list -dir crawl/segments/
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED  *ACTUAL TIME ELAPSED*
20080924121754  1          2008-09-24T12:18:04  2008-09-24T12:18:04  1        1
20080924121816  58         2008-09-24T12:18:47  2008-09-24T12:18:53  58       25
20080924121920  363        2008-09-24T12:19:40  2008-09-24T12:20:02  379      214
20080924122034  1987       2008-09-24T12:21:14  2008-09-24T12:22:50  2085     1434
20080924122344  4494       2008-09-24T12:24:25  2008-09-24T12:31:14  4598     4318
20080924125007  6802       2008-09-24T12:51:02  2008-09-24T13:01:32  6874     6462    => 14:43.16 elapsed
20080924131404  8170       2008-09-24T13:15:04  2008-09-24T13:22:29  8317     7802    => 13:26.59 elapsed
20080924132912  21065      2008-09-24T13:30:28  2008-09-24T13:42:49  21081    19699   => 43:42.11 elapsed
20080924141603  26205      2008-09-24T14:18:42  2008-09-24T14:39:26  26327    24649   => 1:50:29 elapsed
20080924161725  10998      ?                    ?                    ?        ?


Can anyone tell me what is going on in the fetch stage after the URLs have been 
fetched and parsed, please? Can it be sped up in any way?
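
In case it helps with diagnosis, my next experiment will be to split the fetch 
from the parse so each step can be timed on its own. Assuming the Fetcher's 
-noParsing flag and the standalone parse command do what I expect (the segment 
name below is just an example):

time bin/nutch fetch crawl/segments/20080924161725 -threads 20 -noParsing
time bin/nutch parse crawl/segments/20080924161725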

Thanks,

Ed.




> Date: Tue, 23 Sep 2008 13:57:14 -0700
> From: [EMAIL PROTECTED]
> To: [email protected]
> Subject: Re: benchmarking
> 
> I am using individual commands called from Java. I simply used Crawl.java as
> a starting point. Because I'm not using Nutch for search, I have already
> eliminated quite a few things, such as building indexes and inverting links.
> All the fetching and subsequent lengthy operations are happening in
> Fetcher.fetch(segments, threads). My next step is to hack into that and see
> what's going on. When I crank up the logs I see a ton of map/reduce stuff
> happening.
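> 
> In shell terms, the per-round sequence my Java code drives is roughly the
> following (paths and thread count are examples; the indexing and
> link-inversion steps are the ones I eliminated):
> 
> bin/nutch inject crawl/crawldb urls
> bin/nutch generate crawl/crawldb crawl/segments
> segment=`ls -d crawl/segments/* | tail -1`
> bin/nutch fetch $segment -threads 20
> bin/nutch updatedb crawl/crawldb $segment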
> 
> On Tue, Sep 23, 2008 at 12:54 PM, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> 
> > On Tue, Sep 23, 2008 at 8:51 PM, Kevin MacDonald <[EMAIL PROTECTED]> wrote:
> > > Some additional info. We are running Nutch on one of Amazon's EC2 small
> > > instances, which has the equivalent CPU capacity of a 1.0-1.2 GHz 2007
> > > Opteron or 2007 Xeon processor, with 1.7 GB of RAM.
> > > To crawl 35,000 URLs to a depth of 1, here's a breakdown of processing
> > > times from the log file:
> > >  - Injecting URLs into the crawl db: 1 min
> > >  - Fetching: 46 min
> > >  - Additional processing (unknown): 66 min
> > >
> > > Fetching is happening at a rate of about 760 URLs/min, or 1.1 million per
> > > day. The big block of additional processing happens after the last fetch.
> > > I don't really know what Nutch is doing during that time. Parsing
> > > perhaps? I would really like to know, because that is killing my
> > > performance.
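> > >
> > > (Checking the arithmetic: 35,000 URLs / 46 min is about 760 URLs/min,
> > > and 760 x 60 x 24 is about 1.1 million per day, so it is the 66 min of
> > > unknown processing, not the fetching, that dominates the run.)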
> > >
> >
> > Are you using the "crawl" command? If you are serious about Nutch, I would
> > suggest that you use the individual commands (inject/fetch/parse/etc.).
> > This should give you a better idea of what is taking so long.
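> >
> > Timing each stage separately, and then checking the crawl db with
> > something like (path is an example)
> >
> > bin/nutch readdb crawl/crawldb -stats
> >
> > should make it clearer which stage dominates.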
> >
> > > Kevin
> > >
> > > On Tue, Sep 23, 2008 at 10:14 AM, Kevin MacDonald <[EMAIL PROTECTED]> wrote:
> > >
> > >> Edward,
> > >> I have been doing Crawl operations as opposed to the Fetch operation
> > >> you're doing below. I think I am a little unclear on the difference.
> > >> Since you're specifying a segment path when doing a Fetch, does that
> > >> mean you have already crawled? If we can break out the operations each
> > >> of us is doing end to end, perhaps we can get an apples-to-apples
> > >> performance comparison. What I am doing is crawling a list of perhaps
> > >> 10,000 URLs to a depth of 1 only. Most are from different hosts. I am
> > >> finding that there are two main blocks of computation time when I
> > >> crawl: there is the fetching, which seems to happen quite fast, and
> > >> that is followed by a lengthy process where the CPU of the machine is
> > >> at 100%, but I'm not sure what it's doing. Perhaps it's parsing at
> > >> that point? Can you tell me what your operations are and what your
> > >> configuration is?
> > >>
> > >> Kevin
> > >>
> > >>
> > >> On Tue, Sep 23, 2008 at 4:54 AM, Edward Quick <[EMAIL PROTECTED]> wrote:
> > >>
> > >>>
> > >>> Hi,
> > >>>
> > >>> Has anyone tried benchmarking Nutch? I just wondered how long I should
> > >>> expect the different stages of a Nutch crawl to take.
> > >>>
> > >>> For example, I'm running Nutch on an RHEL4 machine with four Intel
> > >>> 2 GHz CPUs and 4 GB of RAM. This is my Nutch fetch process:
> > >>>
> > >>> /usr/jdk1.5.0_10/bin/java -Xmx2000m -Dhadoop.log.dir=/nutch/search/logs
> > >>> -Dhadoop.log.file=hadoop.log
> > >>> -Djava.library.path=/nutch/search/lib/native/Linux-i386-32
> > >>> -Dhadoop.tmp.dir=/nutch/tmp -Djava.io.tmpdir=/nutch/tmp -classpath
> > >>> /nutch/search:/nutch/search/conf:/usr/jdk1.5.0_10/lib/tools.jar:/nutch/search/build:/nutch/search/build/test/classes:/nutch/search/build/nutch-1.0-dev.job:/nutch/search/nutch-*.job:/nutch/search/lib/commons-cli-2.0-SNAPSHOT.jar:/nutch/search/lib/commons-codec-1.3.jar:/nutch/search/lib/commons-httpclient-3.0.1.jar:/nutch/search/lib/commons-lang-2.1.jar:/nutch/search/lib/commons-logging-1.0.4.jar:/nutch/search/lib/commons-logging-api-1.0.4.jar:/nutch/search/lib/hadoop-0.17.1-core.jar:/nutch/search/lib/icu4j-3_6.jar:/nutch/search/lib/jakarta-oro-2.0.7.jar:/nutch/search/lib/jets3t-0.5.0.jar:/nutch/search/lib/jetty-5.1.4.jar:/nutch/search/lib/junit-3.8.1.jar:/nutch/search/lib/log4j-1.2.13.jar:/nutch/search/lib/lucene-core-2.3.0.jar:/nutch/search/lib/lucene-misc-2.3.0.jar:/nutch/search/lib/servlet-api.jar:/nutch/search/lib/taglibs-i18n.jar:/nutch/search/lib/tika-0.1-incubating.jar:/nutch/search/lib/xerces-2_6_2-apis.jar:/nutch/search/lib/xerces-2_6_2.jar:/nutch/search/lib/jetty-ext/ant.jar:/nutch/search/lib/jetty-ext/commons-el.jar:/nutch/search/lib/jetty-ext/jasper-compiler.jar:/nutch/search/lib/jetty-ext/jasper-runtime.jar:/nutch/search/lib/jetty-ext/jsp-api.jar
> > >>> org.apache.nutch.fetcher.Fetcher crawl/segments/20080923105853
> > >>>
> > >>> and a fetch of about 100,000 pages (with 20 threads per host) takes
> > >>> around 1-2 hours. Does that seem reasonable, or too slow?
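> > >>>
> > >>> (For scale, that works out to between 100,000/120 ≈ 830 and
> > >>> 100,000/60 ≈ 1,670 pages per minute.)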
> > >>>
> > >>> Thanks for any help.
> > >>>
> > >>> Ed.
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> > >>
> > >
> >
> >
> >
> > --
> > Doğacan Güney
> >
