Hi Kevin,
Thanks for your reply.
I haven't checked the other crawl stages as the bottleneck for me is the Nutch
fetch (mine's configured to do parsing and store content). Also, I'm fetching
from an F5 load-balanced pair of Domino Lotus Notes intranet servers, so
network speed is not a factor.
The actual time "fetching/parsing" urls seems quick, around 1000 a minute with
20 threads, but then there's some processing after that, which seems to grow
much faster than linearly as the list gets bigger. I timed some of the fetches,
and you can see here how much the time increases after I get past 20,000 urls:
bin/nutch readseg -list -dir crawl/segments/
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED  *ACTUAL TIME ELAPSED*
20080924121754          1  2008-09-24T12:18:04  2008-09-24T12:18:04        1       1
20080924121816         58  2008-09-24T12:18:47  2008-09-24T12:18:53       58      25
20080924121920        363  2008-09-24T12:19:40  2008-09-24T12:20:02      379     214
20080924122034       1987  2008-09-24T12:21:14  2008-09-24T12:22:50     2085    1434
20080924122344       4494  2008-09-24T12:24:25  2008-09-24T12:31:14     4598    4318
20080924125007       6802  2008-09-24T12:51:02  2008-09-24T13:01:32     6874    6462  14:43.16
20080924131404       8170  2008-09-24T13:15:04  2008-09-24T13:22:29     8317    7802  13:26.59
20080924132912      21065  2008-09-24T13:30:28  2008-09-24T13:42:49    21081   19699  43:42.11
20080924141603      26205  2008-09-24T14:18:42  2008-09-24T14:39:26    26327   24649  1:50:29
20080924161725      10998  ?                    ?                          ?       ?
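For reference, the per-segment fetch window and rate can be derived from the
readseg timestamps. A rough sketch (assumes GNU date), using the 20080924132912
segment above:

```shell
# Rough sketch (assumes GNU date): derive the fetcher window and rate
# for one segment from its readseg timestamps.
start="2008-09-24T13:30:28"   # FETCHER START for segment 20080924132912
end="2008-09-24T13:42:49"     # FETCHER END
fetched=21081
secs=$(( $(date -d "$end" +%s) - $(date -d "$start" +%s) ))
echo "fetch window: ${secs}s, rate: $(( fetched * 60 / secs )) urls/min"
```

That works out to ~1700 urls/min inside the fetcher window itself, yet the
total elapsed time for that segment was 43:42, so most of the time is going
to whatever runs after the fetching.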
Can anyone tell me what is going on in the fetch stage after the urls have been
fetched and parsed, please? Can this be sped up in any way?
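For context, my crawl loop is essentially the standard individual-command
cycle. A dry-run sketch (it just echoes each step; set NUTCH=bin/nutch to
actually execute; the segment name is only an example, and since my fetcher
parses inline the separate parse step is optional):

```shell
# Sketch of the per-segment crawl cycle (Nutch 0.x-era CLI).
# Dry run by default: NUTCH echoes each command instead of running it.
NUTCH=${NUTCH:-echo bin/nutch}
$NUTCH inject crawl/crawldb urls
$NUTCH generate crawl/crawldb crawl/segments
segment=crawl/segments/20080924121754   # example segment name
$NUTCH fetch "$segment" -threads 20
$NUTCH parse "$segment"                 # optional if fetcher.parse=true
$NUTCH updatedb crawl/crawldb "$segment"
```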
Thanks,
Ed.
> Date: Tue, 23 Sep 2008 13:57:14 -0700
> From: [EMAIL PROTECTED]
> To: [email protected]
> Subject: Re: benchmarking
>
> I am using individual commands called from Java. I simply used Crawl.java as
> a starting point. Because I'm not using nutch for search I have already
> eliminated quite a few things such as building indexes and inverting links.
> All the fetching and subsequent lengthy operations are happening in
> Fetcher.fetch(segments, threads). My next step is to hack into that and see
> what's going on. When I crank up the logs I see a ton of map/reduce stuff
> happening.
>
> On Tue, Sep 23, 2008 at 12:54 PM, Doğacan Güney <[EMAIL PROTECTED]> wrote:
>
> > On Tue, Sep 23, 2008 at 8:51 PM, Kevin MacDonald <[EMAIL PROTECTED]>
> > wrote:
> > > Some additional info. We are running Nutch on one of Amazon's EC2 small
> > > instances, which has the equivalent CPU capacity of a 1.0-1.2 GHz 2007
> > > Opteron or 2007 Xeon processor with 1.7GB of RAM.
> > > To crawl 35,000 Urls to a depth of 1, here's a breakdown of processing
> > > times from the log file:
> > > - Injecting urls into the Crawl db: 1 min
> > > - Fetching: 46 min
> > > - Additional processing (unknown): 66 min
> > >
> > > Fetching is happening at a rate of about 760 Urls/min, or 1.1 million
> > > per day. The big block of additional processing happens after the last
> > > fetch. I don't really know what Nutch is doing during that time. Parsing
> > > perhaps? I would really like to know because that is killing my
> > > performance.
> > >
> >
> > Are you using the "crawl" command? If you are serious about nutch, I
> > would suggest that you use individual commands (inject/fetch/parse/etc).
> > This should give you a better idea of what is taking so long.
> >
> > > Kevin
> > >
> > > On Tue, Sep 23, 2008 at 10:14 AM, Kevin MacDonald <[EMAIL PROTECTED]> wrote:
> > >
> > >> Edward,
> > >> I have been doing Crawl operations as opposed to the Fetch operation
> > >> you're doing below. I think I am a little unclear on the difference.
> > >> Since you're specifying a segment path when doing a Fetch, does that
> > >> mean you have already crawled? If we can break out the operations each
> > >> of us is doing end to end, perhaps we can get an apples-to-apples
> > >> performance comparison. What I am doing is crawling a list of perhaps
> > >> 10,000 Urls to a depth of 1 only. Most are from different hosts. I am
> > >> finding that there are two main blocks of computation time when I
> > >> crawl: there is the fetching, which seems to happen quite fast, and
> > >> that is followed by a lengthy process where the CPU of the machine is
> > >> at 100%, but I'm not sure what it's doing. Perhaps it's parsing at
> > >> that point? Can you tell me what your operations are and what your
> > >> configuration is?
> > >>
> > >> Kevin
> > >>
> > >>
> > >> On Tue, Sep 23, 2008 at 4:54 AM, Edward Quick <[EMAIL PROTECTED]> wrote:
> > >>
> > >>>
> > >>> Hi,
> > >>>
> > >>> Has anyone tried benchmarking nutch? I just wondered how long I should
> > >>> expect different stages of a nutch crawl to take.
> > >>>
> > >>> For example, I'm running Nutch on an RHEL4 machine with 4 Intel 2GHz
> > >>> CPUs and 4GB RAM. This is my nutch fetch process:
> > >>>
> > >>> /usr/jdk1.5.0_10/bin/java -Xmx2000m -Dhadoop.log.dir=/nutch/search/logs
> > >>> -Dhadoop.log.file=hadoop.log
> > >>> -Djava.library.path=/nutch/search/lib/native/Linux-i386-32
> > >>> -Dhadoop.tmp.dir=/nutch/tmp -Djava.io.tmpdir=/nutch/tmp -classpath
> > >>> /nutch/search:/nutch/search/conf:/usr/jdk1.5.0_10/lib/tools.jar:/nutch/search/build:/nutch/search/build/test/classes:/nutch/search/build/nutch-1.0-dev.job:/nutch/search/nutch-*.job:/nutch/search/lib/commons-cli-2.0-SNAPSHOT.jar:/nutch/search/lib/commons-codec-1.3.jar:/nutch/search/lib/commons-httpclient-3.0.1.jar:/nutch/search/lib/commons-lang-2.1.jar:/nutch/search/lib/commons-logging-1.0.4.jar:/nutch/search/lib/commons-logging-api-1.0.4.jar:/nutch/search/lib/hadoop-0.17.1-core.jar:/nutch/search/lib/icu4j-3_6.jar:/nutch/search/lib/jakarta-oro-2.0.7.jar:/nutch/search/lib/jets3t-0.5.0.jar:/nutch/search/lib/jetty-5.1.4.jar:/nutch/search/lib/junit-3.8.1.jar:/nutch/search/lib/log4j-1.2.13.jar:/nutch/search/lib/lucene-core-2.3.0.jar:/nutch/search/lib/lucene-misc-2.3.0.jar:/nutch/search/lib/servlet-api.jar:/nutch/search/lib/taglibs-i18n.jar:/nutch/search/lib/tika-0.1-incubating.jar:/nutch/search/lib/xerces-2_6_2-apis.jar:/nutch/search/lib/xerces-2_6_2.jar:/nutch/search/lib/jetty-ext/ant.jar:/nutch/search/lib/jetty-ext/commons-el.jar:/nutch/search/lib/jetty-ext/jasper-compiler.jar:/nutch/search/lib/jetty-ext/jasper-runtime.jar:/nutch/search/lib/jetty-ext/jsp-api.jar
> > >>> org.apache.nutch.fetcher.Fetcher crawl/segments/20080923105853
> > >>>
> > >>> and a fetch of about 100,000 pages (with 20 threads per host) takes
> > >>> around 1-2 hours. Does that seem reasonable or too slow?
> > >>>
> > >>> Thanks for any help.
> > >>>
> > >>> Ed.
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>
> > >>
> > >>
> > >
> >
> >
> >
> > --
> > Doğacan Güney
> >