Are you fetching from the same host? If the url list is concentrated in a
few hosts, then because of the politeness settings a lot of the time will
be spent waiting.
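
The two property names below are the standard politeness settings from
nutch-default.xml; this little sketch just prints whatever your config
resolves them to (the defaults in the comments are the stock ones as I
remember them, so double-check against your version):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class PolitenessCheck {
      public static void main(String[] args) {
        // Loads nutch-default.xml plus any nutch-site.xml overrides.
        Configuration conf = NutchConfiguration.create();
        // Stock defaults are 1 thread per host and a 5 second delay
        // between requests to the same host, which caps a single-host
        // crawl at roughly 12 urls a minute no matter how many fetcher
        // threads you run.
        System.out.println("fetcher.threads.per.host = "
            + conf.getInt("fetcher.threads.per.host", 1));
        System.out.println("fetcher.server.delay = "
            + conf.getFloat("fetcher.server.delay", 5.0f));
      }
    }

If all your urls do share a host, raising fetcher.threads.per.host or
lowering fetcher.server.delay in nutch-site.xml is the usual fix, at the
cost of hitting that server harder.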

On Wed, 2008-09-24 at 15:35 +0000, Edward Quick wrote:
> 
> Hi Kevin,
> 
> Thanks for your reply.
> 
> I haven't checked the other crawl stages, as the bottleneck for me is the 
> nutch fetch (mine is configured to parse and store content). Also, I'm 
> fetching from an F5 load-balanced pair of Domino Lotus Notes intranet 
> servers, so network speed is not a factor.
> 
> The actual time "fetching/parsing" urls seems quick, around 1000 a minute 
> with 20 threads, but then there's some processing after that which seems to 
> increase exponentially as the list gets bigger (roughly 3x the urls takes 
> 8x the time: 8170 urls in about 13.5 minutes versus 26205 in about 110). I 
> timed some of the fetches, and you can see here how much the time increases 
> after I get to 20000 urls:
> 
> bin/nutch readseg -list -dir crawl/segments/
> NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED  *ACTUAL TIME ELAPSED*
> 20080924121754  1          2008-09-24T12:18:04  2008-09-24T12:18:04  1        1
> 20080924121816  58         2008-09-24T12:18:47  2008-09-24T12:18:53  58       25
> 20080924121920  363        2008-09-24T12:19:40  2008-09-24T12:20:02  379      214
> 20080924122034  1987       2008-09-24T12:21:14  2008-09-24T12:22:50  2085     1434
> 20080924122344  4494       2008-09-24T12:24:25  2008-09-24T12:31:14  4598     4318
> 20080924125007  6802       2008-09-24T12:51:02  2008-09-24T13:01:32  6874     6462    => 14:43.16 elapsed
> 20080924131404  8170       2008-09-24T13:15:04  2008-09-24T13:22:29  8317     7802    => 13:26.59 elapsed
> 20080924132912  21065      2008-09-24T13:30:28  2008-09-24T13:42:49  21081    19699   => 43:42.11 elapsed
> 20080924141603  26205      2008-09-24T14:18:42  2008-09-24T14:39:26  26327    24649   => 1:50:29 elapsed
> 20080924161725  10998      ?                    ?                    ?        ?
> 
> 
> Can anyone tell me what is going on in the fetch stage after the urls have 
> been fetched and parsed, please? Can this be sped up in any way?
> 
> Thanks,
> 
> Ed.
> 
> 
> 
> 
> > Date: Tue, 23 Sep 2008 13:57:14 -0700
> > From: [EMAIL PROTECTED]
> > To: [email protected]
> > Subject: Re: benchmarking
> > 
> > I am using individual commands called from Java. I simply used Crawl.java as
> > a starting point. Because I'm not using nutch for search, I have already
> > eliminated quite a few things, such as building indexes and inverting links.
> > All the fetching and subsequent lengthy operations happen in
> > Fetcher.fetch(segments, threads). My next step is to dig into that and see
> > what's going on. When I turn up the logging I see a ton of map/reduce
> > activity.
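> > 
> > For what it's worth, the skeleton I ended up with looks roughly like this
> > (pared down from Crawl.java; the class name and paths are just placeholders,
> > and the method signatures shift between releases, so treat it as a sketch
> > against my checkout rather than anything definitive):
> > 
> >     import org.apache.hadoop.conf.Configuration;
> >     import org.apache.hadoop.fs.Path;
> >     import org.apache.nutch.crawl.CrawlDb;
> >     import org.apache.nutch.crawl.Generator;
> >     import org.apache.nutch.crawl.Injector;
> >     import org.apache.nutch.fetcher.Fetcher;
> >     import org.apache.nutch.util.NutchConfiguration;
> > 
> >     // One inject/generate/fetch/updatedb cycle; indexing and link
> >     // inversion are dropped since we don't use nutch for search.
> >     public class MiniCrawl {
> >       public static void main(String[] args) throws Exception {
> >         Configuration conf = NutchConfiguration.create();
> >         Path crawlDb = new Path("crawl/crawldb");
> >         Path segments = new Path("crawl/segments");
> >         int threads = 20;
> > 
> >         new Injector(conf).inject(crawlDb, new Path("urls"));
> >         Path segment = new Generator(conf).generate(
> >             crawlDb, segments, -1, Long.MAX_VALUE, System.currentTimeMillis());
> >         // This is the call where all the post-fetch time goes;
> >         // parsing happens inline here when fetcher.parse is true.
> >         new Fetcher(conf).fetch(segment, threads);
> >         new CrawlDb(conf).update(crawlDb, new Path[] { segment }, true, true);
> >       }
> >     }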
> > 
> > On Tue, Sep 23, 2008 at 12:54 PM, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> > 
> > > On Tue, Sep 23, 2008 at 8:51 PM, Kevin MacDonald <[EMAIL PROTECTED]> wrote:
> > > > Some additional info. We are running Nutch on one of Amazon's EC2 small
> > > > instances, which has the equivalent CPU capacity of a 1.0-1.2 GHz 2007
> > > > Opteron or 2007 Xeon processor, with 1.7GB of RAM.
> > > > To crawl 35,000 Urls to a depth of 1, here's a breakdown of processing
> > > > times from the log file:
> > > >  - Injecting urls into the Crawl db: 1 min
> > > >  - Fetching: 46 min
> > > >  - Additional processing (unknown): 66 min
> > > >
> > > > Fetching is happening at a rate of about 760 Urls/min, or 1.1 million per
> > > > day. The big block of additional processing happens after the last fetch.
> > > > I don't really know what Nutch is doing during that time. Parsing perhaps?
> > > > I would really like to know, because that is killing my performance.
> > > >
> > >
> > > Are you using the "crawl" command? If you are serious about nutch, I would
> > > suggest that you use the individual commands (inject/fetch/parse/etc). This
> > > should give you a better idea of what is taking so long.
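> > >
> > > Roughly this sequence, from memory, so double-check the arguments against
> > > the usage text bin/nutch prints. With fetcher.parse set to false, parsing
> > > runs as its own step, so timing each command shows where the hours go:
> > >
> > >   bin/nutch inject crawl/crawldb urls
> > >   bin/nutch generate crawl/crawldb crawl/segments
> > >   bin/nutch fetch crawl/segments/<segment>
> > >   bin/nutch parse crawl/segments/<segment>
> > >   bin/nutch updatedb crawl/crawldb crawl/segments/<segment>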
> > >
> > > > Kevin
> > > >
> > > > On Tue, Sep 23, 2008 at 10:14 AM, Kevin MacDonald <[EMAIL PROTECTED]> wrote:
> > > >
> > > >> Edward,
> > > >> I have been doing Crawl operations as opposed to the Fetch operation
> > > >> you're doing below. I think I am a little unclear on the difference.
> > > >> Since you're specifying a segment path when doing a Fetch, does that
> > > >> mean you have already crawled? If we can break out the operations each
> > > >> of us is doing end to end, perhaps we can get an apples-to-apples
> > > >> performance comparison. What I am doing is crawling a list of perhaps
> > > >> 10,000 Urls to a depth of 1 only. Most are from different hosts. I am
> > > >> finding that there are two main blocks of computation time when I
> > > >> crawl: there is the fetching, which seems to happen quite fast, and
> > > >> that is followed by a lengthy process where the CPU of the machine is
> > > >> at 100%, but I'm not sure what it's doing. Perhaps it's parsing at
> > > >> that point? Can you tell me what your operations are and what your
> > > >> configuration is?
> > > >>
> > > >> Kevin
> > > >>
> > > >>
> > > >> On Tue, Sep 23, 2008 at 4:54 AM, Edward Quick <[EMAIL PROTECTED]> wrote:
> > > >>
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> Has anyone tried benchmarking nutch? I just wondered how long I should
> > > >>> expect different stages of a nutch crawl to take.
> > > >>>
> > > >>> For example, I'm running Nutch on a RHEL4 machine with four Intel 2GHz
> > > >>> CPUs and 4GB of RAM. This is my nutch fetch process:
> > > >>>
> > > >>> /usr/jdk1.5.0_10/bin/java -Xmx2000m 
> > > >>> -Dhadoop.log.dir=/nutch/search/logs
> > > >>> -Dhadoop.log.file=hadoop.log
> > > >>> -Djava.library.path=/nutch/search/lib/native/Linux-i386-32
> > > >>> -Dhadoop.tmp.dir=/nutch/tmp -Djava.io.tmpdir=/nutch/tmp -classpath
> > > >>> /nutch/search:/nutch/search/conf:/usr/jdk1.5.0_10/lib/tools.jar:/nutch/search/build:/nutch/search/build/test/classes:/nutch/search/build/nutch-1.0-dev.job:/nutch/search/nutch-*.job:/nutch/search/lib/commons-cli-2.0-SNAPSHOT.jar:/nutch/search/lib/commons-codec-1.3.jar:/nutch/search/lib/commons-httpclient-3.0.1.jar:/nutch/search/lib/commons-lang-2.1.jar:/nutch/search/lib/commons-logging-1.0.4.jar:/nutch/search/lib/commons-logging-api-1.0.4.jar:/nutch/search/lib/hadoop-0.17.1-core.jar:/nutch/search/lib/icu4j-3_6.jar:/nutch/search/lib/jakarta-oro-2.0.7.jar:/nutch/search/lib/jets3t-0.5.0.jar:/nutch/search/lib/jetty-5.1.4.jar:/nutch/search/lib/junit-3.8.1.jar:/nutch/search/lib/log4j-1.2.13.jar:/nutch/search/lib/lucene-core-2.3.0.jar:/nutch/search/lib/lucene-misc-2.3.0.jar:/nutch/search/lib/servlet-api.jar:/nutch/search/lib/taglibs-i18n.jar:/nutch/search/lib/tika-0.1-incubating.jar:/nutch/search/lib/xerces-2_6_2-apis.jar:/nutch/search/lib/xerces-2_6_2.jar:/nutch/search/lib/jetty-ext/ant.jar:/nutch/search/lib/jetty-ext/commons-el.jar:/nutch/search/lib/jetty-ext/jasper-compiler.jar:/nutch/search/lib/jetty-ext/jasper-runtime.jar:/nutch/search/lib/jetty-ext/jsp-api.jar
> > > >>> org.apache.nutch.fetcher.Fetcher crawl/segments/20080923105853
> > > >>>
> > > >>> and a fetch of about 100,000 pages (with 20 threads per host) takes around
> > > >>> 1-2 hours, i.e. roughly 800-1,700 pages a minute. Does that seem reasonable
> > > >>> or too slow?
> > > >>>
> > > >>> Thanks for any help.
> > > >>>
> > > >>> Ed.
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>
> > > >>
> > > >>
> > > >
> > >
> > >
> > >
> > > --
> > > Doğacan Güney
> > >
> 
