I am using individual commands called from Java. I simply used Crawl.java as a starting point. Because I'm not using Nutch for search, I have already eliminated quite a few steps, such as building indexes and inverting links. All the fetching and the subsequent lengthy operations happen in Fetcher.fetch(segments, threads). My next step is to hack into that and see what's going on. When I crank up the logs I see a ton of map/reduce activity.
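Roughly, the reduced pipeline looks like the sketch below. This is written against my memory of the Nutch 0.9/1.0-era Java API (the same classes Crawl.java drives); method signatures such as Generator.generate and CrawlDb.update vary between versions, so treat the exact calls as approximate rather than drop-in code. The paths, thread count, and class name FetchOnlyCrawl are my own placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

// Sketch of a Crawl.java-style loop with indexing and link inversion removed.
public class FetchOnlyCrawl {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    Path crawlDb = new Path("crawl/crawldb");      // placeholder paths
    Path segmentsDir = new Path("crawl/segments");
    Path urlDir = new Path("urls");                // seed URL list directory
    int threads = 10;
    int depth = 1;

    new Injector(conf).inject(crawlDb, urlDir);
    for (int i = 0; i < depth; i++) {
      // Generator returns the path of the newly created segment in this API era.
      Path segment = new Generator(conf).generate(
          crawlDb, segmentsDir, -1, Long.MAX_VALUE, System.currentTimeMillis());
      new Fetcher(conf).fetch(segment, threads);   // the long-running step
      new ParseSegment(conf).parse(segment);       // CPU-heavy: the parsing pass
      // Signature varies by version; older releases take a single Path here.
      new CrawlDb(conf).update(crawlDb, new Path[] { segment }, true, true);
      // No LinkDb.invert(), Indexer, or DeleteDuplicates -- not needed
      // when Nutch isn't being used for search.
    }
  }
}
```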
On Tue, Sep 23, 2008 at 12:54 PM, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> On Tue, Sep 23, 2008 at 8:51 PM, Kevin MacDonald <[EMAIL PROTECTED]> wrote:
> > Some additional info. We are running Nutch on one of Amazon's EC2 small
> > instances, which has the equivalent CPU capacity of a 1.0-1.2 GHz 2007
> > Opteron or 2007 Xeon processor, with 1.7GB of RAM.
> > To crawl 35,000 URLs to a depth of 1, here's a breakdown of processing
> > times from the log file:
> > - Injecting URLs into the crawl db: 1 min
> > - Fetching: 46 min
> > - Additional processing (unknown): 66 min
> >
> > Fetching is happening at a rate of about 760 URLs/min, or 1.1 million per
> > day. The big block of additional processing happens after the last fetch.
> > I don't really know what Nutch is doing during that time. Parsing perhaps?
> > I would really like to know, because that is killing my performance.
>
> Are you using the "crawl" command? If you are serious about Nutch, I would
> suggest that you use the individual commands (inject/fetch/parse/etc.).
> This should give you a better idea of what is taking so long.
>
> > Kevin
> >
> > On Tue, Sep 23, 2008 at 10:14 AM, Kevin MacDonald <[EMAIL PROTECTED]> wrote:
> >
> >> Edward,
> >> I have been doing Crawl operations as opposed to the Fetch operation
> >> you're doing below. I think I am a little unclear on the difference.
> >> Since you're specifying a segment path when doing a Fetch, does that
> >> mean you have already crawled? If we can break out the operations each
> >> of us is doing end to end, perhaps we can get an apples-to-apples
> >> performance comparison. What I am doing is crawling a list of perhaps
> >> 10,000 URLs to a depth of 1 only. Most are from different hosts.
> >> I am finding that there are two main blocks of computation time when I
> >> crawl: there is the fetching, which seems to happen quite fast, and that
> >> is followed by a lengthy process where the CPU of the machine is at
> >> 100%, but I'm not sure what it's doing. Perhaps it's parsing at that
> >> point? Can you tell me what your operations are and what your
> >> configuration is?
> >>
> >> Kevin
> >>
> >> On Tue, Sep 23, 2008 at 4:54 AM, Edward Quick <[EMAIL PROTECTED]> wrote:
> >>
> >>> Hi,
> >>>
> >>> Has anyone tried benchmarking Nutch? I just wondered how long I should
> >>> expect the different stages of a Nutch crawl to take.
> >>>
> >>> For example, I'm running Nutch on a RHEL4 machine with 4 Intel 2 GHz
> >>> CPUs and 4GB RAM. This is my nutch fetch process:
> >>>
> >>> /usr/jdk1.5.0_10/bin/java -Xmx2000m -Dhadoop.log.dir=/nutch/search/logs
> >>> -Dhadoop.log.file=hadoop.log
> >>> -Djava.library.path=/nutch/search/lib/native/Linux-i386-32
> >>> -Dhadoop.tmp.dir=/nutch/tmp -Djava.io.tmpdir=/nutch/tmp -classpath
> >>> /nutch/search:/nutch/search/conf:/usr/jdk1.5.0_10/lib/tools.jar:/nutch/search/build:/nutch/search/build/test/classes:/nutch/search/build/nutch-1.0-dev.job:/nutch/search/nutch-*.job:/nutch/search/lib/commons-cli-2.0-SNAPSHOT.jar:/nutch/search/lib/commons-codec-1.3.jar:/nutch/search/lib/commons-httpclient-3.0.1.jar:/nutch/search/lib/commons-lang-2.1.jar:/nutch/search/lib/commons-logging-1.0.4.jar:/nutch/search/lib/commons-logging-api-1.0.4.jar:/nutch/search/lib/hadoop-0.17.1-core.jar:/nutch/search/lib/icu4j-3_6.jar:/nutch/search/lib/jakarta-oro-2.0.7.jar:/nutch/search/lib/jets3t-0.5.0.jar:/nutch/search/lib/jetty-5.1.4.jar:/nutch/search/lib/junit-3.8.1.jar:/nutch/search/lib/log4j-1.2.13.jar:/nutch/search/lib/lucene-core-2.3.0.jar:/nutch/search/lib/lucene-misc-2.3.0.jar:/nutch/search/lib/servlet-api.jar:/nutch/search/lib/taglibs-i18n.jar:/nutch/search/lib/tika-0.1-incubating.jar:/nutch/search/lib/xerces-2_6_2-apis.jar:/nutch/search/lib/xerces-2_6_2.jar:/nutch/search/lib/jetty-ext/ant.jar:/nutch/search/lib/jetty-ext/commons-el.jar:/nutch/search/lib/jetty-ext/jasper-compiler.jar:/nutch/search/lib/jetty-ext/jasper-runtime.jar:/nutch/search/lib/jetty-ext/jsp-api.jar
> >>> org.apache.nutch.fetcher.Fetcher crawl/segments/20080923105853
> >>>
> >>> and a fetch of about 100,000 pages (with 20 threads per host) takes
> >>> around 1-2 hours. Does that seem reasonable or too slow?
> >>>
> >>> Thanks for any help.
> >>>
> >>> Ed.
>
> --
> Doğacan Güney
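For reference, the fetch-rate figures quoted in the thread (35,000 URLs in 46 minutes) can be reproduced with quick arithmetic; the class name below is just for illustration:

```java
// Back-of-envelope check of the fetch rate reported above.
public class FetchRateCheck {
    public static void main(String[] args) {
        int urls = 35_000;      // URLs crawled to depth 1 (from the post)
        int minutes = 46;       // duration of the fetch phase (from the post)
        int perMinute = urls / minutes;       // ~760 URLs/min
        int perDay = perMinute * 60 * 24;     // ~1.1 million URLs/day
        System.out.println(perMinute + " URLs/min -> " + perDay + " URLs/day");
    }
}
```

That matches the "about 760 URLs/min, or 1.1 million per day" figure, so the rate quoted in the thread is internally consistent.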
