That is odd, because I am finding that upon completion of the last fetch there is a lengthy period of computation that has to complete before a fetch/parse is done. Fetching itself happens at a rate of about 1000 urls per minute, which seems fine, but then the additional processing makes the overall time rather slow. I cranked up logging and saw a great deal of output like that shown below. That seems to be taking up all the time. I'm wondering if there's something I can do to optimize Nutch for a single-machine install.

mapred.Counters (Counters.java:<init>(135)) - Creating group org.apache.hadoop.mapred.Task$FileSyst
mapred.Counters (Counters.java:getCounter(190)) - Adding Local bytes read at 0
mapred.Counters (Counters.java:getCounter(190)) - Adding Local bytes written at 1
mapred.Counters (Counters.java:<init>(135)) - Creating group org.apache.hadoop.mapred.Task$Counte
mapred.Counters (Counters.java:getCounter(190)) - Adding Map input records at 0
mapred.Counters (Counters.java:getCounter(190)) - Adding Map output records at 1
mapred.Counters (Counters.java:getCounter(190)) - Adding Map input bytes at 2
mapred.Counters (Counters.java:getCounter(190)) - Adding Map output bytes at 3
mapred.Counters (Counters.java:getCounter(190)) - Adding Combine input records at 4
mapred.Counters (Counters.java:getCounter(190)) - Adding Combine output records at 5
apred.LocalJobRunner (LocalJobRunner.java:statusUpdate(258)) - 2613 pages, 613 errors, 6.7 pages/s, 1
mapred.Counters (Counters.java:<init>(135)) - Creating group org.apache.hadoop.mapred.Task$FileSyst
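Incidentally, those per-counter messages look like DEBUG-level output from Hadoop's mapred classes, so one way to quiet them (a minimal sketch, assuming the stock conf/log4j.properties layout that ships with Nutch) is to raise the log level for that package:

```properties
# conf/log4j.properties
# Raise the threshold for Hadoop's mapred package so the per-counter
# DEBUG chatter from Counters and LocalJobRunner is suppressed.
log4j.logger.org.apache.hadoop.mapred=WARN
```

This only reduces the logging noise, of course; if the time is genuinely being spent in the local map/reduce jobs rather than in writing log lines, it won't change the wall-clock time.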
On Fri, Sep 19, 2008 at 2:27 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Kevin MacDonald wrote:
>
>> Which is better for overall performance? To parse during fetching or
>> afterward?
>
> It's slightly faster to parse during fetching ... BUT if a parser crashes
> or catches OOM exception, you are left without content and without parsed
> text, whereas if you fetch and then parse then at least you already have the
> content and can re-run the parse job. Usually the process of getting content
> from remote sites is the bottleneck.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
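For anyone following along: the fetch-then-parse split Andrzej describes is controlled by a configuration property. A sketch of the override (assuming the usual nutch-site.xml mechanism and the fetcher.parse property; check nutch-default.xml for the exact name in your version):

```xml
<!-- nutch-site.xml: disable parsing during fetch, so the raw content
     is always persisted first and parsing runs as a separate job
     that can be re-run if a parser crashes or runs out of memory. -->
<property>
  <name>fetcher.parse</name>
  <value>false</value>
</property>
```

With that in place, the parse step is invoked afterward on the fetched segment, along the lines of bin/nutch parse <segment_dir>, and can simply be re-run against the already-saved content if it fails.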
