That is odd, because I am finding that after the last fetch completes
there is a lengthy period of computation before the fetch/parse cycle is
done. Fetching itself runs at about 1000 URLs per minute, which seems
fine, but the additional processing makes the overall time rather slow. I
cranked up logging and saw a great deal of output like that shown below;
it seems to be taking up all the time. I'm wondering whether there's
something I can do to optimize Nutch for a single-machine install.
mapred.Counters (Counters.java:&lt;init&gt;(135)) - Creating group org.apache.hadoop.mapred.Task$FileSyst
mapred.Counters (Counters.java:getCounter(190)) - Adding Local bytes read at 0
mapred.Counters (Counters.java:getCounter(190)) - Adding Local bytes written at 1
mapred.Counters (Counters.java:&lt;init&gt;(135)) - Creating group org.apache.hadoop.mapred.Task$Counte
mapred.Counters (Counters.java:getCounter(190)) - Adding Map input records at 0
mapred.Counters (Counters.java:getCounter(190)) - Adding Map output records at 1
mapred.Counters (Counters.java:getCounter(190)) - Adding Map input bytes at 2
mapred.Counters (Counters.java:getCounter(190)) - Adding Map output bytes at 3
mapred.Counters (Counters.java:getCounter(190)) - Adding Combine input records at 4
mapred.Counters (Counters.java:getCounter(190)) - Adding Combine output records at 5
mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(258)) - 2613 pages, 613 errors, 6.7 pages/s, 1
mapred.Counters (Counters.java:&lt;init&gt;(135)) - Creating group org.apache.hadoop.mapred.Task$FileSyst
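
One thing worth trying is simply raising the log level for the classes that emit these per-counter DEBUG messages, so the time isn't spent in logging itself. A minimal sketch for conf/log4j.properties, with the category names assumed from the log output above (verify them against your version's log4j.properties):

```properties
# Hedged sketch: quiet the Hadoop mapred classes that emit the
# per-counter DEBUG lines shown above. Logger names are assumed from
# the log output; check your conf/log4j.properties before relying on them.
log4j.logger.org.apache.hadoop.mapred.Counters=WARN
log4j.logger.org.apache.hadoop.mapred.LocalJobRunner=INFO
```

If the slowdown persists with logging quieted, the remaining time is genuinely in the local map/reduce jobs rather than in log output.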

On Fri, Sep 19, 2008 at 2:27 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Kevin MacDonald wrote:
>
>> Which is better for overall performance? To parse during fetching or
>> afterward?
>>
>
> It's slightly faster to parse during fetching ... BUT if a parser crashes
> or hits an OOM exception, you are left without content and without parsed
> text, whereas if you fetch and then parse, at least you already have the
> content and can re-run the parse job. Usually the process of getting
> content from remote sites is the bottleneck.
>
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
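
For anyone wanting to follow the fetch-then-parse approach Andrzej describes, it is typically configured by disabling parsing during fetch and running the parse step as a separate job. A sketch for conf/nutch-site.xml, assuming the `fetcher.parse` property as found in nutch-default.xml (check the property name and default in your Nutch version):

```xml
<!-- Hedged sketch for conf/nutch-site.xml: turn off parse-during-fetch
     so content is written to the segment first and parsing runs as its
     own job. Property name assumed from nutch-default.xml. -->
<property>
  <name>fetcher.parse</name>
  <value>false</value>
</property>
```

With that set, a segment is fetched with `bin/nutch fetch` and then parsed in a separate pass with `bin/nutch parse` on the same segment directory; if the parse job crashes or runs out of memory, the fetched content is still on disk and only the parse needs to be re-run.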
