[ 
https://issues.apache.org/jira/browse/NUTCH-838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884997#action_12884997
 ] 

Hudson commented on NUTCH-838:
------------------------------

Integrated in Nutch-trunk #1197 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1197/])
    - fix for NUTCH-838 Add timing information to all Tool classes


> Add timing information to all Tool classes
> ------------------------------------------
>
>                 Key: NUTCH-838
>                 URL: https://issues.apache.org/jira/browse/NUTCH-838
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, generator, indexer, linkdb, parser
>    Affects Versions: 1.1
>         Environment: JDK 1.6, Linux & Windows
>            Reporter: Jeroen van Vianen
>            Assignee: Chris A. Mattmann
>             Fix For: 1.2, 2.0
>
>         Attachments: timings.patch
>
>
> Am happily trying to crawl a few hundred URLs incrementally. Performance is 
> degrading suddenly after the index reaches approximately 25000 URLs.
> At first each inject (generate, fetch, parse, updatedb) * 3, invertlinks, 
> solrindex, solrdedup batch takes approximately half an hour with topN 500, 
> but elapsed times now increase to 00h45m, 01h15m, 01h30m with every batch. 
> As I'm uncertain which of the phases takes so much time, I decided to add 
> start and finish times to all classes that implement Tool, so I at least get 
> a feel for where the time goes and can review the timings in a log file.
> Am using pretty old hardware, but I am planning to recrawl these URLs on a 
> regular basis and if every iteration is going to take more and more time, 
> index updates will be few and far between :-(
> I added timing information to *all* Tool classes for consistency, even 
> though only 10 or so Tools are really interesting.
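
The start/finish logging described in the issue could be sketched roughly as
follows. This is not the contents of timings.patch, which is not reproduced
in this message; it is a minimal, self-contained illustration of the idea.
The class name TimedRun and the helpers timed() and formatElapsed() are
hypothetical, and a plain Callable stands in for a Tool's run() method so
that the sketch does not depend on Hadoop.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Nutch: wraps one phase with start/finish logging.
public class TimedRun {

    /** Formats a millisecond duration as HH:mm:ss. */
    public static String formatElapsed(long millis) {
        long totalSeconds = millis / 1000;
        return String.format("%02d:%02d:%02d",
                totalSeconds / 3600, (totalSeconds % 3600) / 60, totalSeconds % 60);
    }

    /** Runs the step, printing its start time, finish time and elapsed time. */
    public static <T> T timed(String name, Callable<T> step) throws Exception {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        long start = System.currentTimeMillis();
        System.out.println(name + ": starting at " + sdf.format(new Date(start)));
        try {
            return step.call();
        } finally {
            // Logged even if the step throws, so a failed phase still shows its duration.
            long end = System.currentTimeMillis();
            System.out.println(name + ": finished at " + sdf.format(new Date(end))
                    + ", elapsed: " + formatElapsed(end - start));
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a real phase such as generate or fetch (assumption for illustration).
        int exitCode = timed("ExamplePhase", new Callable<Integer>() {
            public Integer call() throws Exception {
                Thread.sleep(50);
                return 0;
            }
        });
        System.out.println("exit code: " + exitCode);
    }
}
```

With one such pair of log lines per phase, the batch log should show which
step accounts for the growth from roughly half an hour to 01h30m.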

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
