[ https://issues.apache.org/jira/browse/NUTCH-838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884997#action_12884997 ]
Hudson commented on NUTCH-838:
------------------------------
Integrated in Nutch-trunk #1197 (See
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1197/])
- fix for NUTCH-838 Add timing information to all Tool classes
> Add timing information to all Tool classes
> ------------------------------------------
>
> Key: NUTCH-838
> URL: https://issues.apache.org/jira/browse/NUTCH-838
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, generator, indexer, linkdb, parser
> Affects Versions: 1.1
> Environment: JDK 1.6, Linux & Windows
> Reporter: Jeroen van Vianen
> Assignee: Chris A. Mattmann
> Fix For: 1.2, 2.0
>
> Attachments: timings.patch
>
>
> I am crawling a few hundred URLs incrementally. Performance started
> degrading suddenly once the index reached approximately 25000 URLs.
> At first, each batch of inject, (generate, fetch, parse, updatedb) * 3,
> invertlinks, solrindex, solrdedup took approximately half an hour with
> topN 500, but elapsed times now increase with every batch: 00h45m, 01h15m,
> 01h30m.
> As I'm uncertain which of the phases takes so much time, I decided to add
> start and finish times to all classes that implement Tool, so that I at least
> get a feeling for it and can review the timings in a log file.
> Am using pretty old hardware, but I am planning to recrawl these URLs on a
> regular basis and if every iteration is going to take more and more time,
> index updates will be few and far between :-(
> I added timing information to *all* Tool classes for consistency whereas
> there are only 10 or so Tools that are really interesting.
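For readers who want a concrete picture of the approach described above, here is a minimal sketch of logging start, finish and elapsed times around one phase. It is not the content of timings.patch: the class name TimingSketch, the methods timed and formatElapsed, and the log wording are all hypothetical, and it prints to standard output instead of using Nutch's logger so that it runs without Hadoop on the classpath.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not taken from timings.patch.
class TimingSketch {

    // Formats an elapsed duration given in milliseconds as HH:mm:ss.
    static String formatElapsed(long millis) {
        long totalSeconds = millis / 1000;
        return String.format("%02d:%02d:%02d",
                totalSeconds / 3600, (totalSeconds % 3600) / 60, totalSeconds % 60);
    }

    // Runs one phase and prints its start time, finish time and elapsed time.
    // The body stands in for a Tool's run method and returns its exit code.
    static int timed(String phase, Callable<Integer> body) throws Exception {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        long start = System.currentTimeMillis();
        System.out.println(phase + ": starting at " + sdf.format(new Date(start)));
        try {
            return body.call();
        } finally {
            // The finally block logs the finish time even if the phase fails.
            long end = System.currentTimeMillis();
            System.out.println(phase + ": finished at " + sdf.format(new Date(end))
                    + ", elapsed: " + formatElapsed(end - start));
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a real phase such as generate or fetch.
        int exitCode = timed("generate", () -> {
            Thread.sleep(50);
            return 0;
        });
        System.out.println("exit code " + exitCode);
    }
}
```

With one such pair of log lines per phase, the slow step in a batch can be found by scanning the log for the largest elapsed value.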