[
https://issues.apache.org/jira/browse/NUTCH-838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris A. Mattmann resolved NUTCH-838.
-------------------------------------
Resolution: Fixed
- Patch applied to trunk in r960246 and backported to 1.2-branch in r960248. I
had to make some minor CR-LF mods and avoid patching a few files that were
removed in the latest trunk. Thanks, Jeroen!
> Add timing information to all Tool classes
> ------------------------------------------
>
> Key: NUTCH-838
> URL: https://issues.apache.org/jira/browse/NUTCH-838
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, generator, indexer, linkdb, parser
> Affects Versions: 1.1
> Environment: JDK 1.6, Linux & Windows
> Reporter: Jeroen van Vianen
> Assignee: Chris A. Mattmann
> Fix For: 1.2, 2.0
>
> Attachments: timings.patch
>
>
> Am happily trying to crawl a few hundred URLs incrementally. Performance is
> degrading suddenly after the index reaches approximately 25000 URLs.
> At first each batch of inject, (generate, fetch, parse, updatedb) * 3,
> invertlinks, solrindex, solrdedup took approximately half an hour with
> topN 500, but elapsed times now increase to 00h45m, 01h15m, 01h30m with
> every batch.
> As I'm uncertain which of the phases takes so much time, I decided to add
> start and finish times to all classes that implement Tool, so I at least
> have a feeling for where the time goes and can review them in a log file.
> Am using pretty old hardware, but I am planning to recrawl these URLs on a
> regular basis and if every iteration is going to take more and more time,
> index updates will be few and far between :-(
> I added timing information to *all* Tool classes for consistency, even
> though there are only 10 or so Tools that are really interesting.
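
The idea described above (logging a start time, a finish time, and the elapsed time around each Tool run) can be sketched roughly as follows. This is only an illustration, not the contents of timings.patch: SimpleTool is a hypothetical stand-in for Hadoop's Tool interface so the example compiles without Hadoop on the classpath, and the names TimedToolSketch, runTimed, and formatElapsed as well as the log message format are assumptions.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

final class TimedToolSketch {

    /** Hypothetical stand-in for Hadoop's Tool interface (same run signature). */
    interface SimpleTool {
        int run(String[] args) throws Exception;
    }

    /** Formats the time between two millisecond timestamps as HH:mm:ss. */
    static String formatElapsed(long startMillis, long endMillis) {
        long seconds = Math.max(0L, endMillis - startMillis) / 1000L;
        return String.format("%02d:%02d:%02d",
                seconds / 3600, (seconds % 3600) / 60, seconds % 60);
    }

    /** Runs the tool, printing start time, finish time, and elapsed time. */
    static int runTimed(String name, SimpleTool tool, String[] args) throws Exception {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        long start = System.currentTimeMillis();
        System.out.println(name + ": starting at " + sdf.format(new Date(start)));
        try {
            return tool.run(args);
        } finally {
            // Report the finish time even if the tool throws.
            long end = System.currentTimeMillis();
            System.out.println(name + ": finished at " + sdf.format(new Date(end))
                    + ", elapsed: " + formatElapsed(start, end));
        }
    }

    public static void main(String[] args) throws Exception {
        // Demo tool that sleeps briefly and returns exit code 0.
        int exitCode = runTimed("DemoTool", a -> { Thread.sleep(50); return 0; }, args);
        System.out.println("exit code: " + exitCode);
    }
}
```

With one such line pair per phase in the log, the slow step in a batch (generate, fetch, parse, updatedb, and so on) can be found by comparing the elapsed values.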