Add timing information to all Tool classes
------------------------------------------
Key: NUTCH-838
URL: https://issues.apache.org/jira/browse/NUTCH-838
Project: Nutch
Issue Type: New Feature
Components: fetcher, generator, indexer, linkdb, parser
Affects Versions: 1.1
Environment: JDK 1.6, Linux & Windows
Reporter: Jeroen van Vianen
Fix For: 2.0
I am trying to crawl a few hundred URLs incrementally. Performance degrades
suddenly once the index reaches approximately 25000 URLs.
At first each batch of inject, (generate, fetch, parse, updatedb) * 3,
invertlinks, solrindex, solrdedup takes approximately half an hour with
topN 500, but elapsed times have since grown to 00h45m, 01h15m and 01h30m for
successive batches. As I'm uncertain which of the phases takes so much time, I
decided to add start and finish times to all classes that implement Tool, so
that I at least get a feeling for it and can review the times in a log file.
I am using pretty old hardware, but I am planning to recrawl these URLs on a
regular basis, and if every iteration is going to take more and more time,
index updates will be few and far between :-(
I added timing information to *all* Tool classes for consistency, even though
only 10 or so Tools are really interesting.
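
To illustrate the idea, here is a minimal, self-contained sketch of the kind of
start/finish logging described above. It is not the attached patch: the class
name TimingSketch, the helpers timed() and formatElapsed(), and the log wording
are all hypothetical, and it prints to stdout instead of using the Nutch
logger. A real Tool would wrap the body of its run(String[] args) method in a
similar way.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Nutch: logs start time, finish time and
// elapsed time around an arbitrary unit of work.
public class TimingSketch {

    /** Formats a duration in milliseconds as HH:mm:ss. */
    static String formatElapsed(long millis) {
        long seconds = millis / 1000;
        return String.format("%02d:%02d:%02d",
                seconds / 3600, (seconds % 3600) / 60, seconds % 60);
    }

    /** Runs the given work, printing when it started and finished. */
    static int timed(String name, Callable<Integer> work) throws Exception {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        long start = System.currentTimeMillis();
        System.out.println(name + ": starting at " + sdf.format(new Date(start)));
        try {
            return work.call();
        } finally {
            // Logged even if the work throws, so failed phases are timed too.
            long end = System.currentTimeMillis();
            System.out.println(name + ": finished at " + sdf.format(new Date(end))
                    + ", elapsed: " + formatElapsed(end - start));
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a real phase such as generate or fetch.
        int exitCode = timed("ExamplePhase", new Callable<Integer>() {
            public Integer call() throws Exception {
                Thread.sleep(50);
                return 0;
            }
        });
        System.out.println("exit code: " + exitCode);
    }
}
```

With every phase logging in this shape, the slow phase in a batch can be found
by searching the log file for the "elapsed:" lines.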