Add timing information to all Tool classes
------------------------------------------

                 Key: NUTCH-838
                 URL: https://issues.apache.org/jira/browse/NUTCH-838
             Project: Nutch
          Issue Type: New Feature
          Components: fetcher, generator, indexer, linkdb, parser
    Affects Versions: 1.1
         Environment: JDK 1.6, Linux & Windows
            Reporter: Jeroen van Vianen
             Fix For: 2.0


Am happily trying to crawl a few hundred URLs incrementally. Performance is 
degrading suddenly after the index reaches approximately 25000 URLs.

At first each batch of inject, (generate, fetch, parse, updatedb) * 3, 
invertlinks, solrindex, solrdedup took approximately half an hour with topN 
500, but elapsed times now increase to 00h45m, 01h15m, 01h30m with every batch. 
As I'm uncertain which of the phases takes so much time, I decided to add start 
and finish times to all classes that implement Tool, so I at least have a 
feeling for where the time goes and can review the timings in a log file.
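
To illustrate the idea, here is a minimal sketch of this kind of start/finish 
logging. It is not the attached patch: the class TimingSketch and the methods 
elapsed() and timed() are hypothetical names made up for this example, and it 
prints to stdout rather than going through the logger a real Tool would use.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.concurrent.Callable;

// Hypothetical helper, for illustration only; not part of Nutch.
public class TimingSketch {

    // Formats the time between two timestamps (in milliseconds)
    // as e.g. "00h45m00s".
    static String elapsed(long startMillis, long endMillis) {
        long seconds = Math.max(0L, endMillis - startMillis) / 1000L;
        return String.format(Locale.ROOT, "%02dh%02dm%02ds",
                seconds / 3600, (seconds % 3600) / 60, seconds % 60);
    }

    // Runs one phase and prints its start time, finish time and elapsed time.
    // The finish line is printed even if the phase throws.
    static <T> T timed(String phase, Callable<T> body) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        long start = System.currentTimeMillis();
        System.out.println(phase + ": starting at " + fmt.format(new Date(start)));
        try {
            return body.call();
        } finally {
            long end = System.currentTimeMillis();
            System.out.println(phase + ": finished at " + fmt.format(new Date(end))
                    + ", elapsed: " + elapsed(start, end));
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the body of a Tool's run(String[] args) method.
        int exitCode = timed("generate", () -> {
            Thread.sleep(50);
            return 0;
        });
        System.out.println("exit code " + exitCode);
    }
}
```

With something like this wrapped around each phase, the log shows one start 
line and one finish line per phase, so the slow step in a batch should stand 
out when grepping for "elapsed".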

Am using pretty old hardware, but I am planning to recrawl these URLs on a 
regular basis and if every iteration is going to take more and more time, index 
updates will be few and far between :-(

I added timing information to *all* Tool classes for consistency whereas there 
are only 10 or so Tools that are really interesting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.