[jira] Created: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer

Julien Nioche (JIRA) Tue, 03 Nov 2009 06:40:27 -0800

Avoid cloning CrawlDatum in CrawlDbReducer
------------------------------------------


                 Key: NUTCH-761
                 URL: https://issues.apache.org/jira/browse/NUTCH-761
             Project: Nutch
          Issue Type: Improvement
            Reporter: Julien Nioche
            Priority: Minor


In the vast majority of cases the CrawlDbReducer receives a single CrawlDatum
per key in its reduce phase: these are the entries that come from the crawlDB
and are not present in the segments.
The attached patch optimizes the reduce step by avoiding an unnecessary cloning
of the CrawlDatum fields when there is only one CrawlDatum in the values. The
impact grows as the crawlDB gets larger; we noticed an improvement of around
25-30% in the time spent in the reduce phase.
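
The idea can be sketched roughly as follows. This is not the attached patch and
does not use the Nutch or Hadoop classes: Datum is a hypothetical stand-in for
CrawlDatum, its set() method stands in for the field-by-field copy, and the
"keep the latest fetch time" merge rule is only a placeholder for the real
reducer logic. The sketch assumes, as in Hadoop, that a value only needs to be
copied when the iterator will hand out further values afterwards.

```java
import java.util.Arrays;
import java.util.Iterator;

public class SingleValueReduceSketch {

    /** Hypothetical stand-in for CrawlDatum: a mutable record with a copy method. */
    public static final class Datum {
        public static int copies = 0; // counts copies, for illustration only
        public byte status;
        public long fetchTime;

        public Datum() {}

        public Datum(byte status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }

        /** Field-by-field copy, the cost the issue wants to avoid. */
        public void set(Datum other) {
            copies++;
            this.status = other.status;
            this.fetchTime = other.fetchTime;
        }
    }

    /** Reduces the values of one key; assumes at least one value is present. */
    public static Datum reduce(Iterator<Datum> values) {
        Datum first = values.next();
        if (!values.hasNext()) {
            // Fast path: a single value, nothing will overwrite it, so no copy.
            return first;
        }
        // Slow path: several values, so work on a private copy.
        Datum result = new Datum();
        result.set(first);
        while (values.hasNext()) {
            Datum next = values.next();
            if (next.fetchTime > result.fetchTime) { // placeholder merge rule
                result.set(next);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Datum only = new Datum((byte) 1, 10L);
        Datum out = reduce(Arrays.asList(only).iterator());
        System.out.println("same instance: " + (out == only)
                + ", copies: " + Datum.copies);
    }
}
```

With a single value the reducer returns the incoming object untouched, so the
copy counter stays at zero; the copying path is only taken when a key has more
than one value.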

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.