[jira] Updated: (NUTCH-761) Avoid cloningCrawlDatum in CrawlDbReducer

Julien Nioche (JIRA) Tue, 03 Nov 2009 06:42:27 -0800

     [ 
https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Julien Nioche updated NUTCH-761:
--------------------------------

    Attachment: optiCrawlReducer.patch

> Avoid cloningCrawlDatum in CrawlDbReducer 
> ------------------------------------------
>
>                 Key: NUTCH-761
>                 URL: https://issues.apache.org/jira/browse/NUTCH-761
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Julien Nioche
>            Priority: Minor
>         Attachments: optiCrawlReducer.patch
>
>
> In the vast majority of cases the CrawlDbReducer receives a single CrawlDatum 
> per key in its reduce phase: an entry that comes from the crawlDB and is not 
> present in the segments.
> The attached patch optimizes the reduce step by avoiding an unnecessary 
> cloning of the CrawlDatum fields when there is only one CrawlDatum in the 
> values. The impact grows as the crawlDB gets larger; we noticed an 
> improvement of around 25-30% in the time spent in the reduce phase.
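
The idea described above can be illustrated with a minimal, self-contained
sketch. It is not the content of optiCrawlReducer.patch: the Datum class, the
copies counter, and the "latest fetch time wins" merge rule are hypothetical
stand-ins chosen only to show the fast path. It assumes, as Hadoop reducers
generally must, that the values iterator may reuse the same object between
calls, so a value has to be copied before advancing unless it is the only one.

```java
import java.util.Iterator;
import java.util.List;

class SingleValueReduceSketch {

    // Hypothetical stand-in for CrawlDatum; the real class has more fields.
    static final class Datum {
        int status;
        long fetchTime;

        Datum(int status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }

        // Field-by-field copy, the "cloning" the issue wants to avoid.
        void set(Datum other) {
            this.status = other.status;
            this.fetchTime = other.fetchTime;
        }
    }

    // Counts field copies so the effect of the fast path is observable.
    static int copies = 0;

    // Assumes at least one value per key, as in a reduce call.
    static Datum reduce(Iterator<Datum> values) {
        Datum first = values.next();
        if (!values.hasNext()) {
            // Fast path: a single value, so no copy is needed.
            return first;
        }
        // Slow path: copy before advancing, since the iterator may reuse objects.
        Datum result = new Datum(0, 0L);
        result.set(first);
        copies++;
        while (values.hasNext()) {
            Datum next = values.next();
            // Hypothetical merge rule: keep the most recently fetched entry.
            if (next.fetchTime > result.fetchTime) {
                result.set(next);
                copies++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Datum only = new Datum(1, 100L);
        Datum merged = reduce(List.of(only).iterator());
        System.out.println("same instance: " + (merged == only) + ", copies: " + copies);
    }
}
```

With one value per key the reducer returns the incoming object untouched, so
the saving scales with the share of keys that appear only in the crawlDB.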

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
