[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748840#action_12748840 ]
Julien Nioche commented on NUTCH-702:
-------------------------------------

There have been quite a few related questions on the mailing lists since releasing the patch. As Andrzej pointed out, the best way to avoid memory exceptions is to limit the number of URLs per inlink (see nutch-default.xml); however, this patch reduces the memory footprint and therefore allows a higher value for that parameter. The main advantage of this patch is that it speeds up all phases of a crawl involving the crawlDB. As an illustration, I compared the default version with the patched one; the average results were:

Injection (1M URLs): original 98 secs, patched 65 secs
Generation:          original 37 secs, patched 37 secs
Stats:               original 13 secs, patched 19 secs
Update:              original 40 secs, patched 19 secs

I find it a bit surprising that the patched version is not faster on the generation (any idea?). On the other phases it seems to be 50% faster, except for the update, where it is twice as fast.

J.

> Lazy Instanciation of Metadata in CrawlDatum
> --------------------------------------------
>
>                 Key: NUTCH-702
>                 URL: https://issues.apache.org/jira/browse/NUTCH-702
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Julien Nioche
>         Attachments: NUTCH-702.patch, NUTCH-702.patch.v2, NUTCH-702v3.patch
>
>
> CrawlDatum systematically instantiates its metadata, which is quite wasteful,
> especially in the case of CrawlDBReducer, which generates a new CrawlDatum
> for each incoming link before storing it in a List.
> Initial testing of the lazy instantiation shows an improvement in both speed
> and memory consumption. I will generate a patch for it ASAP.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
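The pattern the issue describes can be sketched roughly as follows. This is a minimal illustration of lazy instantiation, not the actual Nutch source: the class name `Datum` and the `Map<String, String>` metadata type are simplified stand-ins (the real CrawlDatum stores metadata in a Hadoop `MapWritable`). The point is that the map is allocated only on first access, so the many CrawlDatum objects created per incoming link pay nothing when their metadata is never touched.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for CrawlDatum; names are illustrative only.
class Datum {
    // Eager version would be: private Map<String, String> metadata = new HashMap<>();
    // Lazy version leaves the field null until the metadata is actually needed.
    private Map<String, String> metadata = null;

    // Allocates the map on first access only.
    Map<String, String> getMetaData() {
        if (metadata == null) {
            metadata = new HashMap<>();
        }
        return metadata;
    }

    // Read-only checks can avoid triggering allocation entirely.
    boolean hasMetadata() {
        return metadata != null && !metadata.isEmpty();
    }
}
```

With this layout, code paths that only read (such as a reducer collecting one `Datum` per inlink into a `List`) never allocate the map, which is where the memory and speed gains in the benchmark above would come from.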