[jira] Commented: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count
[ https://issues.apache.org/jira/browse/NUTCH-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797990#action_12797990 ]

Julien Nioche commented on NUTCH-269:
-------------------------------------

I will shortly commit a variant of this approach in which the inlinks are stored in a priority queue so that only the best-scoring ones are kept. The size of the queue is set by the parameter db.update.max.inlinks, which has a default value of 1.

CrawlDbReducer: OOME because no upper-bound on inlinks count
------------------------------------------------------------

Key: NUTCH-269
URL: https://issues.apache.org/jira/browse/NUTCH-269
Project: Nutch
Issue Type: Bug
Reporter: stack
Assignee: Julien Nioche
Priority: Trivial
Attachments: too-many-links.patch, too-many-links2.patch

A CrawlDB update repeatedly OOME'd because a URL had hundreds of thousands of inlinks (the British Foreign Office likes putting a clear.gif multiple times into each page: http://www.fco.gov.uk/Xcelerate/graphics/images/fcomain/clear.gif).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
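The bounded priority queue described in the comment above might look roughly like the following. This is only a minimal sketch, not the committed Nutch code: the class BoundedInlinks and the method keepBest are hypothetical names, inlinks are reduced to bare float scores, and the maxInlinks argument stands in for the value of db.update.max.inlinks.

```java
import java.util.PriorityQueue;

/** Hypothetical helper (not Nutch code): keeps only the best-scoring inlinks. */
public class BoundedInlinks {

    /**
     * Returns at most maxInlinks of the highest scores, in descending order.
     * maxInlinks plays the role of db.update.max.inlinks (an assumption here).
     */
    public static float[] keepBest(float[] scores, int maxInlinks) {
        if (maxInlinks <= 0) {
            return new float[0];
        }
        // Min-heap: the head is always the weakest score currently kept,
        // so memory use is bounded by maxInlinks, not by the inlink count.
        PriorityQueue<Float> queue = new PriorityQueue<>(maxInlinks);
        for (float score : scores) {
            if (queue.size() < maxInlinks) {
                queue.add(score);
            } else if (score > queue.peek()) {
                queue.poll();      // evict the weakest kept score
                queue.add(score);
            }
        }
        // poll() yields the smallest first, so fill the array from the end.
        float[] best = new float[queue.size()];
        for (int i = best.length - 1; i >= 0; i--) {
            best[i] = queue.poll();
        }
        return best;
    }

    public static void main(String[] args) {
        float[] best = keepBest(new float[] {0.1f, 0.9f, 0.5f, 0.7f, 0.3f}, 2);
        for (float s : best) {
            System.out.println(s);
        }
    }
}
```

With a min-heap, each incoming inlink costs O(log n) in the queue size n, and a URL with hundreds of thousands of inlinks never holds more than n of them in memory at once.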
[jira] Commented: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count
[ https://issues.apache.org/jira/browse/NUTCH-269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798305#action_12798305 ]

Hudson commented on NUTCH-269:
------------------------------

Integrated in Nutch-trunk #1034 (see [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1034/]): OOME because no upper-bound on inlinks count (stack + jnioche)

CrawlDbReducer: OOME because no upper-bound on inlinks count
------------------------------------------------------------

Key: NUTCH-269
URL: https://issues.apache.org/jira/browse/NUTCH-269
Project: Nutch
Issue Type: Bug
Reporter: stack
Assignee: Julien Nioche
Priority: Trivial
Fix For: 1.1
Attachments: too-many-links.patch, too-many-links2.patch