[ http://issues.apache.org/jira/browse/NUTCH-235?page=comments#action_12371147 ]
Doug Cutting commented on NUTCH-235:
------------------------------------

+1

This looks good. It will be a little slower for simple crawls, where each link is only processed once, but probably not noticeably. It will be significantly faster when re-crawling is performed, since the link db won't balloon.

I note that the add() methods are actually unchanged: just reformatted.

> Duplicate Inlink values
> -----------------------
>
>          Key: NUTCH-235
>          URL: http://issues.apache.org/jira/browse/NUTCH-235
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>     Reporter: Andrzej Bialecki
>     Assignee: Andrzej Bialecki
>  Attachments: patch.txt, set-patch.txt
>
> Reading the code for LinkDb.reduce(): if we have page duplicates in input
> segments, or if we have two copies of the same input segment, we will create
> the same Inlink values (satisfying Inlink.equals()) multiple times. Since
> Inlinks is a facade for List, and not a Set, we will get duplicate Inlink-s
> in Inlinks (if you know what I mean ;) .
> The problem is easy to test: create a new linkdb based on 2 identical
> segments. This problem also makes it more difficult to properly implement
> LinkDB updating mechanism (i.e. incremental invertlinks).
> I propose to change Inlinks to use a Set semantics, either explicitly by
> using a HashSet or implicitly by checking if a value to be added already
> exists. If there are no objections I'll commit this change shortly.
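
For readers without the attachments at hand, the Set-based variant proposed in the issue could look roughly like the sketch below. This is not the committed patch: InlinkSketch and InlinksSketch are hypothetical, simplified stand-ins for Nutch's Inlink and Inlinks, and the field names (fromUrl, anchor) are assumptions made for illustration. The point is only that backing the container with a HashSet makes a second copy of the same segment a no-op.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Demo: merging two identical "segments" must not duplicate inlinks.
class InlinksDemo {
    public static void main(String[] args) {
        InlinksSketch merged = new InlinksSketch();
        // Simulate feeding the same segment to the reducer twice.
        for (int copy = 0; copy < 2; copy++) {
            merged.add(new InlinkSketch("http://a.example/", "home"));
            merged.add(new InlinkSketch("http://b.example/", "docs"));
        }
        System.out.println(merged.size()); // prints 2
    }
}

// Hypothetical stand-in for Inlink: a source URL plus its anchor text.
final class InlinkSketch {
    private final String fromUrl;
    private final String anchor;

    InlinkSketch(String fromUrl, String anchor) {
        this.fromUrl = Objects.requireNonNull(fromUrl);
        this.anchor = Objects.requireNonNull(anchor);
    }

    // Two inlinks are equal when both the source URL and the anchor match.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof InlinkSketch)) return false;
        InlinkSketch other = (InlinkSketch) o;
        return fromUrl.equals(other.fromUrl) && anchor.equals(other.anchor);
    }

    // Must agree with equals() so that HashSet can detect duplicates.
    @Override
    public int hashCode() {
        return Objects.hash(fromUrl, anchor);
    }
}

// Hypothetical stand-in for Inlinks, backed by a Set instead of a List.
final class InlinksSketch {
    private final Set<InlinkSketch> inlinks = new HashSet<>();

    // Returns false when an equal inlink is already present.
    boolean add(InlinkSketch inlink) {
        return inlinks.add(inlink);
    }

    // Merges another collection; duplicates are dropped by the Set.
    void addAll(InlinksSketch other) {
        inlinks.addAll(other.inlinks);
    }

    int size() {
        return inlinks.size();
    }
}
```

The alternative mentioned in the issue, keeping the List and checking contains() before each add, gives the same result but costs a linear scan per insertion, whereas the HashSet lookup is expected constant time as long as equals() and hashCode() stay consistent.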
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
