Duplicate Inlink values
-----------------------
Key: NUTCH-235
URL: http://issues.apache.org/jira/browse/NUTCH-235
Project: Nutch
Type: Bug
Versions: 0.8-dev
Reporter: Andrzej Bialecki
Assigned to: Andrzej Bialecki
Reading the code for LinkDb.reduce(): if we have page duplicates in input
segments, or if we have two copies of the same input segment, we will create
the same Inlink values (satisfying Inlink.equals()) multiple times. Since
Inlinks is a facade for a List, and not a Set, we will end up with duplicate
Inlink entries inside a single Inlinks instance.
The problem is easy to test: create a new linkdb based on 2 identical segments.
This problem also makes it more difficult to properly implement a LinkDb
updating mechanism (i.e. incremental invertlinks).
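To illustrate the duplication without a full Nutch setup, here is a minimal
standalone sketch. DemoInlink, ListBackedInlinks and the reduce() helper are
hypothetical stand-ins, not the real Nutch classes; the only assumption taken
from the description above is that the container is backed by a List and that
equal values can arrive more than once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for Inlink: a value object with equals()/hashCode().
final class DemoInlink {
    final String fromUrl;
    final String anchor;

    DemoInlink(String fromUrl, String anchor) {
        this.fromUrl = fromUrl;
        this.anchor = anchor;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof DemoInlink)) return false;
        DemoInlink other = (DemoInlink) o;
        return fromUrl.equals(other.fromUrl) && anchor.equals(other.anchor);
    }

    @Override
    public int hashCode() {
        return Objects.hash(fromUrl, anchor);
    }
}

// Hypothetical stand-in for Inlinks: a thin facade over a List.
final class ListBackedInlinks {
    private final List<DemoInlink> inlinks = new ArrayList<>();

    void add(DemoInlink inlink) {
        inlinks.add(inlink); // no duplicate check, as with a plain List
    }

    int size() {
        return inlinks.size();
    }
}

class DuplicateInlinkDemo {
    // Mimics a reduce step that collects every inlink seen for one target URL.
    static ListBackedInlinks reduce(List<List<DemoInlink>> segments) {
        ListBackedInlinks result = new ListBackedInlinks();
        for (List<DemoInlink> segment : segments) {
            for (DemoInlink inlink : segment) {
                result.add(inlink);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<DemoInlink> segment = Arrays.asList(
            new DemoInlink("http://a.example/", "A"),
            new DemoInlink("http://b.example/", "B"));
        // The same segment passed twice, as in the test described above.
        ListBackedInlinks merged = reduce(Arrays.asList(segment, segment));
        System.out.println(merged.size()); // prints 4, although only 2 links are distinct
    }
}
```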
I propose to change Inlinks to use Set semantics, either explicitly by using
a HashSet or implicitly by checking if a value to be added already exists. If
there are no objections I'll commit this change shortly.
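A rough sketch of the two options follows. All class names here
(SketchInlink, SetInlinks, CheckedListInlinks) are hypothetical and only
illustrate the idea; this is not the actual patch, and the real Inlinks class
has other responsibilities that are left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for Inlink. hashCode() must agree with equals(),
// otherwise the HashSet variant will not detect duplicates.
final class SketchInlink {
    final String fromUrl;
    final String anchor;

    SketchInlink(String fromUrl, String anchor) {
        this.fromUrl = fromUrl;
        this.anchor = anchor;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SketchInlink)) return false;
        SketchInlink other = (SketchInlink) o;
        return fromUrl.equals(other.fromUrl) && anchor.equals(other.anchor);
    }

    @Override
    public int hashCode() {
        return Objects.hash(fromUrl, anchor);
    }
}

// Option 1: explicit Set semantics via a HashSet.
final class SetInlinks {
    private final Set<SketchInlink> inlinks = new HashSet<>();

    // Returns false if an equal value was already present.
    boolean add(SketchInlink inlink) {
        return inlinks.add(inlink);
    }

    int size() {
        return inlinks.size();
    }
}

// Option 2: keep the List, but check for an existing equal value first.
final class CheckedListInlinks {
    private final List<SketchInlink> inlinks = new ArrayList<>();

    // Returns false if an equal value was already present.
    boolean add(SketchInlink inlink) {
        if (inlinks.contains(inlink)) {
            return false;
        }
        return inlinks.add(inlink);
    }

    int size() {
        return inlinks.size();
    }
}
```

Option 2 preserves insertion order and only needs equals(), but each add scans
the whole list; option 1 has constant-time adds on average and relies on
hashCode() being consistent with equals().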