Lewis John McGibbney created NUTCH-1907:
-------------------------------------------

             Summary: Incorrect output of Outlinks to Hosts within HostDbUpdateReducer
                 Key: NUTCH-1907
                 URL: https://issues.apache.org/jira/browse/NUTCH-1907
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 2.2.1
            Reporter: Lewis John McGibbney
             Fix For: 2.4


I [explained|http://www.mail-archive.com/user%40nutch.apache.org/msg12917.html] on the user list that I had found a bug in the 2.X HostDb.

I was looking into the Nutch 2.X HostDbUpdateReducer code and I think I have found a bug in the way we output Host data:

https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/host/HostDbUpdateReducer.java#L87

I feel that the following code

{code}
host.getInlinks().put(new Utf8(outlink), new Utf8(Integer.toString(outlinkCount.getCount(outlink))));
{code}

should be changed to the following

{code}
host.getOutlinks().put(new Utf8(outlink), new Utf8(Integer.toString(outlinkCount.getCount(outlink))));
{code}

Note the difference: the outlink counts are now written to the Host's outlinks, instead of being added to its inlinks a second time.
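
To make the effect of the one-word change easier to see, here is a minimal, self-contained sketch. It does not use Nutch's classes: `HostLinkSketch`, its nested `Host`, the `update` method, the `buggy` flag and the sample host names are all hypothetical stand-ins, with plain `java.util` maps in place of the `Utf8`-keyed maps and the counter used in the reducer. It also assumes that the inlinks are filled by a similar loop just before the outlinks, which the report implies but does not show.

```java
import java.util.HashMap;
import java.util.Map;

public class HostLinkSketch {

    /** Hypothetical stand-in for the Host record: two maps of link target to count. */
    static class Host {
        final Map<String, String> inlinks = new HashMap<>();
        final Map<String, String> outlinks = new HashMap<>();
    }

    /**
     * Copies the counts into a new Host. With buggy == true the outlink counts
     * are written into the inlinks map, mirroring the line quoted in the report;
     * with buggy == false they go into the outlinks map, as the report proposes.
     */
    static Host update(Map<String, Integer> inlinkCounts,
                       Map<String, Integer> outlinkCounts,
                       boolean buggy) {
        Host host = new Host();
        // Assumed preceding step: inlinks are stored first.
        for (Map.Entry<String, Integer> e : inlinkCounts.entrySet()) {
            host.inlinks.put(e.getKey(), Integer.toString(e.getValue()));
        }
        // The step under discussion: where do the outlink counts end up?
        for (Map.Entry<String, Integer> e : outlinkCounts.entrySet()) {
            Map<String, String> target = buggy ? host.inlinks : host.outlinks;
            target.put(e.getKey(), Integer.toString(e.getValue()));
        }
        return host;
    }

    public static void main(String[] args) {
        Map<String, Integer> inCounts = new HashMap<>();
        inCounts.put("a.example.org", 2);
        Map<String, Integer> outCounts = new HashMap<>();
        outCounts.put("b.example.org", 3);

        Host buggy = update(inCounts, outCounts, true);
        Host fixed = update(inCounts, outCounts, false);
        System.out.println("buggy: inlinks=" + buggy.inlinks + " outlinks=" + buggy.outlinks);
        System.out.println("fixed: inlinks=" + fixed.inlinks + " outlinks=" + fixed.outlinks);
    }
}
```

In the buggy variant the outlinks map stays empty and the outlink target shows up among the inlinks; in the fixed variant each map holds only its own entries.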