Lewis John McGibbney created NUTCH-1907:
-------------------------------------------

             Summary: Incorrect output of Outlinks to Hosts within HostDbUpdateReducer
                 Key: NUTCH-1907
                 URL: https://issues.apache.org/jira/browse/NUTCH-1907
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 2.2.1
            Reporter: Lewis John McGibbney
             Fix For: 2.4


I [explained|http://www.mail-archive.com/user%40nutch.apache.org/msg12917.html] 
that I found a bug in the 2.X HostDb.
I was looking into the code within the Nutch 2.X HostDbUpdateReducer and
I think I've discovered a bug in the way we output Host data:
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/host/HostDbUpdateReducer.java#L87
I feel that the following code
{code}
host.getInlinks().put(new Utf8(outlink), new
Utf8(Integer.toString(outlinkCount.getCount(outlink))));
{code}
should be changed to the following
{code}
host.getOutlinks().put(new Utf8(outlink), new
Utf8(Integer.toString(outlinkCount.getCount(outlink))));
{code}
Notice the difference: the outlink counts are written to the Host's Outlinks 
instead of being put into its Inlinks a second time.
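
To make the intended behaviour concrete, here is a minimal, self-contained sketch. It does not use the real Nutch or Avro classes: HostStub, recordOutlinks and the plain String maps are hypothetical stand-ins for the Host record, the reducer loop and the Utf8 values, so treat it as an illustration of the proposed change rather than the actual patch.

```java
import java.util.HashMap;
import java.util.Map;

public class HostLinksSketch {

    // Hypothetical stand-in for the Avro-generated Host record (assumption).
    static class HostStub {
        private final Map<String, String> inlinks = new HashMap<>();
        private final Map<String, String> outlinks = new HashMap<>();

        Map<String, String> getInlinks() { return inlinks; }
        Map<String, String> getOutlinks() { return outlinks; }
    }

    // Proposed behaviour: outlink counts go into the outlinks map,
    // leaving the inlinks map untouched.
    static void recordOutlinks(HostStub host, Map<String, Integer> outlinkCount) {
        for (Map.Entry<String, Integer> entry : outlinkCount.entrySet()) {
            host.getOutlinks().put(entry.getKey(), Integer.toString(entry.getValue()));
        }
    }

    public static void main(String[] args) {
        HostStub host = new HostStub();
        Map<String, Integer> counts = new HashMap<>();
        counts.put("example.org", 3); // made-up sample data

        recordOutlinks(host, counts);

        System.out.println("inlinks=" + host.getInlinks());
        System.out.println("outlinks=" + host.getOutlinks());
    }
}
```

With the current code the equivalent of recordOutlinks would call getInlinks().put(...), so the outlinks map would stay empty and the outlink counts would end up mixed in with the inlinks.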



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
