[ 
https://issues.apache.org/jira/browse/NUTCH-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699127#comment-13699127
 ] 

Brian edited comment on NUTCH-1524 at 7/3/13 4:28 PM:
------------------------------------------------------

Well this was frustrating... it turned out to be due to a bug introduced by a 
previous revision (I'll add a patch).  Specifically, in "DbUpdateReducer.java", 
the code that writes the inlinks to the page is repeated:
{code}
for (ScoreDatum inlink : inlinkedScoreData) {
   page.putToInlinks(new Utf8(inlink.getUrl()), new Utf8(inlink.getAnchor()));
}
{code}
It runs once, and then again when the minimum distance is computed.  As a 
result the same key/value pairs are added to the inlinks twice.  Apparently 
this creates a problem with the persistent storage: I guess it can't handle 
this case, and as a result no inlinks get written.  Removing the first loop 
(the initial put of the inlinks to the page) resolved the issue for me.
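
A minimal sketch of the single-pass shape this fix leaves behind, where each inlink is written exactly once in the same loop that tracks the minimum distance. The classes `InlinkUpdateSketch`, `Inlink` and `Page`, the `distance` field and the `putCalls` counter are hypothetical stand-ins for illustration only. They are not the real Nutch or Gora types, and the actual distance computation in "DbUpdateReducer.java" may differ.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only; these are NOT the real Nutch/Gora classes.
final class InlinkUpdateSketch {

    /** Stand-in for a ScoreDatum: source URL, anchor text, link distance (assumed field). */
    static final class Inlink {
        final String url;
        final String anchor;
        final int distance;

        Inlink(String url, String anchor, int distance) {
            this.url = url;
            this.anchor = anchor;
            this.distance = distance;
        }
    }

    /** Stand-in for the page row: an inlinks map plus a minimum distance. */
    static final class Page {
        final Map<String, String> inlinks = new LinkedHashMap<>();
        int minDistance = Integer.MAX_VALUE;
        int putCalls = 0; // counts writes so that a repeated put would be visible

        void putToInlinks(String url, String anchor) {
            putCalls++;
            inlinks.put(url, anchor);
        }
    }

    /** Writes every inlink once and tracks the minimum distance in the same pass. */
    static Page update(List<Inlink> inlinkedScoreData) {
        Page page = new Page();
        for (Inlink inlink : inlinkedScoreData) {
            page.putToInlinks(inlink.url, inlink.anchor);
            page.minDistance = Math.min(page.minDistance, inlink.distance);
        }
        return page;
    }
}
```

Calling `InlinkUpdateSketch.update(...)` with two inlinks results in two calls to `putToInlinks`, whereas the duplicated loop described above would put each pair twice.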


Unfortunately I didn't try this simple modification until after about a day's 
work... When I first noticed the duplicate code it seemed strange, but I 
didn't think it could be in any way related.  So I spent all day yesterday 
carefully checking all the code and configurations and trying different 
modifications, and was ready to conclude the issue was not with Nutch, until 
I finally thought the duplicate loop might be the cause, as it was really the 
only odd thing about the code.


Is this still maybe an enhancement for HBase/Gora? (I.e., writing no inlinks 
at all doesn't seem like the best behavior when there are duplicates.)

                
                  
> Internal links are not being saved even with change in parameter 
> (db.ignore.internal.links)
> -------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1524
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1524
>             Project: Nutch
>          Issue Type: Bug
>          Components: linkdb
>    Affects Versions: 2.1
>         Environment: Linux, Mac
>            Reporter: kiran
>             Fix For: 2.4
>
>
> The internal links are not being saved. I have tried changing the parameter 
> (db.ignore.internal.links) to false but still the internal links are not 
> saved. 
