[
https://issues.apache.org/jira/browse/NUTCH-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699127#comment-13699127
]
Brian edited comment on NUTCH-1524 at 7/3/13 4:12 PM:
------------------------------------------------------
Well, this was frustrating... It turned out to be caused by a bug introduced by
a previous revision (I'll add a patch). Specifically, in "DbUpdateReducer.java",
the loop that writes inlinks to the page is repeated:
{quote}
for (ScoreDatum inlink : inlinkedScoreData) {
  page.putToInlinks(new Utf8(inlink.getUrl()), new Utf8(inlink.getAnchor()));
}
{quote}
The loop runs once, and then again when the minimum distance is computed. As a
result, the same key/value pairs are added to the inlinks twice. Apparently
this creates a problem with the persistent storage: I guess it can't handle
this case, and as a result no inlinks get written. Removing the first loop (the
initial putting of the inlinks to the page) resolved the issue for me.
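For illustration, here is a minimal sketch of what the reducer logic looks like with only one loop left, where the inlinks are written in the same pass that finds the minimum distance. This is not the actual Nutch code or the patch: the ScoreDatum and Page classes below are hypothetical stand-ins (plain Strings and a Map instead of Utf8 and the Gora-backed page), and the getDistance() accessor and the "smallest inlink distance plus one" rule are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InlinkUpdateSketch {

    /** Hypothetical stand-in for Nutch's ScoreDatum; only the fields used here. */
    static final class ScoreDatum {
        private final String url;
        private final String anchor;
        private final int distance;

        ScoreDatum(String url, String anchor, int distance) {
            this.url = url;
            this.anchor = anchor;
            this.distance = distance;
        }

        String getUrl() { return url; }
        String getAnchor() { return anchor; }
        int getDistance() { return distance; }
    }

    /** Hypothetical stand-in for the page row: inlinks map source URL to anchor text. */
    static final class Page {
        final Map<String, String> inlinks = new LinkedHashMap<>();
        int distance = Integer.MAX_VALUE;

        void putToInlinks(String url, String anchor) {
            inlinks.put(url, anchor);
        }
    }

    /** Writes each inlink exactly once, in the same pass that finds the minimum distance. */
    static void updateInlinks(Page page, List<ScoreDatum> inlinkedScoreData) {
        int smallestDist = Integer.MAX_VALUE;
        for (ScoreDatum inlink : inlinkedScoreData) {
            smallestDist = Math.min(smallestDist, inlink.getDistance());
            page.putToInlinks(inlink.getUrl(), inlink.getAnchor());
        }
        if (smallestDist != Integer.MAX_VALUE) {
            // Assumption for this sketch: a page is one hop further than its closest inlink.
            page.distance = smallestDist + 1;
        }
    }

    public static void main(String[] args) {
        Page page = new Page();
        updateInlinks(page, Arrays.asList(
                new ScoreDatum("http://a.example/", "link from A", 2),
                new ScoreDatum("http://b.example/", "link from B", 1)));
        System.out.println(page.inlinks.size() + " inlinks, distance " + page.distance);
    }
}
```

The point is only that each key/value pair reaches putToInlinks once per update; whether the store itself should tolerate repeated puts is a separate question.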
Unfortunately, I didn't try this simple modification until after about a day's
work... When I first noticed the duplicate code it seemed strange, but I didn't
think it could be related in any way. So I spent all day yesterday carefully
checking all the code and configurations and trying different modifications,
and was ready to conclude that the issue was not with Nutch, until I finally
thought that the duplicate loop might be the cause, as it was really the only
odd thing about the code.
Is this still maybe an enhancement for HBase/Gora? (I.e., silently writing no
inlinks doesn't seem like the best behavior when there are duplicates.)
> Internal links are not being saved even with change in parameter
> (db.ignore.internal.links)
> -------------------------------------------------------------------------------------------
>
> Key: NUTCH-1524
> URL: https://issues.apache.org/jira/browse/NUTCH-1524
> Project: Nutch
> Issue Type: Bug
> Components: linkdb
> Affects Versions: 2.1
> Environment: Linux, Mac
> Reporter: kiran
> Fix For: 2.4
>
>
> The internal links are not being saved. I have tried changing the parameter
> (db.ignore.internal.links) to false but still the internal links are not
> saved.