Ok, I was able to run through a couple of fetch and index cycles without
issue. I put up an example of the commands I ran:
http://wiki.apache.org/nutch/NewScoringIndexingExample
Please check this and see whether it differs from what you are
currently running. That will help to narrow down potential problems.
Dennis
Dennis Kubes wrote:
I am looking into this now. Sorry about the delay. Any more
information you can provide would be helpful.
Dennis
Koch Martina wrote:
Hi,
I'm testing the webgraph functionality of the current trunk, but I
think I'm doing something wrong, because the LinkRank job always
aborts with the following error message:
2009-02-24 11:32:36,952 INFO webgraph.LinkRank - Finished link counter job
2009-02-24 11:32:36,952 INFO webgraph.LinkRank - Reading numlinks temp file
2009-02-24 11:32:36,952 INFO webgraph.LinkRank - Deleting numlinks temp file
2009-02-24 11:32:36,952 FATAL webgraph.LinkRank - LinkAnalysis: java.lang.NullPointerException
    at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:113)
    at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:582)
    at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:657)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:627)
I'm doing the following steps:
Injector - Generator - Fetcher2 - ParseSegment - WebGraph - Loops -
LinkRank - ScoreUpdater - CrawlDb - LinkDb - Indexer -
DeleteDuplicates - IndexMerger
If I ignore the fatal error of the LinkRank tool and continue, I get a
valid index, but every URL is set to the clear score value defined in
nutch-site.xml by the property link.score.updater.clear.score.
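For readers following along, a property like this would be declared in
nutch-site.xml roughly as sketched below. The property name is the one
mentioned above; the value 0.0 is only an illustrative assumption and is
not taken from the configuration used in this report.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Score assigned by ScoreUpdater to URLs that have no entry in the
       webgraph. The value below is a placeholder assumption. -->
  <property>
    <name>link.score.updater.clear.score</name>
    <value>0.0</value>
  </property>
</configuration>
```

If every URL ends up with exactly this value, that would be consistent
with ScoreUpdater finding no usable LinkRank scores, which fits the
LinkRank job having aborted earlier.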
I tested other sequences of the steps mentioned above, e.g. updating
CrawlDb first, before doing the scoring, or doing several generate -
fetch - parse cycles before starting the scoring for the first time,
but nothing helped.
I also tried to use the scoring-link plugin instead of doing the
scoring separately, but then many of the documents in the index get a
boost of 0.0 assigned, which is the default initialScore.
Do you have any suggestions on how to perform the webgraph scoring
correctly?
Kind regards,
Martina