Hi,
I'm testing the webgraph functionality of the current trunk, but I think I'm
doing something wrong, because the LinkRank job always aborts with the
following error message:
2009-02-24 11:32:36,952 INFO webgraph.LinkRank - Finished link counter job
2009-02-24 11:32:36,952 INFO webgraph.LinkRank - Reading numlinks temp file
2009-02-24 11:32:36,952 INFO webgraph.LinkRank - Deleting numlinks temp file
2009-02-24 11:32:36,952 FATAL webgraph.LinkRank - LinkAnalysis:
java.lang.NullPointerException
        at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:113)
        at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:582)
        at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:657)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:627)
I'm doing the following steps:
Injector - Generator - Fetcher2 - ParseSegment - WebGraph - Loops - LinkRank -
ScoreUpdater - CrawlDb - LinkDb - Indexer - DeleteDuplicates - IndexMerger
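For the webgraph-related steps, this is roughly how I invoke the tools (the crawl/ directory names are just from my local setup, and I run the tools via their class names since I am on trunk):

```shell
# Webgraph scoring steps as I run them; $s is the freshly fetched+parsed segment.
bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -webgraphdb crawl/webgraphdb -segment $s
bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb
bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb crawl/webgraphdb
bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb
```

The LinkRank invocation above is the one that dies with the NullPointerException.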
If I ignore the fatal error of the LinkRank tool and continue, I get a valid
index, but every URL ends up with the clear score value defined in
nutch-site.xml via the property link.score.updater.clear.score.
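For completeness, this is how the property is set in my nutch-site.xml (the value 0.0 is just what I happen to use; every indexed URL gets exactly this value):

```xml
<!-- Excerpt from my nutch-site.xml -->
<property>
  <name>link.score.updater.clear.score</name>
  <value>0.0</value>
</property>
```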
I have tested other orderings of the steps above, e.g. updating the CrawlDb
before doing the scoring, or running several generate - fetch - parse cycles
before starting the scoring for the first time, but nothing helped.
I also tried using the scoring-link plugin instead of running the scoring
separately, but then many of the documents in the index are assigned a boost
of 0.0, which is the default initialScore.
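In case it matters, this is how I enabled the plugin in nutch-site.xml (only the relevant fragment of my plugin.includes is shown; the rest is the default, with scoring-link substituted for scoring-opic):

```xml
<!-- Excerpt from my nutch-site.xml: scoring-link instead of scoring-opic -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|scoring-link</value>
</property>
```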
Do you have any suggestions on how to perform the webgraph scoring correctly?
Kind regards,
Martina