Ok, I was able to run through a couple of fetch and index cycles without issue. I put up an example of the commands I ran:

http://wiki.apache.org/nutch/NewScoringIndexingExample

Please compare this with what you are currently running and see if there are any differences. That will help to narrow down potential problems.

Dennis


Dennis Kubes wrote:
I am looking into this now. Sorry about the delay. Any more information you can provide would be helpful.

Dennis

Koch Martina wrote:
Hi,

I'm testing the webgraph functionality of the current trunk, but I think I'm doing something wrong, because the LinkRank job always aborts with the following error message:

2009-02-24 11:32:36,952 INFO webgraph.LinkRank - Finished link counter job
2009-02-24 11:32:36,952 INFO webgraph.LinkRank - Reading numlinks temp file
2009-02-24 11:32:36,952 INFO webgraph.LinkRank - Deleting numlinks temp file
2009-02-24 11:32:36,952 FATAL webgraph.LinkRank - LinkAnalysis: java.lang.NullPointerException
    at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:113)
    at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:582)
    at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:657)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:627)

I'm doing the following steps:
Injector - Generator - Fetcher2 - ParseSegment - WebGraph - Loops - LinkRank - ScoreUpdater - CrawlDb - LinkDb - Indexer - DeleteDuplicates - IndexMerger

If I ignore the fatal error of the LinkRank tool and continue, I get a valid index, but the score of every URL is set to the clear score value defined in nutch-site.xml by the property link.score.updater.clear.score.
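
For reference, a minimal sketch of how that property might appear in nutch-site.xml, using the usual Hadoop-style property layout. Only the property name comes from this thread; the value shown is a placeholder assumption, not the value from the setup described here.

```xml
<!-- Sketch only: the value below is a placeholder, not taken from this thread. -->
<property>
  <name>link.score.updater.clear.score</name>
  <value>0.0</value>
</property>
```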

I tested other sequences of the steps mentioned above, e.g. updating the CrawlDb first before doing the scoring, or doing several generate - fetch - parse cycles before starting the scoring for the first time, but nothing helped.

I also tried to use the scoring-link plugin instead of doing the scoring separately, but then many of the documents in the index are assigned a boost of 0.0, which is the default initialScore.

Do you have any suggestions on how to perform the webgraph scoring correctly?

Kind regards,

Martina


