Hello,

First - congratulations on becoming a new PMC member.

Second - I still have a problem with the new scoring framework.

After running:

bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment 
crawl/segments/20090306093949/ -webgraphdb crawl/webgraphdb


it seems that after dumping with:

bin/nutch org.apache.nutch.scoring.webgraph.NodeDumper

the webgraph is empty.
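A fuller NodeDumper invocation, if I read its usage message correctly, would be something like this (the option names are my assumption from the usage string, so they may be off):

```shell
# Dump the top inlink counts from the webgraphdb into a text output
# directory. Option names are taken from NodeDumper's usage message
# and should be double-checked against the current trunk.
bin/nutch org.apache.nutch.scoring.webgraph.NodeDumper \
  -webgraphdb crawl/webgraphdb \
  -inlinks \
  -topn 50 \
  -output crawl/webgraphdb-dump
```

The -inlinks switch can presumably be swapped for -outlinks or -scores to dump the other parts of the graph.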

Accessing the file webgraphdb/inlinks/part-00000/data, there is just an object 
header or something like it:
SEQorg.apache.hadoop.io.Text+org.apache.nutch.scoring.webgraph.LinkDatum
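That SEQ... line looks like an ordinary Hadoop SequenceFile header (key class Text, value class LinkDatum), so in theory the file can be printed with Hadoop's generic text reader; a sketch, assuming the hadoop script is available and the Nutch job jar is on its classpath so LinkDatum can be deserialized:

```shell
# -text understands SequenceFiles and prints key/value pairs, provided
# the value class (here LinkDatum) is on the classpath.
hadoop fs -text crawl/webgraphdb/inlinks/part-00000/data | head -20
```

If that prints nothing beyond the header, the part file really contains no records.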

Unfortunately my friend and I can't debug the WebGraph class for the moment; it's 
still too hard for us to understand.

It looks like this might be the reason LinkRank fails.

Any help highly appreciated.

Thanks,
Bartosz


Dennis Kubes writes:
Ok, I was able to run through a couple of fetch and index cycles without issue. I put up an example of the commands I ran:

http://wiki.apache.org/nutch/NewScoringIndexingExample

Please check it and see if there are differences from what you are currently running. That will help narrow down potential problems.

Dennis


Dennis Kubes wrote:
I am looking into this now. Sorry about the delay. Any more information you can provide would be helpful.

Dennis

Koch Martina wrote:
Hi,

I'm testing the webgraph functionality of the current trunk, but I think I'm doing something wrong, because the LinkRank job always aborts with the following error message:

2009-02-24 11:32:36,952 INFO  webgraph.LinkRank - Finished link counter job
2009-02-24 11:32:36,952 INFO  webgraph.LinkRank - Reading numlinks temp file
2009-02-24 11:32:36,952 INFO  webgraph.LinkRank - Deleting numlinks temp file
2009-02-24 11:32:36,952 FATAL webgraph.LinkRank - LinkAnalysis: java.lang.NullPointerException
        at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:113)
        at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:582)
        at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:657)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:627)

I'm doing the following steps:
Injector - Generator - Fetcher2 - ParseSegment - WebGraph - Loops - LinkRank - ScoreUpdater - CrawlDb - LinkDb - Indexer - DeleteDuplicates - IndexMerger
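Spelled out as commands, one cycle of the sequence above would look roughly like this; the segment name is made up and the webgraph tools' option names are my assumptions from their usage messages, so please treat it as a sketch rather than the exact commands I ran:

```shell
# Example segment name, not a real path; generate creates the actual one.
SEG=crawl/segments/20090224113236
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch $SEG     # Fetcher2 may be exposed under a different command name in trunk
bin/nutch parse $SEG
bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment $SEG -webgraphdb crawl/webgraphdb
bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb crawl/webgraphdb
bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb crawl/webgraphdb
bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb
bin/nutch updatedb crawl/crawldb $SEG
bin/nutch invertlinks crawl/linkdb $SEG
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $SEG
```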

If I ignore the fatal error from the LinkRank tool and continue, I get a valid index, but every URL is set to the clear score value defined in nutch-site.xml via the property link.score.updater.clear.score.
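For reference, my entry for that property in nutch-site.xml looks roughly like this (the value is just an example):

```
<property>
  <name>link.score.updater.clear.score</name>
  <value>1.0</value>
</property>
```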

I tested other orderings of the steps mentioned above, e.g. updating the CrawlDb first, before doing the scoring, or doing several generate - fetch - parse cycles before starting the scoring for the first time, but nothing helped.

I also tried using the scoring-link plugin instead of doing the scoring separately, but then many of the documents in the index get a boost of 0.0 assigned, which is the default initialScore.

Do you have any suggestions on how to perform the webgraph scoring correctly?

Kind regards,

Martina




