How big are your segments? Is it something you can send to me so I can
run it and see where it is breaking?
NodeDumper will allow you to dump the inlink counts, outlink counts, or
scores to a text file. You would need to specify the output directory,
and the command would look something like this:
bin/nutch org.apache.nutch.scoring.webgraph.NodeDumper -webgraph
/your/webgraphdb/ -inlinks -output /your/output/dir
You could also use -outlinks for outlinks or -scores for scores. Once
it completes, the output should be in a text file under your output
directory as part-00000. You can also use -topn to limit the dump to a
given number of top URLs (sorted by number of inlinks, number of
outlinks, or highest score), as in the example below.
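For example, to dump only the top 100 URLs by score (the paths here are
placeholders):

bin/nutch org.apache.nutch.scoring.webgraph.NodeDumper -webgraph
/your/webgraphdb/ -scores -topn 100 -output /your/output/dir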
The inlinks data you were looking at below is a SequenceFile, so it is
binary. It is part of the webgraphdb and is used during LinkRank, but it
is not usually accessed directly.
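If you do want to peek inside it, you can read it with the standard
Hadoop SequenceFile API. A minimal sketch (the class name and path are
just placeholders, and this assumes the Hadoop API as used by current
trunk):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.scoring.webgraph.LinkDatum;

public class InlinksDumper {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // one part of the inlinks database inside the webgraphdb
    Path data = new Path("/your/webgraphdb/inlinks/part-00000/data");
    FileSystem fs = data.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text key = new Text();              // the url
    LinkDatum value = new LinkDatum();  // the link record
    while (reader.next(key, value)) {
      System.out.println(key + "\t" + value);
    }
    reader.close();
  }
}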
Dennis
Bartosz Gadzimski wrote:
Hello,
First - congratulations on becoming a new PMC member.
Second - I still have a problem with the new scoring framework.
After
bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment
crawl/segments/20090306093949/ -webgraphdb crawl/webgraphdb
it seems that after dumping with
bin/nutch org.apache.nutch.scoring.webgraph.NodeDumper
the webgraph is empty.
Opening the file webgraphdb/inlinks/part-00000/data directly, there is
an object header or something like:
SEQorg.apache.hadoop.io.Text+org.apache.nutch.scoring.webgraph.LinkDatum
Unfortunately, my friend and I can't debug the WebGraph class at the
moment; it is still too hard for us to understand.
It looks like this might be the reason LinkRank fails.
Any help highly appreciated.
Thanks,
Bartosz
Dennis Kubes wrote:
Ok, I was able to run through a couple of fetch and index cycles
without issue. I put up an example of the commands I ran:
http://wiki.apache.org/nutch/NewScoringIndexingExample
Please check this and see if there are differences in what you are
currently running. That will help narrow down potential problems.
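In short, the scoring part of that example runs the webgraph tools in
order, roughly like this (the paths are placeholders, and the exact
commands are on the wiki page):

bin/nutch org.apache.nutch.scoring.webgraph.WebGraph -segment
crawl/segments/YOURSEGMENT -webgraphdb crawl/webgraphdb
bin/nutch org.apache.nutch.scoring.webgraph.Loops -webgraphdb
crawl/webgraphdb
bin/nutch org.apache.nutch.scoring.webgraph.LinkRank -webgraphdb
crawl/webgraphdb
bin/nutch org.apache.nutch.scoring.webgraph.ScoreUpdater -crawldb
crawl/crawldb -webgraphdb crawl/webgraphdb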
Dennis
Dennis Kubes wrote:
I am looking into this now. Sorry about the delay. Any more
information you can provide would be helpful.
Dennis
Koch Martina wrote:
Hi,
I'm testing the webgraph functionality of the current trunk, but I
think I'm doing something wrong, because the LinkRank job always
aborts with the following error message:
2009-02-24 11:32:36,952 INFO webgraph.LinkRank - Finished link
counter job
2009-02-24 11:32:36,952 INFO webgraph.LinkRank - Reading numlinks
temp file
2009-02-24 11:32:36,952 INFO webgraph.LinkRank - Deleting numlinks
temp file
2009-02-24 11:32:36,952 FATAL webgraph.LinkRank - LinkAnalysis:
java.lang.NullPointerException
        at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:113)
        at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:582)
        at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:657)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:627)
I'm doing the following steps:
Injector - Generator - Fetcher2 - ParseSegment - WebGraph - Loops -
LinkRank - ScoreUpdater - CrawlDb - LinkDb - Indexer -
DeleteDuplicates - IndexMerger
If I ignore the fatal error of the LinkRank tool and continue, I get
a valid index, but every URL is set to the clear score value defined
in nutch-site.xml by the property link.score.updater.clear.score.
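For reference, the entry in our nutch-site.xml looks like this (the
value shown is only an example):

<property>
  <name>link.score.updater.clear.score</name>
  <value>1.0</value>
</property>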
I tested other orderings of the steps mentioned above, e.g. updating
the CrawlDb first, before doing the scoring, or doing several
generate - fetch - parse cycles before starting the scoring for the
first time, but nothing helped.
I also tried to use the scoring-link plugin instead of doing the
scoring separately, but then many of the documents in the index get
a boost of 0.0 assigned, which is the default initialScore.
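For that test we had enabled the plugin via plugin.includes in
nutch-site.xml, roughly like this (the other plugins listed are only
an illustration; the point is adding scoring-link):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|scoring-link</value>
</property>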
Do you have any suggestions on how to perform the webgraph scoring
correctly?
Kind regards,
Martina