Re: [Nutch-dev] refetching all pages to update anchor text?

Matt Kangas Mon, 17 Jan 2005 14:52:47 -0800

Doug, I knew there had to be a bug on my end. ;-) Your suggestion was
exactly right. So after that fix and slight fiddling with the sample
htdocs files, I see the following (via "readdb -dumplinks"):

index.html:      4 inlinks
eggs1.html:      3 inlinks
eggs(2-4).html:  1 inlink

This results in the following scores in the WebDB (via "readdb -toppages"):

index.html:      Score: 1.3635329,  NextScore: 1.3635329
eggs1.html:      Score: 0.72126204, NextScore: 0.72126204
eggs(2-4).html:  Score: 0.38012782, NextScore: 0.38012782

This WebDB data looks correct to me.

Comparing this to the output from "segread -dump -nocontent", I can see:
- a mismatch between the segment and webdb page scores
- index.html has zero anchors recorded, and eggs1.html has only one
  anchor recorded

So my current questions are:

1) Are these the two segment anomalies that you wanted to see a
SegmentNormalizeTool correct?

2) Is there a "nutch readdb" command that will dump out the anchor
texts for each link, so I can do an apples-to-apples comparison in
this script?

3) What's the best route for rewriting this test script as a JUnit
class? Is there a way to do this cleanly w/o having to doctor up .conf
files by hand?

(latest version of the test script is attached.)

--Matt

On Mon, 17 Jan 2005 09:28:10 -0800, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Matt Kangas wrote:
> > Here is the output from "nutch readdb -dumplinks". This is clearly a
> > truncated link topology for these pages. Is this the result of a bug
> > in my script? Or is this something the tool should clean up?
> 
> It looks like db.ignore.internal.links is true, so that all but the
> first internal link are ignored.  This parameter determines what
> happens when you add a link from the same host as the page.  If the
> parameter is true and the page already has one or more links to it,
> then we ignore the new internal link.
> 
> Doug
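
For anyone following along, the rule Doug describes can be sketched roughly as below. This is only an illustration of the behavior as stated in his message, not the actual Nutch WebDB code: the function names, the dict-based link store, and the example.com URLs are all made up for the example.

```python
from urllib.parse import urlparse


def add_link(inlinks, from_url, to_url, ignore_internal_links=True):
    """Record from_url as an inlink of to_url, unless the rule says to skip it.

    Hypothetical model of the behavior described for
    db.ignore.internal.links; not taken from the Nutch sources.
    """
    existing = inlinks.setdefault(to_url, [])
    same_host = urlparse(from_url).hostname == urlparse(to_url).hostname
    # Skip a same-host link when the target already has at least one inlink.
    if ignore_internal_links and same_host and existing:
        return False
    existing.append(from_url)
    return True


def count_inlinks(links, ignore_internal_links):
    """Apply add_link to (from, to) pairs and return inlink counts per page."""
    inlinks = {}
    for from_url, to_url in links:
        add_link(inlinks, from_url, to_url, ignore_internal_links)
    return {url: len(sources) for url, sources in inlinks.items()}


# Made-up topology: four pages on one host, all linking to index.html.
INDEX = "http://example.com/index.html"
LINKS = [("http://example.com/eggs%d.html" % i, INDEX) for i in range(1, 5)]

if __name__ == "__main__":
    print(count_inlinks(LINKS, ignore_internal_links=True))
    print(count_inlinks(LINKS, ignore_internal_links=False))
```

Under this model, four same-host links to index.html collapse to a single recorded inlink when the flag is true, and all four are kept when it is false.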

test_segnormal.sh
Description: Binary data
