Doug, I knew there had to be a bug on my end. ;-) Your suggestion was exactly right. So after that fix and slight fiddling with the sample htdocs files, I see the following (via "readdb -dumplinks"):
  index.html: 4 inlinks
  eggs1.html: 3 inlinks
  eggs(2-4).html: 1 inlink

This results in the following scores in the WebDB (via "readdb -toppages"):

  index.html: Score: 1.3635329, NextScore: 1.3635329
  eggs1.html: Score: 0.72126204, NextScore: 0.72126204
  eggs(2-4).html: Score: 0.38012782, NextScore: 0.38012782

This WebDB data looks correct to me. Comparing this to the output from "segread -dump -nocontent", I can see:

- a mismatch between the segment and webdb page scores
- index.html has zero anchors recorded, and eggs1.html has only one anchor recorded

So my current questions are:

1) Are these the two segment anomalies that you wanted to see a SegmentNormalizeTool correct?

2) Is there a "nutch readdb" command that will dump out the anchor texts for each link, so I can do an apples-to-apples comparison in this script?

3) What's the best route for rewriting this test script as a JUnit class? Is there a way to do this cleanly w/o having to doctor up .conf files by hand?

(latest version of the test script is attached.)

--Matt

On Mon, 17 Jan 2005 09:28:10 -0800, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Matt Kangas wrote:
> > Here is the output from "nutch readdb -dumplinks". This is clearly a
> > truncated link topology for these pages. Is this the result of a bug
> > in my script? Or is this something the tool should clean up?
>
> It looks like db.ignore.internal.links is true, so that all but the
> first internal links are ignored. This parameter determines what
> happens when you add a link from the same host as the page. If the
> parameter is true and the page already has one or more links to it, then we
> ignore the new internal link.
>
> Doug
test_segnormal.sh
