Re: [Nutch-dev] refetching all pages to update anchor text?

Matt Kangas Fri, 14 Jan 2005 21:54:01 -0800

Doug, I'm still trying to put together a good, reproducible test case
for this proposed tool. Here's my second attempt. I think it's almost
in the ballpark, but I'd appreciate it if you could verify my
assumptions here.


Attached is a test script and a sample output file. Here's an overview
of the script:
- $ mkdir TEST TEST/htdocs TEST/db TEST/segments
- write 5 HTML files to the htdocs dir, and a urlfile pointing to the index.html
- create a webdb and inject the urlfile
- perform three crawl cycles over these files (generate/fetch/updatedb/analyze)
- print "readdb -dumplinks" and "segread -dump -nocontent
-noparsedata -noparsetext"

The page topology is:
index -> (eggs1, eggs2, eggs3, eggs4)
eggs1 -> (index)
eggs2 -> (index, eggs1)
eggs3 -> (index, eggs1)
eggs4 -> (index)
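
For reference, the setup steps for this topology could look roughly like the sketch below. It is not the attached test_segnormal.sh: the HTML bodies, the anchor text, the write_page helper, and the TEST/urlfile location are all assumptions made for illustration, and the Nutch steps (webdb creation, inject, crawl cycles) are left out.

```shell
#!/bin/sh
# Sketch only: builds the five-page test site described above.
# The HTML bodies, anchor text, write_page helper, and urlfile location
# are illustrative assumptions, not the contents of test_segnormal.sh.
set -e

mkdir -p TEST/htdocs TEST/db TEST/segments
BASE="file:$(pwd)/TEST/htdocs"

# write_page NAME TARGET... : write NAME.html with one link per TARGET.html
write_page() {
    name=$1; shift
    {
        echo "<html><head><title>$name</title></head><body>"
        for target in "$@"; do
            echo "<a href=\"$target.html\">link to $target</a>"
        done
        echo "</body></html>"
    } > "TEST/htdocs/$name.html"
}

# One call per row of the page topology
write_page index eggs1 eggs2 eggs3 eggs4
write_page eggs1 index
write_page eggs2 index eggs1
write_page eggs3 index eggs1
write_page eggs4 index

# Seed list containing only the index page
echo "$BASE/index.html" > TEST/urlfile
```

Each write_page call mirrors one row of the topology, so the expected link count per page can be read straight off the script.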

Given all this, my question is: what are the anomalies in this
program's output which should be fixed by a proper
SegmentNormalizeTool?

Here is the output from "nutch readdb -dumplinks". This is clearly a
truncated link topology for these pages. Is this the result of a bug
in my script? Or is this something the tool should clean up?

----
--readdb /home/kangas/nutch-cvs/TEST/db -dumplinks

expr: syntax error
050115 020227 loading file:/home/kangas/nutch-cvs/nutch/conf/nutch-default.xml
050115 020227 loading file:/home/kangas/nutch-cvs/nutch/conf/nutch-site.xml
[EMAIL PROTECTED]

from file:/home/kangas/nutch-cvs/TEST/htdocs/index.html
 to file:/home/kangas/nutch-cvs/TEST/htdocs/eggs1.html
 to file:/home/kangas/nutch-cvs/TEST/htdocs/eggs2.html
 to file:/home/kangas/nutch-cvs/TEST/htdocs/eggs3.html
 to file:/home/kangas/nutch-cvs/TEST/htdocs/eggs4.html
----
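
To make "truncated" concrete, here is a small sketch that diffs the intended topology against the links in the dump above. Page names are shortened and both edge lists are transcribed by hand from this message (an assumption of the sketch; it does not parse real readdb output).

```shell
#!/bin/sh
# Sketch: list edges of the intended topology that are absent from the dump.
# Both lists are hand-transcribed "from to" pairs with shortened page names.

expected=$(mktemp)
dumped=$(mktemp)

# Intended topology
cat > "$expected" <<'EOF'
index eggs1
index eggs2
index eggs3
index eggs4
eggs1 index
eggs2 index
eggs2 eggs1
eggs3 index
eggs3 eggs1
eggs4 index
EOF

# Links reported by "readdb -dumplinks" above
cat > "$dumped" <<'EOF'
index eggs1
index eggs2
index eggs3
index eggs4
EOF

# comm requires sorted input
sort -o "$expected" "$expected"
sort -o "$dumped" "$dumped"

# comm -23 prints lines that appear only in the first file
missing=$(comm -23 "$expected" "$dumped")
echo "$missing"
echo "missing links: $(echo "$missing" | wc -l | tr -d ' ')"

rm -f "$expected" "$dumped"
```

By this count the dump has only the four outlinks of index.html, and the six links originating from the eggs pages are absent.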

(PS: the "expr: syntax error" comes from the nutch script's "cygwin
path translation" line when run on FreeBSD -- seems mostly harmless..)

--Matt

test_segnormal.sh
Description: Binary data

test_segnormal.out
Description: Binary data
