Doug, I'm still trying to put together a good, reproducible test case for this proposed tool. Here's my second attempt. I think it's almost in the ballpark, but I'd appreciate it if you could verify my assumptions here.
Attached is a test script and a sample output file. Here's an overview of the script:

- $ mkdir TEST TEST/htdocs TEST/db TEST/segments
- writes 5 HTML files to the htdocs dir, and a urlfile pointing to the index.html
- creates a webdb and injects the urlfile
- performs three crawl cycles over these files (generate/fetch/updatedb/analyze)
- prints "readdb -dumplinks" and "segread -dump -nocontent -noparsedata -noparsetext"

The page topology is:

index -> (eggs1, eggs2, eggs3, eggs4)
eggs1 -> (index)
eggs2 -> (index, eggs1)
eggs3 -> (index, eggs1)
eggs4 -> (index)

Given all this, my question is: what are the anomalies in this program's output which should be fixed by a proper SegmentNormalizeTool?

Here is the output from "nutch readdb -dumplinks". This is clearly a truncated link topology for these pages: only the four links out of index.html appear, and none of the links from the eggs pages. Is this the result of a bug in my script? Or is this something the tool should clean up?

----
--readdb /home/kangas/nutch-cvs/TEST/db -dumplinks
expr: syntax error
050115 020227 loading file:/home/kangas/nutch-cvs/nutch/conf/nutch-default.xml
050115 020227 loading file:/home/kangas/nutch-cvs/nutch/conf/nutch-site.xml
[EMAIL PROTECTED] from file:/home/kangas/nutch-cvs/TEST/htdocs/index.html
 to file:/home/kangas/nutch-cvs/TEST/htdocs/eggs1.html
 to file:/home/kangas/nutch-cvs/TEST/htdocs/eggs2.html
 to file:/home/kangas/nutch-cvs/TEST/htdocs/eggs3.html
 to file:/home/kangas/nutch-cvs/TEST/htdocs/eggs4.html
----

(PS: the "expr: syntax error" comes from the nutch script's "cygwin path translation" line when run on FreeBSD. It seems mostly harmless.)

--Matt
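For anyone who wants to recreate the setup without the attachment, here is a rough sketch of the first two steps only (directory layout, the five HTML files, and the urlfile). It is not the attached test_segnormal.sh; the helper names write_page and expected_links, the ROOT variable, and the HTML markup are my own assumptions. It deliberately runs no nutch commands. At the end it prints every link the topology should contain, which gives a list to compare against the dumplinks output.

```shell
#!/bin/sh
# Sketch only: rebuilds the five-page test site from the topology described
# in this message. ROOT, write_page and expected_links are hypothetical names.
set -eu

ROOT="${ROOT:-$PWD/TEST}"
mkdir -p "$ROOT/htdocs" "$ROOT/db" "$ROOT/segments"

# write_page NAME TARGET... : write NAME.html with one link per TARGET.html
write_page() {
    name=$1; shift
    {
        printf '<html><head><title>%s</title></head><body>\n' "$name"
        for target in "$@"; do
            printf '<a href="%s.html">%s</a>\n' "$target" "$target"
        done
        printf '</body></html>\n'
    } > "$ROOT/htdocs/$name.html"
}

write_page index eggs1 eggs2 eggs3 eggs4
write_page eggs1 index
write_page eggs2 index eggs1
write_page eggs3 index eggs1
write_page eggs4 index

# The urlfile seeds the crawl with the index page only.
printf 'file:%s/htdocs/index.html\n' "$ROOT" > "$ROOT/urlfile"

# expected_links: print one "from -> to" line per href found in the pages
expected_links() {
    for f in "$ROOT"/htdocs/*.html; do
        from=$(basename "$f" .html)
        sed -n 's/.*href="\([^"]*\)\.html".*/\1/p' "$f" |
            while read -r to; do printf '%s -> %s\n' "$from" "$to"; done
    done
}

expected_links
```

The topology contains ten links in total (4 from index, 1 from eggs1, 2 each from eggs2 and eggs3, 1 from eggs4), while the dump above lists only the four from index.html, so six links are missing from the webdb.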
test_segnormal.sh
test_segnormal.out
