Folks, here's a Java version of my previous test script for a
yet-to-be-written SegmentNormalizeTool. Sample usage:

$ java foobit.nutch.TestSegmentNormalizeTool /tmp -init
$ java foobit.nutch.TestSegmentNormalizeTool /tmp -test

"-init" writes a set of HTML files to an htdocs directory, then
performs a synthetic crawl on them to produce a webdb and a set of
segments. "-test" walks the set of segments, comparing each document's
score and anchors to those in the webdb and reporting mismatches.
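
For anyone who wants the gist of the "-test" comparison without opening the
attachment, here is a minimal sketch. The names (MismatchSketch, PageInfo,
findMismatches) are hypothetical stand-ins and not the Nutch API; plain maps
keyed by URL take the place of the real webdb and segment readers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in types for illustration; these are NOT Nutch classes.
class MismatchSketch {

    /** Score and anchor texts recorded for one URL (assumed shape). */
    static final class PageInfo {
        final float score;
        final Set<String> anchors;

        PageInfo(float score, Set<String> anchors) {
            this.score = score;
            this.anchors = anchors;
        }
    }

    /**
     * Walks every segment document and compares its score and anchors
     * with the webdb entry for the same URL. Returns one message per
     * mismatch; an empty list means the segments agree with the webdb.
     */
    static List<String> findMismatches(Map<String, PageInfo> webdb,
                                       Map<String, PageInfo> segmentDocs) {
        List<String> mismatches = new ArrayList<>();
        for (Map.Entry<String, PageInfo> e : segmentDocs.entrySet()) {
            String url = e.getKey();
            PageInfo seg = e.getValue();
            PageInfo db = webdb.get(url);
            if (db == null) {
                mismatches.add(url + ": missing from webdb");
                continue;
            }
            if (Float.compare(seg.score, db.score) != 0) {
                mismatches.add(url + ": score " + seg.score
                        + " != " + db.score);
            }
            if (!seg.anchors.equals(db.anchors)) {
                mismatches.add(url + ": anchors differ");
            }
        }
        return mismatches;
    }
}
```

The exact float comparison is deliberate in this sketch, on the assumption
that a normalized segment should carry the same stored score as the webdb;
a tolerance could be swapped in if that assumption does not hold.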

Unfortunately, this is not yet useful as a unit test. The whole
"synthetic crawl" process depends heavily on local Nutch config
settings, such as having protocol-file enabled, and I have not yet
found a clean way to override these for the scope of a single test.
Suggestions for improving this would be appreciated.

Xin-Yi, how's your tool coming along? Ready for testing?

--Matt

Attachment: TestSegmentNormalizeTool.java