Folks, here's a Java version of my previous test script for a yet-to-be-written SegmentNormalizeTool. Sample usage:

    $ java foobit.nutch.TestSegmentNormalizeTool /tmp -init
    $ java foobit.nutch.TestSegmentNormalizeTool /tmp -test

"-init" writes a set of HTML files to an htdocs directory, then performs a synthetic crawl on them to produce a webdb and a set of segments.

"-test" walks the set of segments, comparing each document's score and anchors to those in the webdb and reporting mismatches.

Unfortunately, this is not yet useful as a unit test. The whole "synthetic crawl" process is heavily dependent on local Nutch config settings, like having protocol-file enabled, and I have not yet found a clean way to override these for the scope of one test. Suggestions for improving this would be appreciated.

Xin-Yi, how's your tool coming along? Ready for testing?

--Matt
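
P.S. For anyone who wants the gist of the "-test" comparison without opening the attachment, here is a minimal sketch. It is not the attached code and calls no Nutch APIs: the DocInfo class, the plain maps standing in for the webdb and a segment, and the file: URLs are all made up for illustration. It also assumes scores should match exactly, which may be too strict in practice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MismatchSketch {

    /** Hypothetical stand-in for what the webdb or a segment records about one page. */
    static final class DocInfo {
        final float score;
        final Set<String> anchors;

        DocInfo(float score, String... anchors) {
            this.score = score;
            this.anchors = new HashSet<>(Arrays.asList(anchors));
        }
    }

    /** Compares every segment entry to the webdb entry for the same URL. */
    static List<String> findMismatches(Map<String, DocInfo> webdb,
                                       Map<String, DocInfo> segment) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, DocInfo> entry : segment.entrySet()) {
            String url = entry.getKey();
            DocInfo seg = entry.getValue();
            DocInfo db = webdb.get(url);
            if (db == null) {
                problems.add(url + ": not in webdb");
                continue;
            }
            // Assumption: exact equality is the right check for scores.
            if (Float.compare(db.score, seg.score) != 0) {
                problems.add(url + ": score " + seg.score + " != " + db.score);
            }
            // Anchor order is ignored; only the set of anchor texts is compared.
            if (!db.anchors.equals(seg.anchors)) {
                problems.add(url + ": anchors differ");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        String a = "file:/tmp/htdocs/a.html"; // made-up URLs
        String b = "file:/tmp/htdocs/b.html";

        Map<String, DocInfo> webdb = new LinkedHashMap<>();
        webdb.put(a, new DocInfo(1.0f, "home"));
        webdb.put(b, new DocInfo(2.5f, "next", "more"));

        Map<String, DocInfo> segment = new LinkedHashMap<>();
        segment.put(a, new DocInfo(1.0f, "home")); // agrees with the webdb
        segment.put(b, new DocInfo(1.0f, "next")); // stale score and anchors

        for (String problem : findMismatches(webdb, segment)) {
            System.out.println(problem);
        }
    }
}
```

A normalize tool that works should drive the list of mismatches to empty, which is the condition a real unit test would assert on.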
TestSegmentNormalizeTool.java
