On Tue, Nov 9, 2010 at 3:53 PM, Marvin Humphrey <[email protected]> wrote:
> On Tue, Nov 09, 2010 at 04:51:33AM -0500, Robert Muir wrote:
>> Some quick notes, from lucene-java:
>
One more note that I forgot to mention: in Snowball's svn (but I think not in the libstemmer package) there is actually vocabulary test data: input files containing a sample vocabulary for each language, the expected output, and combined files called 'diffs' that show what the stemmer changes. These provide pretty good coverage for tests to ensure your integration is working. When they make a change to the algorithms, these files are updated too (though it seems not always in the same commit). Example:
http://svn.tartarus.org/snowball/trunk/data/german/diffs.txt?r1=527&r2=526&pathrev=527
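
As a rough illustration of how such a test might look, here is a minimal Python sketch. It assumes the vocabulary and the expected output are parallel files with one word per line. The helper names (`read_lines`, `find_stem_mismatches`) and the `stem` callable are hypothetical stand-ins for whatever your binding exposes; they are not part of Snowball or libstemmer.

```python
from typing import Callable, List, Tuple


def read_lines(path: str) -> List[str]:
    """Read a UTF-8 file and return its lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def find_stem_mismatches(
    stem: Callable[[str], str],
    words: List[str],
    expected: List[str],
) -> List[Tuple[str, str, str]]:
    """Return (word, expected_stem, actual_stem) for every disagreement.

    `stem` is assumed to map one input word to one stemmed word.
    """
    if len(words) != len(expected):
        # Parallel files must line up, otherwise every later pair is suspect.
        raise ValueError(
            "vocabulary has %d entries but expected output has %d"
            % (len(words), len(expected))
        )
    mismatches = []
    for word, want in zip(words, expected):
        got = stem(word)
        if got != want:
            mismatches.append((word, want, got))
    return mismatches
```

With a real checkout you would load the vocabulary and expected-output files for a language via `read_lines` (the paths depend on your checkout layout), pass them in along with your stemmer, and assert that the returned list is empty.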
