On Wed, Nov 10, 2010 at 12:10:51PM -0500, Robert Muir wrote:
> One more note that I forgot to mention: in snowball's svn (but i think not
> in the libstemmer pkg) there is actually vocabulary test data: input files
> containing a sample vocabulary for each language, expected output, and
> combined files called 'diffs' that show what the stemmer changes.
>
> these provide pretty good coverage for tests to ensure your
> integration is working... when they make a change to the algorithms
> these are updated too (though it seems not always in the same commit):
>
> example:
> http://svn.tartarus.org/snowball/trunk/data/german/diffs.txt?r1=527&r2=526&pathrev=527

I used this sample data to prepare tests for the Lingua::Stem::Snowball CPAN distribution. Now that we are bundling the Snowball C libraries, we are no longer benefitting by proxy from that test suite, and we should roll our own tests.

Yesterday, I adapted the update_snowstem.pl script in <https://issues.apache.org/jira/browse/LUCY-125> to work off of an svn checkout of Snowball; I committed the patches and closed the issue this morning.

Now I'll go add test data generation to update_snowstem.pl's capabilities and add new test files for each language to validate that our stemmers work properly.

Thanks for bringing it up!

Marvin Humphrey
