On Mon, Nov 8, 2010 at 10:01 PM, Marvin Humphrey <[email protected]> wrote:
> The "semiclean" build target has been added. I opened
> <https://issues.apache.org/jira/browse/LUCY-125> for bundling the Snowball
> stemming library. A separate issue will follow for bundling the stoplists.
>
Some quick notes, from lucene-java:

* Are you going to do svn checkouts for bundling Snowball? I don't think they
  are really making releases anymore, but there are in fact new languages,
  etc. in svn.

* Every so often Snowball changes the rules for the languages. This can be
  tricky depending on how you handle backwards compatibility. In lucene-java
  we have a checkout of revision 502, but with the newer languages added
  (Armenian, Catalan, Basque). If we fully 'svn updated' to the latest rev, it
  would change things about German stemming from our previous release, for
  example, and be a hassle for people who created indexes with those older
  versions.

* When bundling the stoplists: there are some languages, even "released" ones
  (Turkish, Romanian, etc.), that don't have Snowball-included stoplists. If
  you want, you could use the ones we have in Lucene to provide stoplists for
  these languages:

  http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/tr/stopwords.txt
  http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt
  http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hy/stopwords.txt
  http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/eu/stopwords.txt
  http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ca/stopwords.txt

  These are of variable quality. If a file has source information in its
  header, it means I found a list clearly marked as BSD or Apache. If it has
  no header, it means I made it myself. It might seem absurd to worry about
  "licensing" for stopwords, but you never know :)
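
In case it helps when bundling, here is a rough sketch of reading a stoplist
file that may carry a source/license header. It is only an illustration: the
function name load_stoplist is made up, and the comment markers ('#' and '|')
are an assumption about the file format, so check the actual files before
relying on it. Words are not lowercased, since naive lowercasing is wrong for
some languages (Turkish, for one).

```python
import io


def load_stoplist(lines, comment_chars=("#", "|")):
    """Return a set of stopwords from an iterable of text lines.

    Hypothetical helper: the comment markers are an assumption,
    not a documented format.
    """
    words = set()
    for line in lines:
        # Drop everything after the first comment marker on the line,
        # which also skips header lines carrying source/license info.
        for ch in comment_chars:
            line = line.split(ch, 1)[0]
        # Whatever whitespace-separated tokens remain are stopwords.
        words.update(line.split())
    return words


# Made-up sample content, standing in for a real stopwords.txt.
sample = io.StringIO("# source: example header\nacaba\nama | trailing note\n\n")
stopwords = load_stoplist(sample)
```

Keeping the header as a comment inside the file means the licensing
information travels with the list wherever it gets bundled.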
