[ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828054#action_12828054 ]
Robert Muir commented on LUCENE-2055:
-------------------------------------

Here is a short explanation of what I figure might be the controversial part: adding all the language-specific analyzers.

I think it's too difficult for a non-English user to use Lucene. Let's take the Romanian case. Sure, it's supported by SnowballAnalyzer, but:

* Where are the stopwords? If the user is smart enough they can google this and find Savoy's list... but it contains some stray nouns that should not be in there, and will they get the encoding correct?
* For some languages (French, Dutch, Turkish) we already want to do something different. For French we need the elision filter to tokenize correctly; for Dutch, the special dictionary-based exclusions (I have been told by some that any stemmer that does not handle "fiets" correctly is useless); for Turkish we need the special lowercasing.
* For other languages (German, Swedish, ...) I think we REALLY want to implement decompounding support in the future. For German at least, there is a public domain wordlist just itching to be used for this.
* Oh yeah, and all the javadocs are in English, so writing your own analyzer is another barrier to entry.

So I think instead it's best to have a "recommended default" organized by language, preferably one we have relevance tested or one that is already published. Many of the existing Snowball stemmers have published relevance results available already, thus my bias towards them.

Sure, it won't meet everyone's needs, and users should still think about using them as a template. But digging up your own stoplist, writing your own analyzer, and figuring out that your language support is really buried in Snowball, combined with documentation not in your native tongue, adds up to a barrier to entry that is simply too high.
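
To make two of the per-language points above concrete, here is a minimal, dependency-free Java sketch of French elision stripping and Turkish-aware lowercasing. It is not Lucene code and does not use Lucene's own filters: the class name LanguageQuirksSketch, the method names, and the short elision list are all hypothetical and chosen only for illustration. It also assumes a plain ASCII apostrophe and does not handle the typographic one.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration only; not part of Lucene. */
public class LanguageQuirksSketch {

    // Assumed, deliberately incomplete list of French elided prefixes.
    private static final Set<String> ELISIONS =
        new HashSet<String>(Arrays.asList("l", "d", "j", "m", "t", "s", "n", "c", "qu"));

    /** Removes a leading elided article or pronoun such as "l'" from one token. */
    public static String stripElision(String token) {
        int apostrophe = token.indexOf('\'');
        if (apostrophe > 0) {
            String prefix = token.substring(0, apostrophe).toLowerCase(Locale.FRENCH);
            if (ELISIONS.contains(prefix)) {
                return token.substring(apostrophe + 1);
            }
        }
        // No known elision prefix: leave the token untouched.
        return token;
    }

    /** Lowercases with Turkish rules, so 'I' becomes dotless U+0131, not 'i'. */
    public static String lowerTurkish(String token) {
        return token.toLowerCase(new Locale("tr", "TR"));
    }

    public static void main(String[] args) {
        // "l'avion" loses its article; "aujourd'hui" is kept whole.
        System.out.println(stripElision("l'avion"));
        System.out.println(stripElision("aujourd'hui"));
        // Turkish lowercasing differs from the English result for the same input.
        System.out.println(lowerTurkish("ISPARTA"));
        System.out.println("ISPARTA".toLowerCase(Locale.ENGLISH));
    }
}
```

Without the elision step, a stemmer or stopword filter would see "l'avion" rather than "avion"; without Turkish lowercasing, a query for a word with a dotless i would not match text lowercased under default rules. Both are the kind of detail a per-language "recommended default" analyzer would handle for the user.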
> Fix buggy stemmers and Remove duplicate analysis functionality
> --------------------------------------------------------------
>
>                 Key: LUCENE-2055
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2055
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>             Fix For: 3.1
>
>         Attachments: LUCENE-2055.patch
>
>
> would like to remove stemmers in the following packages, and instead in their
> analyzers use a SnowballStemFilter instead.
> * analyzers/fr
> * analyzers/nl
> * analyzers/ru
> below are excerpts from this code where they proudly proclaim they use the
> snowball algorithm.
> I think we should delete all of this custom stemming code in favor of the
> actual snowball package.
> {noformat}
> /**
>  * A stemmer for French words.
>  * <p>
>  * The algorithm is based on the work of
>  * Dr Martin Porter on his snowball project<br>
>  * refer to http://snowball.sourceforge.net/french/stemmer.html<br>
>  * (French stemming algorithm) for details
>  * </p>
>  */
> public class FrenchStemmer {
>
> /**
>  * A stemmer for Dutch words.
>  * <p>
>  * The algorithm is an implementation of
>  * the <a href="http://snowball.tartarus.org/algorithms/dutch/stemmer.html">dutch stemming</a>
>  * algorithm in Martin Porter's snowball project.
>  * </p>
>  */
> public class DutchStemmer {
>
> /**
>  * Russian stemming algorithm implementation (see http://snowball.sourceforge.net for detailed description).
>  */
> class RussianStemmer
> {noformat}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.