[jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality

Robert Muir (JIRA) Mon, 01 Feb 2010 01:58:25 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828054#action_12828054 ]


Robert Muir commented on LUCENE-2055:
-------------------------------------

Here is a short explanation of what I figure might be the controversial part: 
adding all the language-specific analyzers.

I think it's too difficult for a non-English user to use Lucene. 
Take the Romanian case. Sure, it's supported by SnowballAnalyzer, but:
* Where are the stopwords? If the user is smart enough they can google this and 
find Savoy's list... but it contains some stray nouns that should not be in 
there, and will they get the encoding correct?
* For some languages (French, Dutch, Turkish) we already want to do something 
different. For French we need the elision filter to tokenize correctly; 
for Dutch, the special dictionary-based exclusions (I have been told by some 
that any stemmer that does not handle "fiets" correctly is useless); for 
Turkish we need the special lowercasing.
* For other languages (German, Swedish, ...) I think we REALLY want to implement 
decompounding support in the future. For German at least, there is a public 
domain wordlist just itching to be used for this.
* Oh yeah, and all the javadocs are in English, so writing your own analyzer is 
another barrier to entry.
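
To make the French and Turkish points above concrete, here is a rough 
plain-Java sketch that does not use the Lucene API. The class name, the method 
names, and the article list are all made up for illustration. It only shows the 
kind of per-language handling a default analyzer would need to bundle, not how 
the contrib filters are actually implemented.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class LanguageQuirks {
    // Hypothetical list of elided French articles; a real list needs review.
    private static final Set<String> FRENCH_ARTICLES =
        new HashSet<>(Arrays.asList("l", "m", "t", "qu", "n", "s", "j", "d", "c"));

    /** Strips a leading elided article, e.g. "l'avion" becomes "avion". */
    static String stripElision(String token) {
        int i = token.indexOf('\'');
        if (i < 0) {
            i = token.indexOf('\u2019'); // typographic apostrophe
        }
        if (i > 0) {
            String prefix = token.substring(0, i).toLowerCase(Locale.ROOT);
            if (FRENCH_ARTICLES.contains(prefix)) {
                return token.substring(i + 1);
            }
        }
        return token; // no known article in front: leave the token alone
    }

    /** Lowercases with Turkish rules: 'I' maps to dotless i (U+0131). */
    static String lowerTurkish(String token) {
        return token.toLowerCase(Locale.forLanguageTag("tr"));
    }

    public static void main(String[] args) {
        System.out.println(stripElision("l'avion"));
        // The two results differ in the first letter only.
        System.out.println(lowerTurkish("ISPARTA"));
        System.out.println("ISPARTA".toLowerCase(Locale.ROOT));
    }
}
```

A user who lowercases Turkish text with the default rules gets a dotted "i" 
where a dotless one belongs, so the indexed terms no longer match what a 
Turkish speaker types. That is exactly the kind of detail a per-language 
default should take care of.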

So I think instead it's best to have a "recommended default" organized by 
language, preferably one we have relevance tested or that is already published. 
Many of the existing Snowball stemmers have published relevance results 
available already, thus my bias towards them. Sure, it won't meet everyone's 
needs, and users should still think about using them as a template. But 
digging up your own stoplist, writing your own analyzer, and figuring out that 
your language support is really buried in Snowball, combined with 
documentation not in your native tongue, adds up to a barrier to entry that is 
simply too high.



> Fix buggy stemmers and Remove duplicate analysis functionality
> --------------------------------------------------------------
>
>                 Key: LUCENE-2055
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2055
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>             Fix For: 3.1
>
>         Attachments: LUCENE-2055.patch
>
>
> I would like to remove the stemmers in the following packages and have their 
> analyzers use a SnowballStemFilter instead.
> * analyzers/fr
> * analyzers/nl
> * analyzers/ru
> Below are excerpts from this code where they proudly proclaim they use the 
> Snowball algorithm.
> I think we should delete all of this custom stemming code in favor of the 
> actual snowball package.
> {noformat}
> /**
>  * A stemmer for French words. 
>  * <p>
>  * The algorithm is based on the work of
>  * Dr Martin Porter on his snowball project<br>
>  * refer to http://snowball.sourceforge.net/french/stemmer.html<br>
>  * (French stemming algorithm) for details
>  * </p>
>  */
> public class FrenchStemmer {
> /**
>  * A stemmer for Dutch words. 
>  * <p>
>  * The algorithm is an implementation of
>  * the <a href="http://snowball.tartarus.org/algorithms/dutch/stemmer.html">dutch stemming</a>
>  * algorithm in Martin Porter's snowball project.
>  * </p>
>  */
> public class DutchStemmer {
> /**
>  * Russian stemming algorithm implementation (see http://snowball.sourceforge.net for detailed description).
>  */
> class RussianStemmer
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
