Hi Pete,

IMHO you could also use stemmers which are 1) faster, 2) more accurate, 3) able to learn and process *any* language, and 4) able to work as a lemmatiser/guesser. I know two algorithms which have all of these properties:

The first one is based on Jan Daciuk's MFSA, and the second one is, ehm no self-promotion ;-), my method. The comparison of these two methods is here: http://www.egothor.org/temp/us-0E2-cmp.png (English dictionary)

My method was designed for IR systems, so it gives better accuracy in such environments. I was also interested in compound words (-> German), so I can offer you a multilevel stemmer which does the job. In other cases you may get better results with Jan's method.

Leo

Pete Lewis wrote:

Hi all

I know that I have no vote, but I think it would be wrong to bring the SnowballAnalyzer into the core.

There are some distinct limitations with this purely algorithmic approach. Yes, it would be great to say 'hey, we have 14 languages covered', but you should first realise the limitations of the product. Let's start with some definitions...

'Stemming' signifies the process of finding the stems in words. 'Lemmatisation' is the process of reducing the word form to its 'lemma' form, i.e. the form one expects to find in a dictionary. The differences are:

1. In many languages the dictionary form is not the stem, e.g. in Dutch the infinitive of a verb is not its stem.

2. Words may have several stems due to composition (common in Germanic languages).

The terms are both used extremely loosely in the literature, where they often indicate the same thing.



A tool often used for English is the Porter stemmer. Strictly speaking, it is neither a stemmer nor a lemmatiser; it cuts off certain characters on the basis of the characters before them. In many cases morphologically equivalent forms reduce to the same root form. There have been efforts to create similar algorithmic tools for other languages. Porter has lately designed a language called Snowball for writing scripts that perform these reductions, and Snowball has been applied to a number of languages; in many cases these scripts are publicly available. Snowball is not capable of handling composition, nor other more demanding morphological patterns such as agglutination and infixes.
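To make that concrete, here is a rough sketch (assuming the 1.x-era sandbox API, i.e. org.apache.lucene.analysis.snowball.SnowballAnalyzer together with the old TokenStream.next()/Token.termText() calls) of how one could print the reductions Snowball applies:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

public class SnowballDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new SnowballAnalyzer("English");
        TokenStream stream = analyzer.tokenStream("body",
                new StringReader("buses caught management"));
        // Walk the token stream and print each reduced form.
        for (Token t = stream.next(); t != null; t = stream.next()) {
            System.out.println(t.termText());   // e.g. buse, caught, manag
        }
    }
}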



Basically, people expect the terms in the search clue to be reduced to the same root form as the one used for indexing, and hence to be able to find the different derivations of a term (plurals etc.).
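The catch is that this only works if the very same analyzer sits on both the indexing side and the query side. A sketch, under the same 1.x-era API assumptions as above (the index path and field name are made up for illustration):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class RoundTripDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new SnowballAnalyzer("English");

        // Index a document containing "buses" with the Snowball analyzer.
        IndexWriter writer = new IndexWriter("/tmp/demo-index", analyzer, true);
        Document doc = new Document();
        doc.add(Field.Text("body", "the buses were late"));
        writer.addDocument(doc);
        writer.close();

        // Search with the *same* analyzer, so the clue goes through
        // exactly the same reduction as the indexed text did.
        Query query = QueryParser.parse("buses", "body", analyzer);
        IndexSearcher searcher = new IndexSearcher("/tmp/demo-index");
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " hit(s)");
        searcher.close();
    }
}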



Some examples from Snowball should speak for themselves:



bus -> bus

buses -> buse

catch -> catch

caught -> caught

manage -> manag

management -> manag



showing incorrect handling of plurals and irregular forms, and conflation of verbs and nouns. Obviously many other examples can be found.
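For anyone wondering where 'buse' comes from: it falls straight out of the plural rule. Below is a toy re-implementation of step 1a only -- not the real Snowball code, and with every other rule left out -- that mirrors why 'buses' loses just its final 's' while 'bus' and 'caught' are left alone:

public class Step1aDemo {

    // Toy version of the plural-handling step (step 1a) of the Snowball
    // English stemmer; all other steps are omitted.
    static String step1a(String w) {
        if (w.endsWith("sses")) return w.substring(0, w.length() - 2); // caresses -> caress
        if (w.endsWith("ies"))  return w.substring(0, w.length() - 2); // ponies   -> poni
        if (w.endsWith("us") || w.endsWith("ss")) return w;            // bus -> bus, caress -> caress
        if (w.endsWith("s")) {
            // delete the final "s" only if a vowel occurs somewhere before
            // the letter that immediately precedes it: buses -> buse
            for (int i = 0; i < w.length() - 2; i++) {
                if ("aeiouy".indexOf(w.charAt(i)) >= 0) {
                    return w.substring(0, w.length() - 1);
                }
            }
            return w;
        }
        return w;                                                      // caught -> caught
    }

    public static void main(String[] args) {
        String[] words = { "bus", "buses", "cats", "caress", "caught" };
        for (int i = 0; i < words.length; i++) {
            System.out.println(words[i] + " -> " + step1a(words[i]));
        }
    }
}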



While this isn't too bad for English, it gets pretty dire for other languages.



For English I'd prefer KStem rather than Snowball.
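KStem ships as its own package from UMass, so I can't vouch for its exact API here; the Stemmer interface below is a hypothetical stand-in for whatever its entry point looks like. The point is just that any dictionary-based stemmer can be dropped into Lucene's analysis chain by wrapping it in a TokenFilter (again assuming the 1.x-era Token/TokenFilter API):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical stand-in for the real KStem entry point; the actual
// UMass package has its own class and method names.
interface Stemmer {
    String stem(String word);
}

public class StemTokenFilter extends TokenFilter {
    private final Stemmer stemmer;

    public StemTokenFilter(TokenStream in, Stemmer stemmer) {
        super(in);
        this.stemmer = stemmer;
    }

    // Replace each token's text with its stemmed form, keeping offsets and type.
    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) return null;
        return new Token(stemmer.stem(t.termText()),
                         t.startOffset(), t.endOffset(), t.type());
    }
}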



Cheers



Pete





----- Original Message ----- From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene List" <[EMAIL PROTECTED]>
Sent: Monday, October 06, 2003 6:49 PM
Subject: SnowballAnalyzer





At one point, I believe, it was proposed to bring the sandbox SnowballAnalyzer into the core. Is this still desired or shall we just leave it in the sandbox?

Erik

