Hi all

I know that I have no vote but I think that it would be wrong to bring the 
SnowballAnalyzer into the core.

There are some distinct limitations to this purely algorithmic approach.  Yes, it would 
be great to say 'hey, we have 14 languages covered', but you should first realise the 
limitations of the product.  Let's start with some definitions.

'Stemming' is the process of finding the stems of words. 'Lemmatisation' is the 
process of reducing a word form to its 'lemma', i.e. the form one expects to 
find in a dictionary. The differences are:

1.      In many languages the dictionary form is not the stem. E.g. in Dutch the 
infinitive of a verb is not its stem.

2.      Words may have several stems due to composition (common in Germanic languages).

Both terms are used extremely loosely in the literature, where they often indicate 
the same thing.



A tool often used for English is the Porter stemmer. Strictly speaking, it is neither 
a stemmer nor a lemmatiser; it cuts off certain characters on the basis of the 
characters before them. In many cases morphologically equivalent forms reduce to the 
same root form. There have been efforts to create similar algorithmic tools for other 
languages. Porter has more recently designed a language called Snowball for writing 
scripts that perform these reductions. Snowball has been applied to a number of 
languages, and in many cases the scripts are publicly available. Snowball is not 
capable of handling composition, nor of handling other more demanding morphological 
patterns, such as agglutination and infixes.
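
To make "cuts off certain characters on the basis of the characters before them" 
concrete, here is a minimal Python sketch of one such rule, loosely modelled on the 
plural-stripping step of the English stemmer. The name strip_plural is hypothetical 
and the rule is deliberately simplified; it is an illustration, not the actual 
Snowball script.

```python
VOWELS = set("aeiouy")

def strip_plural(word):
    """Toy suffix rule (hypothetical): what gets cut depends only on the preceding characters."""
    if word.endswith("sses"):
        return word[:-2]              # "sses" -> "ss"
    if word.endswith(("ss", "us")):
        return word                   # leave "ss" and "us" endings alone
    if word.endswith("s"):
        # Drop the final "s" only if there is a vowel somewhere before it
        # that is not the character immediately preceding the "s".
        if any(ch in VOWELS for ch in word[:-2]):
            return word[:-1]
    return word

for w in ("bus", "buses", "catch", "caught"):
    print(w, "->", strip_plural(w))
```

Note that the rule never consults a dictionary: "buses" loses its "s" purely because 
of the letters in front of it, and "caught" is untouched because no suffix pattern 
applies.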



Basically, people expect the terms in a search query to be reduced to the same root 
form as that used for indexing, so that a search finds the different derivations of 
a term (plurals etc.).
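
As a rough sketch of that expectation, the Python below applies one and the same 
reduction at index time and at query time. Everything here is assumed for 
illustration: reduce_term is a hypothetical stand-in for a real stemmer, and the 
tiny dictionary-based index has nothing to do with Lucene's actual implementation.

```python
from collections import defaultdict

def reduce_term(term):
    # Hypothetical stand-in for a real stemmer: lowercase, then drop one trailing "s".
    term = term.lower()
    if len(term) > 3 and term.endswith("s"):
        return term[:-1]
    return term

def build_index(docs):
    # Map each reduced term to the set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[reduce_term(token)].add(doc_id)
    return index

def search(index, query):
    # The query term goes through the same reduction as the indexed terms.
    return index.get(reduce_term(query), set())

docs = {1: "two cats", 2: "one cat", 3: "he caught it"}
index = build_index(docs)
print(search(index, "cats"))   # finds documents 1 and 2
print(search(index, "catch"))  # finds nothing: "caught" was never reduced to "catch"
```

The plural is found because both sides reduce to the same form; the irregular past 
tense is missed because the reduction knows nothing about it.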



Some examples from Snowball should speak for themselves:



bus -> bus
buses -> buse
catch -> catch
caught -> caught
manage -> manag
management -> manag



These show incorrect handling of plurals and irregular forms, and the conflation of 
verbs with nouns.  Many other examples can easily be found.
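
One way to see the effect on search is to group the words by the stem they end up 
with. The short Python sketch below does this for the six pairs listed above; the 
table is copied from this message rather than produced by running Snowball, and 
group_by_stem is just a hypothetical helper name.

```python
from collections import defaultdict

# Word -> stem pairs copied from the examples in this message.
EXAMPLES = {
    "bus": "bus",
    "buses": "buse",
    "catch": "catch",
    "caught": "caught",
    "manage": "manag",
    "management": "manag",
}

def group_by_stem(pairs):
    # Words sharing a stem would match each other in a stemmed index.
    groups = defaultdict(list)
    for word, stem in pairs.items():
        groups[stem].append(word)
    return dict(groups)

groups = group_by_stem(EXAMPLES)
for stem, words in groups.items():
    print(stem, words)
```

Only "manage" and "management" land in the same group, so a search for "bus" would 
not find "buses" and a search for "catch" would not find "caught".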



While this isn't too bad for English, it gets pretty dire for other languages.



For English I'd prefer KStem rather than Snowball.



Cheers



Pete





----- Original Message ----- 
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene List" <[EMAIL PROTECTED]>
Sent: Monday, October 06, 2003 6:49 PM
Subject: SnowballAnalyzer


> At one point, I believe, it was proposed to bring the sandbox 
> SnowballAnalyzer into the core.  Is this still desired or shall we just 
> leave it in the sandbox?
> 
> Erik