Re: SnowballAnalyzer

Jim Hargrave Wed, 08 Oct 2003 07:48:09 -0700

Can you post some references to this work? I tried a Google search for "Jan Daciuk 
MFSA" and didn't find anything relevant.
 
J


>>> [EMAIL PROTECTED] 10/07/03 05:13PM >>>
Hi Pete,

IMHO you could also use stemmers which are 1) faster, 2) more accurate, 3) 
able to learn and process *any* language, and 4) able to work as a 
lemmatiser/guesser. I know two algorithms which have all these properties:

The first one is based on Jan Daciuk's MFSA, and the second one is, ehm 
no self-promotion ;-), my method. The comparison of these two methods is 
here: http://www.egothor.org/temp/us-0E2-cmp.png (English dictionary)

My method was designed for IR systems, so it gives better accuracy in 
such environments. I was also interested in compound words (->German), 
so I can offer you a multilevel stemmer which does the job. Elsewhere 
you may have better results with Jan's method.

Leo

Pete Lewis wrote:

>Hi all
>
>I know that I have no vote but I think that it would be wrong to bring the 
>SnowballAnalyzer into the core.
>
>There are some distinct limitations with this pure algorithmic approach.  Yes it 
>would be great to say 'hey, we have 14 languages covered' but you should first 
>realise the limitations of the product.  Let's start with some definitions....
>
>'Stemming' signifies the process of finding the stems in words. 'Lemmatisation' is 
>the process of reducing the word form to its 'lemma' form, i.e. the form one expects 
>to find in a dictionary. The differences are:
>
>1.      In many languages the dictionary form is not the stem. E.g. in Dutch the 
>infinitive verb is not its stem.
>
>2.      Words may have several stems due to composition (common in Germanic 
>languages).
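To make point 2 concrete, here is a minimal sketch of splitting a compound into several stems by dictionary lookup. The lexicon, the function name `split_compound` and the longest-prefix-first strategy are all assumptions made for illustration; this is not how Snowball or any of the stemmers named in this thread works.

```python
# Hypothetical toy lexicon of known stems; a real system needs a full dictionary.
LEXICON = {"haus", "boot", "fahr", "rad", "schiff"}


def split_compound(word, lexicon=LEXICON):
    """Return one split of `word` into lexicon stems, or None if none exists."""
    word = word.lower()
    if not word:
        return []
    # Try the longest prefix first, then recurse on the remainder.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in lexicon:
            rest = split_compound(word[end:], lexicon)
            if rest is not None:
                return [head] + rest
    return None


print(split_compound("Hausboot"))  # ['haus', 'boot']
print(split_compound("Fahrrad"))   # ['fahr', 'rad']
print(split_compound("Katze"))     # None (not covered by the toy lexicon)
```

A purely suffix-based rule set has no lexicon to consult, so it cannot produce a split like this.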
>
>The terms are both used extremely loosely in the literature, where they often 
>indicate the same thing.
>
>A tool often used for English is the Porter-stemmer. Strictly speaking, it is neither 
>a stemmer nor a lemmatiser; it cuts off certain characters on the basis of characters 
>before them. In many cases morphologically equivalent forms reduce to the same root 
>form. There have been efforts to create similar algorithmic tools for other 
>languages. Porter has lately designed a language called Snowball, to create scripts 
>for performing these reductions. Snowball has been applied to a number of languages. 
>In many cases these scripts are available to the public. Snowball is not capable of 
>handling composition. Nor is it capable of handling other more demanding 
>morphological patterns, such as agglutination and infixes.
>
>Basically people would expect the terms in the search query to be reduced to the same 
>root form as that used for indexing and hence would then be able to find the 
>different derivations of the term (plurals etc).
>
>Some examples from Snowball should speak for themselves:
>
>bus -> bus
>buses -> buse
>catch -> catch
>caught -> caught
>manage -> manag
>management -> manag
>
>showing incorrect handling of plurals, irregs, and mixing verbs & nouns.  Obviously 
>many other examples can be found.
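The effect of these reductions on search can be sketched with a tiny inverted index. Everything below is an assumption made for illustration: `strip_suffix` is a one-rule toy chosen only to reproduce the bus/buses behaviour above (it is not Snowball's rule set), and `LEMMAS` is a hand-made lookup table standing in for a dictionary-based lemmatiser.

```python
from collections import defaultdict


def strip_suffix(word):
    """Toy rule: drop a final 's' unless the word ends in 'ss' or 'us'."""
    if word.endswith("s") and not word.endswith(("ss", "us")):
        return word[:-1]
    return word


# Hypothetical lemma table; a real lemmatiser needs a full dictionary.
LEMMAS = {"buses": "bus", "caught": "catch"}


def lemmatise(word):
    return LEMMAS.get(word, word)


def build_index(docs, normalise):
    """Map each normalised token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalise(token)].add(doc_id)
    return index


def search(index, term, normalise):
    """Normalise the query term the same way as the indexed tokens."""
    return sorted(index.get(normalise(term.lower()), set()))


DOCS = {1: "two buses left", 2: "he caught the bus"}
stem_index = build_index(DOCS, strip_suffix)
lemma_index = build_index(DOCS, lemmatise)

print(search(stem_index, "bus", strip_suffix))    # [2]    'buses' was indexed as 'buse'
print(search(lemma_index, "bus", lemmatise))      # [1, 2]
print(search(stem_index, "catch", strip_suffix))  # []     'caught' was left unchanged
print(search(lemma_index, "catch", lemmatise))    # [2]
```

The index and the query go through the same function in both cases; the misses come from related forms being reduced to different strings.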
>
>While this isn't too bad for English it gets pretty dire for other languages.
>
>For English I'd prefer KStem rather than Snowball.
>
>Cheers
>
>Pete
>
>----- Original Message ----- 
>From: "Erik Hatcher" <[EMAIL PROTECTED]>
>To: "Lucene List" <[EMAIL PROTECTED]>
>Sent: Monday, October 06, 2003 6:49 PM
>Subject: SnowballAnalyzer
>
>
>  
>
>>At one point, I believe, it was proposed to bring the sandbox 
>>SnowballAnalyzer into the core.  Is this still desired or shall we just 
>>leave it in the sandbox?
>>
>>Erik
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: [EMAIL PROTECTED] 
>>For additional commands, e-mail: [EMAIL PROTECTED] 



