Can you post some references to this work? I tried a Google search for "Jan Daciuk MFSA" and didn't find anything relevant.

J

>>> [EMAIL PROTECTED] 10/07/03 05:13PM >>>
Hi Pete,

IMHO you could also use stemmers which are 1) faster 2) more accurate 3) able to learn and process *any* language 4) able to work as lemmatiser/guesser.

I know two algorithms which have all these properties: the first one is based on Jan Daciuk's MFSA, and the second one is, ehm no self-promotion ;-), my method. The comparison of these two methods is here:
http://www.egothor.org/temp/us-0E2-cmp.png (English dictionary)

My method was designed for IR systems, thus it gives better accuracy in such environments. I was also interested in compound words (->German), thus I can offer you a multilevel stemmer which does the job. Elsewhere you may have better results with Jan's method.

Leo

Pete Lewis wrote:

>Hi all
>
>I know that I have no vote but I think that it would be wrong to bring the
>SnowballAnalyzer into the core.
>
>There are some distinct limitations with this pure algorithmic approach. Yes, it
>would be great to say 'hey, we have 14 languages covered' but you should first
>realise the limitations of the product. Let's start with some definitions....
>
>'Stemming' signifies the process of finding the stems in words. 'Lemmatisation' is
>the process of reducing the word form to its 'lemma' form, i.e. the form one expects
>to find in a dictionary. The differences are:
>
>1. In many languages the dictionary form is not the stem. E.g. in Dutch the
>infinitive verb is not its stem.
>
>2. Words may have several stems due to composition (common in Germanic
>languages).
>
>The terms are both used extremely loosely in the literature, where they often
>indicate the same thing.
>
>A tool often used for English is the Porter-stemmer. Strictly speaking, it is neither
>a stemmer nor a lemmatiser; it cuts off certain characters on the basis of characters
>before them. In many cases morphologically equivalent forms reduce to the same root
>form. There have been efforts to create similar algorithmic tools for other
>languages. Porter has lately designed a language called Snowball, to create scripts
>for performing these reductions. Snowball has been applied to a number of languages.
>In many cases these scripts are available to the public. Snowball is not capable of
>handling composition. Nor is it capable of handling other more demanding
>morphological patterns, such as agglutination and infixes.
>
>Basically people would expect the terms in the search clue to be reduced to the same
>root form as that used for indexing and hence would then be able to find the
>different derivations of the term (plurals etc).
>
>Some examples from Snowball should speak for themselves:
>
>bus -> bus
>buses -> buse
>catch -> catch
>caught -> caught
>manage -> manag
>management -> manag
>
>showing incorrect handling of plurals and irregulars, and mixing of verbs & nouns.
>Obviously many other examples can be found.
>
>While this isn't too bad for English it gets pretty dire for other languages.
>
>For English I'd prefer KStem rather than Snowball.
>
>Cheers
>
>Pete
>
>----- Original Message -----
>From: "Erik Hatcher" <[EMAIL PROTECTED]>
>To: "Lucene List" <[EMAIL PROTECTED]>
>Sent: Monday, October 06, 2003 6:49 PM
>Subject: SnowballAnalyzer
>
>>At one point, I believe, it was proposed to bring the sandbox
>>SnowballAnalyzer into the core. Is this still desired or shall we just
>>leave it in the sandbox?
>>
>>Erik
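
For anyone following the thread, here is a rough Python sketch of the kind of suffix stripping Pete describes, and of why it produces "buse" and "manag" while leaving "caught" untouched. This is NOT the Porter or Snowball algorithm: the rule list, the MIN_STEM guard, and the tiny LEMMAS table are all made up for illustration, chosen only so that the six examples quoted above come out the same way. The dictionary lookup at the end just shows, under the same assumptions, how a lemmatiser-style table would treat the irregular forms.

```python
# Toy suffix stripper. Hypothetical rules, NOT Porter/Snowball.
# Each rule is (suffix, replacement); the first rule that matches and
# leaves a long enough stem wins.
RULES = [
    ("ement", ""),   # management -> manag
    ("sses", "ss"),  # classes -> class
    ("ies", "i"),    # ponies -> poni
    ("ss", "ss"),    # keep a final double s (class -> class)
    ("s", ""),       # buses -> buse
    ("e", ""),       # manage -> manag
]
MIN_STEM = 3  # assumed guard so that "bus" is not cut down to "bu"

# Hypothetical exception dictionary, standing in for a real lexicon.
LEMMAS = {"caught": "catch", "buses": "bus"}


def toy_stem(word):
    """Strip the first matching suffix; purely character-based."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= MIN_STEM:
                return stem
    return word


def lemmatise(word):
    """Dictionary lookup; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


if __name__ == "__main__":
    for w in ["bus", "buses", "catch", "caught", "manage", "management"]:
        print(w, "->", toy_stem(w), "| lemma:", lemmatise(w))
```

The stripper never looks at anything but trailing characters, so "caught" cannot reach "catch", and "manage" and "management" collapse to the same string even though one is a verb and the other a noun. The lookup table gets the irregulars right but only for words it already contains.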
