Hi Kevin,

Quoting Kevin Scannell <[EMAIL PROTECTED]>:
> Hi everyone,
>
> Is there any support for morphology in the existing
> thesaurus code? If I select the word "cats" in a text,
> will it give me synonyms for "cat"?

No, unfortunately not.

> If not, are there plans to develop something like
> this? It seems like it might be something that
> could be achieved with calls to hunspell, at least
> for those languages which have some morphology
> encoded in an affix file.

There is a plan. See
http://qa.openoffice.org/issues/show_bug.cgi?id=19563

> An alternate approach would be to include cross-references
> in my thesaurus from inflected words to the root form,
> but this would be a big waste of memory for Irish. Many
> of the variants that need to be handled involve
> only small orthographic changes (e.g. woman = "bean"
> in the dictionary, but occasionally written "bhean" or "mbean").
> When you multiply these by the "real" morphological
> variants, there are a _lot_ of forms to be added.
>
> Any advice appreciated.

It would be fine to add a lightweight stemmer to the thesaurus code,
combining rule-based stemming with filtering of the suggestions by the
dictionary (the thesaurus data). For instance, the following code
(without prefix support and exception data) works well on English words:

import re

# suffix replacement table: [ pattern, replacement ]
suffix = [
    [ r'(ive|ion|ions|ing|ings|est|er|ers)$', 'e' ],
    [ r'(ive|en|th|ly|ing|d|est|er|s)$', '' ],
    [ r'(ens|ness|y)$', '' ],
    [ r'(ings|ed|ers|es)$', '' ],
    [ r'(?<=(.))\1(ing|ings|ed|y)$', '' ],  # running -> run
    [ r'ck(ing|ings|ed)$', 'c' ],
    [ r'(ication|ied|iest|ier|iness)$', 'y' ],
    [ r'(ications|ies|iers|ieth)$', 'y' ],
    [ r'ves$', 'f' ],
    [ r'ves$', 'fe' ]]

# return the possible stems of the given word (the word itself comes first)
def get_stems(word):
    stems = [ word ]
    for pattern, replacement in suffix:
        r = re.compile(pattern)
        # every matching rule contributes one candidate stem
        if r.search(word):
            stems.append(r.sub(replacement, word))
    return stems

# filtering by dictionary
...

Using Hunspell would be better for morphologically complex languages
and for suffixation of the suggested synonyms.
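The dictionary filtering step is left out of the code above. A minimal sketch of what it might look like follows, assuming the thesaurus is available as a plain dict from headwords to synonym lists. The function name lookup_with_stems and the toy data are hypothetical, made up only for illustration, and are not part of the existing thesaurus code.

```python
def lookup_with_stems(candidates, thesaurus):
    """Return (headword, synonyms) pairs for candidates found in the thesaurus.

    candidates: possible stems, e.g. the list returned by get_stems()
    thesaurus:  dict mapping headwords to synonym lists (assumed layout)
    """
    results = []
    seen = set()
    for stem in candidates:
        # keep only candidates that are real headwords, each at most once
        if stem in thesaurus and stem not in seen:
            seen.add(stem)
            results.append((stem, thesaurus[stem]))
    return results


# Hypothetical toy data, only for illustration.
toy_thesaurus = {
    "cat": ["feline", "kitty"],
    "wife": ["spouse"],
}

# Candidate lists as the suffix table would produce them for "cats" and "wives".
print(lookup_with_stems(["cats", "cat"], toy_thesaurus))
print(lookup_with_stems(["wives", "wive", "wiv", "wif", "wife"], toy_thesaurus))
```

Because the rules overgenerate on purpose ("wive", "wiv", "wif"), the lookup is what throws away the wrong candidates, so the rule table itself can stay small.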
Or we need both: Hunspell, and a probabilistic stemmer with
Metaphone-like operations that supports the many Irish orthographies
at the same time.

Best regards,
Laci

> -Kevin
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
