Re: [lingu-dev] thesaurus help

nemeth Fri, 17 Nov 2006 19:16:08 -0800

Hi Kevin,

Quoting Kevin Scannell <[EMAIL PROTECTED]>:


>
> Hi everyone,
>
> Is there any support for morphology in the existing
> thesaurus code?  If I select the word "cats" in a text,
> will it give me synonyms for "cat"?

No, unfortunately not.
>
> If not, are there plans to develop something like
> this?  It seems like it might be something that
> could be achieved with calls to hunspell, at least
> for those languages which have some morphology
> encoded in an affix file.

There is a plan. See http://qa.openoffice.org/issues/show_bug.cgi?id=19563

>
> An alternate approach would be to include cross-references
> in my thesaurus from inflected words to the root form,
> but this would be a big waste of memory for Irish.  Many
> of the variants that need to be handled involve
> only small orthographic changes (e.g. woman = "bean"
> in the dictionary, but occasionally written "bhean" or "mbean").
> When you multiply these by the "real" morphological
> variants, there are a _lot_ of forms to be added.
>
> Any advice appreciated.

It would be good to add a lightweight stemmer to the
thesaurus code, combining rule-based stemming with filtering
of the candidate stems against the dictionary (the thesaurus data).

For instance, the following code (without prefix support and
exception data) works well on English words:

    import re

    # suffix replacement table: [ pattern, replacement ]
    # every matching rule contributes one candidate stem
    suffix = [
        [ '(ive|ion|ions|ing|ings|est|er|ers)$', 'e' ],
        [ '(ive|en|th|ly|ing|d|est|er|s)$', '' ],
        [ '(ens|ness|y)$', '' ],
        [ '(ings|ed|ers|es)$', '' ],
        [ r'(?<=(.))\1(ing|ings|ed|y)$', '' ], # running -> run
        [ 'ck(ing|ings|ed)$', 'c' ],
        [ '(ication|ied|iest|ier|iness)$', 'y' ],
        [ '(ications|ies|iers|ieth)$', 'y' ],
        [ 'ves$', 'f' ],
        [ 'ves$', 'fe' ]]

    # return the possible stems of the given word
    # (shown as a plain function so the snippet runs on its own)
    def get_stems(word):
        stems = [ word ]  # the word itself is always a candidate
        for pattern, replacement in suffix:
            r = re.compile(pattern)
            if r.search(word):
                stems = stems + [ r.sub(replacement, word) ]
        return stems

   # filtering by dictionary
   ...
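
The dictionary filtering step left out above could look roughly like
the sketch below. It is only an illustration under assumptions: the
rule table is an abbreviated subset of the one above, and the names
filter_stems and THESAURUS, as well as the sample entries, are
hypothetical and not taken from the thesaurus code.

```python
import re

# Abbreviated suffix table for illustration (a subset of the rules above).
SUFFIX_RULES = [
    (r'(ings|ed|ers|es)$', ''),
    (r'(ive|en|th|ly|ing|d|est|er|s)$', ''),
    (r'(ications|ies|iers|ieth)$', 'y'),
    (r'ves$', 'f'),
    (r'ves$', 'fe'),
]

def get_stems(word):
    """Return the word itself plus every candidate stem the rules produce."""
    stems = [word]
    for pattern, replacement in SUFFIX_RULES:
        if re.search(pattern, word):
            stems.append(re.sub(pattern, replacement, word))
    return stems

def filter_stems(word, thesaurus):
    """Keep only the candidates that are headwords in the thesaurus,
    preserving order and dropping duplicates."""
    kept = []
    for stem in get_stems(word):
        if stem in thesaurus and stem not in kept:
            kept.append(stem)
    return kept

# Hypothetical thesaurus data: headword -> synonyms.
THESAURUS = {
    'cat': ['feline'],
    'knife': ['blade'],
    'pony': ['small horse'],
}

for w in ('cats', 'knives', 'ponies', 'dog'):
    print(w, '->', filter_stems(w, THESAURUS))
```

The rules deliberately overgenerate ("knives" also yields "kniv",
"knive" and "knif"); the lookup against the thesaurus headwords is
what discards the wrong candidates.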

Using Hunspell would be better for morphologically complex
languages and for suffixation of the suggested synonyms. Or we need both:
Hunspell, and a probabilistic stemmer with Metaphone-like operations
supporting many Irish orthographies at the same time.
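
For the small orthographic variants Kevin mentions ("bhean", "mbean"),
even a plain rule table might go a long way before anything
probabilistic is needed. The sketch below is a simple rule-based
stand-in, not a probabilistic or Metaphone-like stemmer. The rule list
is a partial, assumed set of Irish initial mutations (lenition and
eclipsis only), and demutate and HEADWORDS are hypothetical names used
for illustration.

```python
import re

# Each rule undoes one initial mutation. Rules are tried independently,
# so one word can yield several candidates (assumed, incomplete list).
DEMUTATION_RULES = [
    (r'^([bcdfgmpst])h', r'\1'),  # lenition: bhean -> bean
    (r'^mb', 'b'),                # eclipsis: mbean -> bean
    (r'^gc', 'c'),
    (r'^nd', 'd'),
    (r'^bhf', 'f'),
    (r'^ng', 'g'),
    (r'^bp', 'p'),
    (r'^dt', 't'),
]

def demutate(word, headwords):
    """Return the candidates (the word itself and its de-mutated forms)
    that are present in the given set of headwords."""
    candidates = [word]
    for pattern, replacement in DEMUTATION_RULES:
        if re.search(pattern, word):
            candidates.append(re.sub(pattern, replacement, word))
    return [c for c in candidates if c in headwords]

# Hypothetical headword set standing in for the thesaurus index.
HEADWORDS = {'bean', 'fear'}

for w in ('bean', 'bhean', 'mbean', 'bhfear'):
    print(w, '->', demutate(w, HEADWORDS))
```

As with the English suffix rules, the table overgenerates ("bhfear"
also produces "bfear"), so the headword lookup does the real
disambiguation and no inflected forms have to be stored.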

Best regards,

Laci



>
> -Kevin





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
