[jira] [Created] (LUCENE-5778) Support hunspell morphological description fields

Robert Muir (JIRA) Thu, 19 Jun 2014 08:40:42 -0700

Robert Muir created LUCENE-5778:
-----------------------------------

             Summary: Support hunspell morphological description fields
                 Key: LUCENE-5778
                 URL: https://issues.apache.org/jira/browse/LUCENE-5778
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Robert Muir
         Attachments: LUCENE-5778.patch


Currently the hunspell stemmer doesn't support these fields (in particular the 
st:XYZ field, which basically signifies a stemming "exception").

For example, in English "feet" might have "st:foot".

These can be encoded two ways: inline in the .dic, or aliased via AM entries 
from the .aff.
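
As a rough illustration of the inline form only (AM aliases are not handled), 
here is a minimal sketch that pulls the "st:" value out of a .dic line. The 
class and method names are hypothetical and this is not the parser from the 
patch. It assumes a morphological field is a two-letter tag, a colon and a 
value, separated from the entry by whitespace, so that words containing spaces 
stay intact.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Lucene.
class DicLineSketch {
    // Assumption: morphological data starts at the first whitespace that is
    // followed by a two-letter tag and a colon, e.g. " st:foot".
    private static final Pattern MORPH_START = Pattern.compile("\\s+[a-z]{2}:\\S");

    /** Returns the value of the inline "st:" field, or null if the line has none. */
    public static String stemException(String line) {
        Matcher m = MORPH_START.matcher(line);
        if (!m.find()) {
            return null;
        }
        for (String field : line.substring(m.start()).trim().split("\\s+")) {
            if (field.startsWith("st:")) {
                return field.substring(3);
            }
        }
        return null;
    }

    /** Returns the entry (word plus optional /flags) that precedes the morphological data. */
    public static String entry(String line) {
        Matcher m = MORPH_START.matcher(line);
        return m.find() ? line.substring(0, m.start()) : line;
    }
}
```

With this sketch, the line "feet st:foot" has the entry "feet" and the stem 
exception "foot", while "ice cream/S po:noun" keeps the entry "ice cream/S" and 
has no stem exception.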

Unfortunately, our parsing was really lenient and in order to do this properly 
(e.g. handling words with spaces and morphological fields containing slashes 
and all that jazz), it had to be cleaned up a bit to follow the hunspell rules.

For now, we don't waste space on part of speech and only concern ourselves 
with the "st:" field, which the stemmer uses transparently.

Encoding these exceptions is a little complicated because they are rarely 
used, but when they are, they are typically common verbs and stuff (like 
English 'be'), so we don't want it to be slow. 
They are also not "per-word" but "per-form", so you could have homonyms with 
different stems (at least theoretically). 
On the other hand this is silly stuff particular to these silly languages, so 
we don't want it to blow up the data structure for the 99% of languages that 
don't use it.

So the way we do it is to just store the exception ID alongside the form ID 
(this doubles the length of the IntsRef, which is usually 1). So for e.g. 
English I think it typically boils down to an extra byte or so in the FST and 
doesn't blow up. For languages not using this stuff there is no impact.
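
To make the pairing concrete, here is a minimal sketch of the idea, not the 
code from the patch. A plain HashMap stands in for the FST, an int[] stands in 
for the IntsRef, and the class name, the method names and the convention that 
exception ID 0 means "no exception" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of Lucene.
class StemExceptionSketch {
    // Stand-in for the FST: word -> flat array of (formId, exceptionId) pairs.
    private final Map<String, int[]> words = new HashMap<>();
    // Exception table. ID 0 is reserved for "no exception", so ID n is at index n - 1.
    private final List<String> exceptions = new ArrayList<>();

    /** Adds one form of a word; stemException may be null when there is no st: field. */
    public void add(String word, int formId, String stemException) {
        int exceptionId = 0;
        if (stemException != null) {
            int idx = exceptions.indexOf(stemException);
            if (idx < 0) {
                exceptions.add(stemException);
                idx = exceptions.size() - 1;
            }
            exceptionId = idx + 1;
        }
        int[] old = words.getOrDefault(word, new int[0]);
        int[] grown = Arrays.copyOf(old, old.length + 2);
        grown[old.length] = formId;          // the form ID, as before
        grown[old.length + 1] = exceptionId; // the extra int stored alongside it
        words.put(word, grown);
    }

    /** Returns one stem per form: the exception if present, otherwise the word itself. */
    public List<String> stems(String word) {
        List<String> out = new ArrayList<>();
        int[] pairs = words.get(word);
        if (pairs == null) {
            return out;
        }
        for (int i = 0; i < pairs.length; i += 2) {
            int exceptionId = pairs[i + 1];
            out.add(exceptionId == 0 ? word : exceptions.get(exceptionId - 1));
        }
        return out;
    }
}
```

Because the exception ID travels with each form rather than with the word, two 
forms of the same surface word can carry different stems, which is the 
"per-form" behaviour described above. A real stemmer would go on to apply affix 
stripping to forms without an exception; the sketch simply returns the word.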




--
This message was sent by Atlassian JIRA
(v6.2#6252)
