[ 
https://issues.apache.org/jira/browse/LUCENE-5778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039807#comment-14039807
 ] 

ASF subversion and git services commented on LUCENE-5778:
---------------------------------------------------------

Commit 1604354 from [~rcmuir] in branch 'dev/trunk'
[ https://svn.apache.org/r1604354 ]

LUCENE-5778: support hunspell morphological description fields

> Support hunspell morphological description fields
> -------------------------------------------------
>
>                 Key: LUCENE-5778
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5778
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 5.0, 4.10
>
>         Attachments: LUCENE-5778.patch
>
>
> Currently the hunspell stemmer doesn't support these (particularly the 
> st:XYZ field, which basically signifies a stemming "exception").
> For example, in English "feet" might have "st:foot".
> These can be encoded two ways, inline into the .dic or aliased via AM entries 
> from the .aff.
> Unfortunately, our parsing was really lenient and in order to do this 
> properly (e.g. handling words with spaces and morphological fields containing 
> slashes and all that jazz), it had to be cleaned up a bit to follow the 
> hunspell rules.
> For now, we don't waste space on part of speech and only concern ourselves 
> with the "st:" field, and the stemmer uses it transparently. 
> Encoding these exceptions is a little complicated because they are rarely 
> used, but when they are, they typically apply to common verbs and the like 
> (such as English 'be'), so we don't want it to be slow. 
> They are also not "per-word" but "per-form", so you could have homonyms with 
> different stems (at least theoretically). 
> On the other hand, this is silly stuff particular to these silly languages, 
> so we don't want it to blow up the data structure for the 99% of languages 
> that don't use it.
> So the way we do it is to just store the exception ID alongside the form ID 
> (this doubles the length of the IntsRef, which is usually 1). So for e.g. 
> English I think it typically boils down to an extra byte or so in the FST 
> and doesn't blow up. For languages not using this stuff there is no impact.
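
As a rough illustration of the encoding described above, the following Java sketch stores an exception ID next to each form ID in a flat int array. It is not the code from the commit: the class StemExceptionSketch, its method names, the use of ID 0 to mean "no exception", and the whitespace-only parsing of the .dic line are all assumptions made for the example. It handles only inline "st:" fields, not AM aliases from the .aff, and it ignores the harder cases mentioned above (words with spaces, fields containing slashes).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only; not Lucene's Dictionary or Stemmer code. */
final class StemExceptionSketch {
  // Exception stems, indexed by (exception ID - 1).
  // Assumption for this sketch: ID 0 means "this form has no exception".
  private final List<String> exceptions = new ArrayList<>();

  /**
   * Returns the value of the first inline "st:" field on a .dic line, or null.
   * Simplification: fields are assumed to be separated by whitespace and the
   * first token is assumed to be the word (plus optional /flags).
   */
  static String parseStemException(String dicLine) {
    String[] parts = dicLine.trim().split("\\s+");
    for (int i = 1; i < parts.length; i++) {
      if (parts[i].startsWith("st:")) {
        return parts[i].substring(3);
      }
    }
    return null;
  }

  /** Registers a stem exception and returns its ID (0 if there is none). */
  int exceptionId(String stem) {
    if (stem == null) {
      return 0;
    }
    exceptions.add(stem);
    return exceptions.size();
  }

  /**
   * Appends one (form ID, exception ID) pair to the per-word int array.
   * A word with a single form therefore occupies two ints instead of one.
   */
  int[] addForm(int[] forms, int formId, String stemException) {
    int[] out = Arrays.copyOf(forms, forms.length + 2);
    out[forms.length] = formId;
    out[forms.length + 1] = exceptionId(stemException);
    return out;
  }

  /** Returns the exception stem for the n-th form of a word, or null. */
  String exceptionFor(int[] forms, int formIndex) {
    int id = forms[2 * formIndex + 1];
    return id == 0 ? null : exceptions.get(id - 1);
  }
}
```

Because the exception is attached to each form rather than to the word, two homonyms of the same surface word can carry different stems, and a dictionary that never uses "st:" could skip the second slot entirely.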



--
This message was sent by Atlassian JIRA
(v6.2#6252)
