[ https://issues.apache.org/jira/browse/LUCENE-5778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039807#comment-14039807 ]
ASF subversion and git services commented on LUCENE-5778:
---------------------------------------------------------
Commit 1604354 from [~rcmuir] in branch 'dev/trunk'
[ https://svn.apache.org/r1604354 ]
LUCENE-5778: support hunspell morphological description fields
> Support hunspell morphological description fields
> -------------------------------------------------
>
> Key: LUCENE-5778
> URL: https://issues.apache.org/jira/browse/LUCENE-5778
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Robert Muir
> Fix For: 5.0, 4.10
>
> Attachments: LUCENE-5778.patch
>
>
> Currently the hunspell stemmer doesn't support these (particularly the st:XYZ
> field, which basically signifies a stemming "exception").
> For example, in English "feet" might have "st:foot".
> These can be encoded two ways: inline in the .dic, or aliased via AM entries
> from the .aff.
> Unfortunately, our parsing was really lenient, and in order to do this
> properly (e.g. handling words with spaces and morphological fields containing
> slashes and all that jazz), it had to be cleaned up a bit to follow the
> hunspell rules.
> For now, we don't waste space on part of speech and only concern ourselves
> with the "st:" field, and the stemmer uses it transparently.
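
To make the parsing issue above concrete, here is a minimal sketch of pulling the "st:" field out of a .dic line. It is not the code from the patch: the class and method names are hypothetical, and the boundary rule (a tab, or a space followed by a two-letter field ID and a colon) is a simplified assumption about the hunspell rules. A purely numeric morphological field is treated as a 1-based index into the AM alias list.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class StemExceptionSketch {
  /** Index where morphological data starts, or line.length() if there is none. */
  static int morphBoundary(String line) {
    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      if (c == '\t') {
        return i;
      }
      // Assumption: a space only starts morphological data when it is
      // followed by "xx:", so that words containing spaces stay intact.
      if (c == ' ' && i + 3 < line.length() && line.charAt(i + 3) == ':') {
        return i;
      }
    }
    return line.length();
  }

  /** The word (and any /flags) before the morphological data. */
  static String entryPart(String line) {
    return line.substring(0, morphBoundary(line));
  }

  /** The value of the "st:" field, or null if the line has none. */
  static String stemException(String line, List<String> amAliases) {
    int boundary = morphBoundary(line);
    if (boundary == line.length()) {
      return null;
    }
    String morph = line.substring(boundary + 1).trim();
    // Assumption: an all-digit field is a 1-based AM alias from the .aff.
    if (!morph.isEmpty() && morph.chars().allMatch(Character::isDigit)) {
      morph = amAliases.get(Integer.parseInt(morph) - 1);
    }
    for (String field : morph.split("\\s+")) {
      if (field.startsWith("st:")) {
        return field.substring(3);
      }
    }
    return null;
  }

  public static void main(String[] args) {
    List<String> aliases = Arrays.asList("po:noun st:foot");
    System.out.println(stemException("feet st:foot", Collections.<String>emptyList()));
    System.out.println(stemException("feet\t1", aliases));
    System.out.println(entryPart("ice cream/S"));
  }
}
```

Splitting only on whitespace inside the morphological data is what lets a field value contain a slash without being mistaken for a flag separator.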
> Encoding these exceptions is a little complicated because they are rarely
> used, but when they are, it is typically for common verbs and stuff
> (like English 'be'), so we don't want it to be slow.
> They are also not "per-word" but "per-form", so you could have homonyms with
> different stems (at least theoretically).
> On the other hand, this is silly stuff particular to these silly languages, so
> we don't want it to blow up the data structure for the 99% of languages that
> don't use it.
> So the way we do it is to just store the exception ID alongside the form ID
> (this doubles the length of the IntsRef, which is usually 1). So for e.g.
> English I think it typically boils down to an extra byte or so in the FST and
> doesn't blow up. For languages not using this stuff there is no impact.
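
The storage scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual Lucene data structure: plain int arrays stand in for the IntsRef stored in the FST, the names are hypothetical, and 0 is assumed to mean "no exception". Each form takes two ints (form ID, exception ID) only when the dictionary uses stem exceptions at all; otherwise it stays at one int per form.

```java
public class FormExceptionPacking {
  /** Ints stored per form: 2 if the dictionary has stem exceptions, else 1. */
  static int stride(boolean hasExceptions) {
    return hasExceptions ? 2 : 1;
  }

  /** Interleaves form IDs with exception IDs (assumption: 0 = no exception). */
  static int[] pack(int[] formIds, int[] exceptionIds, boolean hasExceptions) {
    if (!hasExceptions) {
      return formIds.clone(); // no extra cost for most languages
    }
    int[] packed = new int[formIds.length * 2];
    for (int i = 0; i < formIds.length; i++) {
      packed[2 * i] = formIds[i];
      packed[2 * i + 1] = exceptionIds[i];
    }
    return packed;
  }

  static int formCount(int[] packed, boolean hasExceptions) {
    return packed.length / stride(hasExceptions);
  }

  static int formId(int[] packed, int form, boolean hasExceptions) {
    return packed[form * stride(hasExceptions)];
  }

  static int exceptionId(int[] packed, int form, boolean hasExceptions) {
    return hasExceptions ? packed[form * 2 + 1] : 0;
  }

  public static void main(String[] args) {
    // Two homonym forms of one word, each with its own stem exception.
    int[] packed = pack(new int[] {12, 15}, new int[] {3, 4}, true);
    for (int i = 0; i < formCount(packed, true); i++) {
      System.out.println(formId(packed, i, true) + " -> " + exceptionId(packed, i, true));
    }
  }
}
```

Because the exception ID sits next to each form rather than next to the word, two homonyms can carry different stems, and a word with a single form grows from one int to two.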