[jira] [Commented] (LUCENE-4542) Make RECURSION_CAP in HunspellStemmer configurable

Lukas Vlcek (JIRA) Wed, 18 Sep 2013 05:31:55 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13770721#comment-13770721
 ]


Lukas Vlcek commented on LUCENE-4542:
-------------------------------------

IIRC, the Hunspell stemmer basically works as follows:

1. Assuming the input token is not the root form of the word, it scans the
affix rules (.aff file) and tries to identify the rules that could have been
used to produce the input token.
2. It applies each rule found to the input token to get one or more output
tokens. These output tokens can be considered candidates for the root form of
the word.
3. If any of the candidates is found in the dictionary (.dic file) and the
application of that particular rule is allowed (see the regexp pattern in the
.aff file), then bingo! If not, go back to #1 with the candidates as input,
until the RECURSION_CAP level is reached.

This way you can have `nongoodnesses` stemmed to `good` (provided
RECURSION_CAP=2). Depending on the dictionary and affix rules, you may need one
pass to get from `nongoodnesses` to `goodnesses`, and then two more passes to
get from `goodnesses` to `goodness` and from `goodness` to `good`.
(Probably not the best example.)
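
To make the passes concrete, here is a minimal Java sketch of the idea. It is a
toy model, not Lucene's actual HunspellStemmer: the `ToyAffixStemmer` class, its
`Rule` type, the three rules and the one-word dictionary are all made up for
illustration, and the way the cap is counted (depth starts at 0, recursion
continues while depth < cap) is an assumption.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Toy model of the three steps described above; NOT Lucene's HunspellStemmer.
final class ToyAffixStemmer {

    // Hypothetical stand-in for one .aff rule: strip `affix` from the token,
    // but only if the resulting candidate matches the `condition` regexp.
    static final class Rule {
        final boolean prefix;
        final String affix;
        final Pattern condition;

        Rule(boolean prefix, String affix, String condition) {
            this.prefix = prefix;
            this.affix = affix;
            this.condition = Pattern.compile(condition);
        }

        // Returns the candidate root, or null if the rule does not apply.
        String strip(String token) {
            if (token.length() <= affix.length()) return null;
            boolean matches = prefix ? token.startsWith(affix) : token.endsWith(affix);
            if (!matches) return null;
            String candidate = prefix
                    ? token.substring(affix.length())
                    : token.substring(0, token.length() - affix.length());
            return condition.matcher(candidate).matches() ? candidate : null;
        }
    }

    // Made-up rules and dictionary, chosen only to reproduce the example.
    static final List<Rule> RULES = List.of(
            new Rule(true, "non", ".*"),
            new Rule(false, "es", ".*s"),   // "-es" only after a stem ending in "s"
            new Rule(false, "ness", ".*"));
    static final Set<String> DICTIONARY = Set.of("good");

    // Collects every dictionary word reachable by stripping affixes.
    static Set<String> stem(String token, int recursionCap) {
        Set<String> stems = new LinkedHashSet<>();
        if (DICTIONARY.contains(token)) stems.add(token);
        strip(token, 0, recursionCap, stems);
        return stems;
    }

    private static void strip(String token, int depth, int recursionCap, Set<String> stems) {
        for (Rule rule : RULES) {
            String candidate = rule.strip(token);                        // steps 1 and 2
            if (candidate == null) continue;
            if (DICTIONARY.contains(candidate)) stems.add(candidate);    // step 3
            if (depth < recursionCap) strip(candidate, depth + 1, recursionCap, stems);
        }
    }

    public static void main(String[] args) {
        for (int cap = 0; cap <= 2; cap++) {
            System.out.println("cap=" + cap + " -> " + stem("nongoodnesses", cap));
        }
    }
}
```

In this toy, each extra level of recursion allows one more affix to be removed:
`nongood` reaches `good` with a cap of 0, `nongoodness` needs a cap of 1, and
`nongoodnesses` needs a cap of 2. Every level also multiplies the number of
candidates to try, which is where the cost of a higher cap comes from.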

However, all of this depends heavily on the particular dictionary and affix
rules.

For example, I realized that the Czech (ispell) and Slovak (hunspell)
dictionaries are constructed in a different way (though still in a way that
feels natural to the language itself), and a single pass works best for them
(although a single pass does not allow handling both a prefix AND a suffix at
the same time).

In my opinion, there is a lot that could be improved in the Hunspell token
filter, but it is more a linguistic matter than an algorithmic one.
                
> Make RECURSION_CAP in HunspellStemmer configurable
> --------------------------------------------------
>
>                 Key: LUCENE-4542
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4542
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.0
>            Reporter: Piotr
>            Assignee: Steve Rowe
>             Fix For: 5.0, 4.4
>
>         Attachments: Lucene-4542-javadoc.patch, LUCENE-4542.patch, 
> LUCENE-4542-with-solr.patch
>
>
> Currently there is
> private static final int RECURSION_CAP = 2;
> in the code of the class HunspellStemmer. It makes using hunspell with
> several dictionaries almost unusable, due to bad performance (e.g. it costs
> 36 ms to stem a long sentence in Latvian with recursion_cap=2 and 5 ms with
> recursion_cap=1). It would be nice to be able to tune this number as needed.
> AFAIK this number (2) was chosen arbitrarily.
> (This is the first issue of my life, so please forgive any mistakes.)
