[
https://issues.apache.org/jira/browse/LUCENE-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915222#comment-13915222
]
Lukas Vlcek commented on LUCENE-5468:
-------------------------------------
Robert,
I did not check the latest code so please forgive my ignorance but let me try
to explain:
recursionCap does not distinguish between how many prefix and how many suffix
rules were applied, does it? It counts just the total. If I set recursionCap to
1 it actually allows all of the following combinations:
- 2 prefix rules, 0 suffix rules
- 1 prefix rule, 1 suffix rule
- 0 prefix rules, 2 suffix rules
This may not play well with some affix rule dictionaries. For example, the
Czech dictionary is constructed in such a way that only one suffix rule can be
applied; otherwise the filter can generate irrelevant tokens. So recursionCap
MUST be set to 0.
However, if this single recursion level is consumed by removing a prefix, the
filter cannot continue and manipulate the suffix as well. So what I am
proposing is an option to set recursionCap separately for prefixes and
suffixes. In the case of the Czech dictionary I would say: you can apply only
one prefix rule and only one suffix rule (meaning you can NEVER apply two
prefix rules or two suffix rules).
As for ignoreCase - how does it work if the dictionary contains terms like "Xx"
and "xx" and each is allowed to use a different set of rules? I need to
distinguish between them. But at the same time, if the dictionary contains only
"yy" and I get "Yy" as input (because it was the first word of the sentence),
would it be able to process it correctly and still distinguish between "Xx" and
"xx"?
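In other words, the behavior I would hope for is something like the following
sketch (again, not Lucene's code - just the lookup policy I am describing):
exact-case match first, lowercase fallback only when the exact form is absent:

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative only: "Xx" and "xx" keep their distinct rule sets, while a
// sentence-initial "Yy" still falls back to the dictionary entry "yy".
public class CaseLookup {
    public static Set<String> lookupRules(Map<String, Set<String>> dict,
                                          String term) {
        Set<String> rules = dict.get(term); // exact surface form wins
        if (rules == null) {
            // fallback: lowercase the term only if no exact entry exists
            rules = dict.get(term.toLowerCase(Locale.ROOT));
        }
        return rules;
    }
}
```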
As for the last feature, I probably confused you. What I am looking for is not
the output of all possible root words for a given term, but all possible
inflections for a given (root) word. For example: the input is "tell" and,
based on the loaded dictionary, the output would be ["tell", "tells",
"telling", ...]. I think it would not be hard to expose such an API, and I
believe users would appreciate it when constructing custom dictionaries. (I
tried that and was missing such a feature; sure, I can implement it myself, but
I believe having it in Solr and Elasticsearch would be great. This is
definitely not useful for the indexing process, but as part of tuning your
dictionary it would be helpful.)
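To illustrate the direction (root -> inflections rather than term -> roots),
here is a toy sketch. Lucene's Hunspell code has no such public API today; the
class name is hypothetical, and real Hunspell suffix rules also carry strip
and condition fields that are omitted here for brevity:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical "expansion" API: given a root and a simplified suffix-rule
// table (append-only rules), produce all inflected forms.
public class InflectionExpander {
    public static List<String> expand(String root, List<String> suffixes) {
        List<String> forms = new ArrayList<>();
        forms.add(root);            // the root itself is a valid form
        for (String s : suffixes) {
            forms.add(root + s);    // apply each suffix rule once
        }
        return forms;
    }
}
```

For example, expand("tell", List.of("s", "ing")) yields ["tell", "tells",
"telling"], which is exactly the kind of output that helps when tuning a
dictionary.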
> Hunspell very high memory use when loading dictionary
> -----------------------------------------------------
>
> Key: LUCENE-5468
> URL: https://issues.apache.org/jira/browse/LUCENE-5468
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 3.5
> Reporter: Maciej Lisiewski
> Priority: Minor
> Fix For: 4.8, 5.0
>
> Attachments: LUCENE-5468.patch, patch.txt
>
>
> Hunspell stemmer requires gigantic (for the task) amounts of memory to load
> dictionary/rules files.
> For example, loading a 4.5 MB Polish dictionary (with an empty index!) will
> cause the whole core to crash with various out-of-memory errors unless you
> set the max heap size close to 2GB or more.
> By comparison, Stempel using the same dictionary file works just fine with
> 1/8 of that (and possibly lower values as well).
> Sample error log entries:
> http://pastebin.com/fSrdd5W1
> http://pastebin.com/Lmi0re7Z