[
https://issues.apache.org/jira/browse/LUCENE-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915222#comment-13915222
]
Lukas Vlcek commented on LUCENE-5468:
-------------------------------------
Robert,
I did not check the latest code so please forgive my ignorance but let me try
to explain:
recursionCap does not distinguish between how many prefix and how many suffix
rules were applied, does it? It counts just the total. If I set recursionCap to
1 it actually allows all of the following combinations:
- 2 prefix rules, 0 suffix rules
- 1 prefix rule, 1 suffix rule
- 0 prefix rules, 2 suffix rules
This may not play well with some affix rule dictionaries. For example, the
Czech dictionary is constructed in such a way that only one suffix rule can be
applied; otherwise the filter can generate irrelevant tokens. So recursionCap
MUST be set to 0.
However, if this single recursion level is consumed by removing a prefix, the
filter cannot continue and manipulate the suffix as well. So what I am
proposing is an option to set recursionCap separately for prefixes and
suffixes. In the case of the Czech dictionary I would say: you can apply only
one prefix rule and only one suffix rule (meaning you can NEVER apply two
prefix rules or two suffix rules).
As for ignoreCase - how does it work if the dictionary contains terms like "Xx"
and "xx" and each is allowed to use a different set of rules? I need to
distinguish between them. But at the same time, if the dictionary contains only
"yy" and I get "Yy" as input (because it was the first word of the sentence),
would it be able to process it correctly and still distinguish between "Xx" and
"xx"?
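In other words, the behavior I would hope for is something like the following
sketch (again, not Lucene's code - just the lookup policy I am describing):
exact-case match first, lowercase fallback only when the exact form is absent:

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative only: "Xx" and "xx" keep their distinct rule sets, while a
// sentence-initial "Yy" still falls back to the dictionary entry "yy".
public class CaseLookup {
    public static Set<String> lookupRules(Map<String, Set<String>> dict,
                                          String term) {
        Set<String> rules = dict.get(term); // exact surface form wins
        if (rules == null) {
            // fallback: lowercase the term only if no exact entry exists
            rules = dict.get(term.toLowerCase(Locale.ROOT));
        }
        return rules;
    }
}
```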
As for the last feature, I probably confused you. What I am looking for is not
the output of all possible root words for a given term, but all possible
inflections for a given (root) word. For example: the input is "tell" and,
based on the loaded dictionary, the output would be ["tell", "tells",
"telling", ...]. I think it would not be hard to expose such an API, and I
believe users would appreciate it when constructing custom dictionaries. (I
tried that and was missing such a feature; sure, I can implement it myself, but
I believe having it in Solr and Elasticsearch would be great. This is
definitely not useful for the indexing process, but as part of tuning your
dictionary it would be helpful.)
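To illustrate the direction (root -> inflections rather than term -> roots),
here is a toy sketch. Lucene's Hunspell code has no such public API today; the
class name is hypothetical, and real Hunspell suffix rules also carry strip
and condition fields that are omitted here for brevity:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical "expansion" API: given a root and a simplified suffix-rule
// table (append-only rules), produce all inflected forms.
public class InflectionExpander {
    public static List<String> expand(String root, List<String> suffixes) {
        List<String> forms = new ArrayList<>();
        forms.add(root);            // the root itself is a valid form
        for (String s : suffixes) {
            forms.add(root + s);    // apply each suffix rule once
        }
        return forms;
    }
}
```

For example, expand("tell", List.of("s", "ing")) yields ["tell", "tells",
"telling"], which is exactly the kind of output that helps when tuning a
dictionary.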
> Hunspell very high memory use when loading dictionary
> -----------------------------------------------------
>
> Key: LUCENE-5468
> URL: https://issues.apache.org/jira/browse/LUCENE-5468
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 3.5
> Reporter: Maciej Lisiewski
> Priority: Minor
> Fix For: 4.8, 5.0
>
> Attachments: LUCENE-5468.patch, patch.txt
>
>
> Hunspell stemmer requires gigantic (for the task) amounts of memory to load
> dictionary/rules files.
> For example, loading a 4.5 MB Polish dictionary (with an empty index!) will
> cause the whole core to crash with various out-of-memory errors unless you
> set the max heap size close to 2GB or more.
> By comparison, Stempel using the same dictionary file works just fine with
> 1/8 of that (and possibly lower values as well).
> Sample error log entries:
> http://pastebin.com/fSrdd5W1
> http://pastebin.com/Lmi0re7Z