[ 
https://issues.apache.org/jira/browse/LUCENE-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915171#comment-13915171
 ] 

Lukas Vlcek commented on LUCENE-5468:
-------------------------------------

Amazing improvement!

While we are on Hunspell, I would like to propose some additional 
enhancements, but first I would like to ask whether you would be interested in 
seeing such improvements in the code. If so, I would be happy to open a new 
ticket for them.

1) AFAIR the Hunspell token filter has an option to set the level of 
recursion, originally hardcoded to 2 if I am not mistaken. But that recursion 
level counts prefix and suffix rules together - meaning that if it is set to 2 
and one prefix rule is applied, only one suffix rule can still be applied. 
What I would like to propose is adding an option to specify the recursion 
level separately for prefix rules and for suffix rules. How much this helps 
probably depends a lot on how the affix rules are constructed, but I can 
clearly see it would help in the case of the Czech dictionary - hopefully it 
might prove useful for other languages too.
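To make the idea concrete, here is a minimal toy sketch of affix stripping with separate prefix and suffix budgets. The class, rule sets, and method names are all hypothetical illustrations, not Lucene's actual Hunspell API:

```java
import java.util.HashSet;
import java.util.Set;

// Toy sketch: recursive affix stripping where prefix and suffix rules
// spend *independent* budgets, instead of one shared recursion level.
// The rules here are made-up English affixes for illustration only.
class SplitRecursionSketch {

    static final Set<String> PREFIXES = Set.of("un");
    static final Set<String> SUFFIXES = Set.of("ed", "s");

    // Collect candidate stems of 'word', allowing at most prefixDepth
    // prefix strips and suffixDepth suffix strips, counted separately.
    static void stems(String word, int prefixDepth, int suffixDepth, Set<String> out) {
        out.add(word);
        if (prefixDepth > 0) {
            for (String p : PREFIXES) {
                if (word.startsWith(p)) {
                    stems(word.substring(p.length()), prefixDepth - 1, suffixDepth, out);
                }
            }
        }
        if (suffixDepth > 0) {
            for (String s : SUFFIXES) {
                if (word.endsWith(s)) {
                    stems(word.substring(0, word.length() - s.length()), prefixDepth, suffixDepth - 1, out);
                }
            }
        }
    }
}
```

With a shared budget of 2, stripping "un" from "unlocked" would leave only one 
step for suffixes; with separate budgets of 1 and 1, both "un" and "ed" can be 
stripped and "lock" is still reached.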

2) Case sensitivity is a tricky part. The Czech dictionary is case sensitive 
and can deliver very nice results, but users cannot always fully benefit from 
this. The biggest problem I remember is tokens at the beginning of sentences: 
they start with a capital letter and thus may not be found in a dictionary 
where only the lowercased variant is recorded.
One useful solution to this issue could be an option to lowercase a token that 
has not been found in the dictionary and make a second pass through the filter 
with the lowercased token (it is costly, but it would be optional, so the user 
decides whether it is worth the indexing time).
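The fallback could look something like the following sketch. The dictionary is modeled as a plain `Set` and the class and method names are invented for illustration; real Hunspell lookup is of course more involved:

```java
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Sketch of the optional second pass: if the token is not found as-is,
// lowercase it and look it up again. This catches sentence-initial
// capitalized forms of entries recorded only in lowercase.
class LowercaseFallbackSketch {

    static Optional<String> lookup(Set<String> dict, String token) {
        if (dict.contains(token)) {
            return Optional.of(token);
        }
        // Second pass: costs an extra lookup, hence opt-in.
        String lower = token.toLowerCase(Locale.ROOT);
        if (!lower.equals(token) && dict.contains(lower)) {
            return Optional.of(lower);
        }
        return Optional.empty();
    }
}
```

So `lookup({"word"}, "Word")` would still resolve to "word", while a token 
absent in both forms is simply not matched.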

3) It would also be really useful if the Hunspell token filter provided an 
option to output all terms that result from applying the relevant rules to an 
input token (in essence, the opposite transformation to the one used during 
stemming). Such functionality would be useful for users who want to add a 
custom extension to an existing dictionary (having an option to load several 
dictionary files is really useful IMO) and want to check that they have 
constructed valid rules for specific words. Having Lucene support this 
directly via an exposed API would be great, I think (especially with later 
applications in Solr and Elasticsearch in mind).
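In spirit, this "reverse" direction runs the affix rules forwards: given a stem, emit every surface form the rules would generate, so a dictionary author can eyeball whether a custom rule covers the words they expect. A minimal sketch with hypothetical affix lists (not Hunspell's actual rule format, which also has conditions and strip parts):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of rule expansion: combine a stem with every prefix/suffix pair,
// including the empty affix on each side, to list all generated forms.
class ExpandSketch {

    static List<String> expand(String stem, List<String> prefixes, List<String> suffixes) {
        List<String> ps = new ArrayList<>(prefixes);
        ps.add(0, "");  // empty prefix -> bare stem / suffix-only forms
        List<String> ss = new ArrayList<>(suffixes);
        ss.add(0, "");  // empty suffix -> bare stem / prefix-only forms
        List<String> forms = new ArrayList<>();
        for (String p : ps) {
            for (String s : ss) {
                forms.add(p + stem + s);
            }
        }
        return forms;
    }
}
```

For example, expanding "lock" with prefix "un" and suffixes "ed"/"s" yields 
"lock", "locked", "locks", "unlock", "unlocked", "unlocks" - exactly the kind 
of list a dictionary author would want to inspect.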

> Hunspell very high memory use when loading dictionary
> -----------------------------------------------------
>
>                 Key: LUCENE-5468
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5468
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Maciej Lisiewski
>            Priority: Minor
>             Fix For: 4.8, 5.0
>
>         Attachments: LUCENE-5468.patch, patch.txt
>
>
> Hunspell stemmer requires gigantic (for the task) amounts of memory to load 
> dictionary/rules files. 
> For example loading a 4.5 MB polish dictionary (with empty index!) will cause 
> whole core to crash with various out of memory errors unless you set max heap 
> size close to 2GB or more.
> By comparison Stempel using the same dictionary file works just fine with 1/8 
> of that (and possibly lower values as well).
> Sample error log entries:
> http://pastebin.com/fSrdd5W1
> http://pastebin.com/Lmi0re7Z



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
