[
https://issues.apache.org/jira/browse/LUCENE-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915266#comment-13915266
]
Lukas Vlcek commented on LUCENE-5468:
-------------------------------------
OK, I will open a new ticket for the recursionCap tomorrow (it is late on my
end now).
Just a few quick comments on my two other suggestions:
Lowercasing in Hunspell - Robert, when you think about it, there is really no
simple solution to this using the existing Lucene analysis flow, AFAIK. If you
apply lowercasing BEFORE Hunspell, you lose the option to correctly stem the
uppercased token (if there is a record for it in the dictionary). If you apply
it AFTER Hunspell, you have the problem with the first token of a sentence (in
most cases). The other option is (as you mentioned) to employ a more
sophisticated analysis chain (but is there any suitable one in Lucene out of
the box, or do I have to go down the road of setting up a complex language
library or framework?)
So the option to allow lowercasing for a second pass is IMO a nice compromise
that can help a lot with really minimal effort (and it is also easy to explain
to users what it does and when to use it). It is not a perfect solution, but it
may be good enough in the 80/20 sense.
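The "second pass" idea could be sketched roughly like this (a toy illustration in Python, not Lucene's actual Hunspell API; the function name and the in-memory dictionary are hypothetical):

```python
# Toy sketch of the proposed "lowercase second pass" fallback:
# try to stem the token as-is first; only if the dictionary has no
# entry for it, retry with the lowercased form. The dictionary below
# is a made-up stand-in for a loaded Hunspell .dic file.
STEMS = {
    "praha": ["praha"],   # toy entries keyed by surface form
    "pes": ["pes"],
}

def stem_with_lowercase_fallback(token, stems=STEMS):
    """Return dictionary stems for token, falling back to its lowercase form."""
    if token in stems:            # first pass: the token exactly as written
        return stems[token]
    lowered = token.lower()       # second pass: lowercased token
    if lowered in stems:
        return stems[lowered]
    return []                     # no entry either way

# "Praha" at the start of a sentence still stems via the lowercase pass,
# while a token with its own dictionary record is matched exactly first.
print(stem_with_lowercase_fallback("Praha"))  # ['praha']
print(stem_with_lowercase_fallback("pes"))    # ['pes']
```

The point of the fallback order is that an uppercased token with its own dictionary record still wins the first pass, which is exactly what applying a LowerCaseFilter before stemming would destroy.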
Getting all inflections - yes, there are command-line tools for this. But this
is really more about user experience and comfort, and again, it is easy to
explain how to use it and what it does, and users do not have to mess with
command-line tools and the like. I am not sure how hard it would be to
implement this with what is in Hunspell now.
Also, running a command-line tool against the dictionary files is one thing;
using Lucene code on a dictionary loaded into memory by Lucene is another. If
there are bugs in the code, these two approaches can give different results
(yes, they should be the same...)
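The "all inflections" idea could be sketched as a toy affix expansion (this is a deliberate simplification in the spirit of Hunspell's SFX rules; the rule format and names here are hypothetical, not the real .aff syntax):

```python
# Toy sketch of generating every surface form for a stem from its affix
# flags, in the spirit of Hunspell suffix (SFX) rules. Each rule is a
# (strip, add) pair: strip that suffix from the stem, then append `add`.
SUFFIX_RULES = {
    "A": [("", "s")],     # flag A: plain plural, e.g. cat -> cats
    "B": [("y", "ies")],  # flag B: y -> ies, e.g. pony -> ponies
}

def inflections(stem, flags, rules=SUFFIX_RULES):
    """Return the stem plus every form its flags generate."""
    forms = [stem]
    for flag in flags:
        for strip, add in rules.get(flag, []):
            if stem.endswith(strip):
                # strip == "" leaves the stem intact; otherwise drop the suffix
                forms.append(stem[:len(stem) - len(strip)] + add)
    return forms

print(inflections("cat", "A"))    # ['cat', 'cats']
print(inflections("pony", "B"))   # ['pony', 'ponies']
```

A real implementation would of course have to reuse the rules Lucene already parsed into memory, which is exactly why doing it inside Lucene (rather than via an external command-line tool) keeps the results consistent with what the stemmer itself does.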
> Hunspell very high memory use when loading dictionary
> -----------------------------------------------------
>
> Key: LUCENE-5468
> URL: https://issues.apache.org/jira/browse/LUCENE-5468
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 3.5
> Reporter: Maciej Lisiewski
> Priority: Minor
> Fix For: 4.8, 5.0
>
> Attachments: LUCENE-5468.patch, patch.txt
>
>
> The Hunspell stemmer requires gigantic (for the task) amounts of memory to
> load dictionary/rules files.
> For example, loading a 4.5 MB Polish dictionary (with an empty index!) will
> cause the whole core to crash with various out-of-memory errors unless you
> set the max heap size close to 2GB or more.
> By comparison, Stempel using the same dictionary file works just fine with
> 1/8 of that (and possibly lower values as well).
> Sample error log entries:
> http://pastebin.com/fSrdd5W1
> http://pastebin.com/Lmi0re7Z
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]