[
https://issues.apache.org/jira/browse/LUCENE-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915266#comment-13915266
]
Lukas Vlcek commented on LUCENE-5468:
-------------------------------------
OK, I will open a new ticket for the recursionCap tomorrow (it is late on my
end now).
Just a few quick comments on my two other suggestions:
Lowercasing in Hunspell - Robert, when you think about it, there is really no
simple solution to this using the existing Lucene analysis flow, AFAIK. If you
apply lowercasing BEFORE Hunspell, you lose the option to correctly stem the
uppercased token (if there is a record for it in the dictionary). If you apply
it AFTER Hunspell, you have the problem with the first token of a sentence (in
most cases). The other option is (as you mentioned) to employ a more
sophisticated analysis chain (but is there any suitable one in Lucene out of
the box, or do I have to go down the road of setting up a complex language
library or framework?)
So the option to allow lowercasing for a second pass is IMO a nice compromise
that can help a lot with really minimal effort (and it is also easy to explain
to users what it does and when to use it). It is not a perfect solution, but it
may be good enough in the 80/20 sense.
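The "second pass" idea could be sketched roughly like this (a toy illustration in Python, not Lucene's actual Hunspell API; the function name and the in-memory dictionary are hypothetical):

```python
# Toy sketch of the proposed "lowercase second pass" fallback:
# try to stem the token as-is first; only if the dictionary has no
# entry for it, retry with the lowercased form. The dictionary below
# is a made-up stand-in for a loaded Hunspell .dic file.
STEMS = {
    "praha": ["praha"],   # toy entries keyed by surface form
    "pes": ["pes"],
}

def stem_with_lowercase_fallback(token, stems=STEMS):
    """Return dictionary stems for token, falling back to its lowercase form."""
    if token in stems:            # first pass: the token exactly as written
        return stems[token]
    lowered = token.lower()       # second pass: lowercased token
    if lowered in stems:
        return stems[lowered]
    return []                     # no entry either way

# "Praha" at the start of a sentence still stems via the lowercase pass,
# while a token with its own dictionary record is matched exactly first.
print(stem_with_lowercase_fallback("Praha"))  # ['praha']
print(stem_with_lowercase_fallback("pes"))    # ['pes']
```

The point of the fallback order is that an uppercased token with its own dictionary record still wins the first pass, which is exactly what applying a LowerCaseFilter before stemming would destroy.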
Getting all inflections - yes, there are command-line tools for this. But this
is really more about user experience and comfort, and again, it is easy to
explain how to use it and what it does, and users do not have to mess with
command-line tools and the like. I am not sure how hard it would be to
implement this with what is in Hunspell now.
Also, running a command-line tool against the dictionary files is one thing;
using Lucene code on a dictionary loaded into memory by Lucene is another. If
there are bugs in the code, these two approaches can give different results
(yes, they should be the same...)
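The "all inflections" idea could be sketched as a toy affix expansion (this is a deliberate simplification in the spirit of Hunspell's SFX rules; the rule format and names here are hypothetical, not the real .aff syntax):

```python
# Toy sketch of generating every surface form for a stem from its affix
# flags, in the spirit of Hunspell suffix (SFX) rules. Each rule is a
# (strip, add) pair: strip that suffix from the stem, then append `add`.
SUFFIX_RULES = {
    "A": [("", "s")],     # flag A: plain plural, e.g. cat -> cats
    "B": [("y", "ies")],  # flag B: y -> ies, e.g. pony -> ponies
}

def inflections(stem, flags, rules=SUFFIX_RULES):
    """Return the stem plus every form its flags generate."""
    forms = [stem]
    for flag in flags:
        for strip, add in rules.get(flag, []):
            if stem.endswith(strip):
                # strip == "" leaves the stem intact; otherwise drop the suffix
                forms.append(stem[:len(stem) - len(strip)] + add)
    return forms

print(inflections("cat", "A"))    # ['cat', 'cats']
print(inflections("pony", "B"))   # ['pony', 'ponies']
```

A real implementation would of course have to reuse the rules Lucene already parsed into memory, which is exactly why doing it inside Lucene (rather than via an external command-line tool) keeps the results consistent with what the stemmer itself does.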
> Hunspell very high memory use when loading dictionary
> -----------------------------------------------------
>
> Key: LUCENE-5468
> URL: https://issues.apache.org/jira/browse/LUCENE-5468
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 3.5
> Reporter: Maciej Lisiewski
> Priority: Minor
> Fix For: 4.8, 5.0
>
> Attachments: LUCENE-5468.patch, patch.txt
>
>
> The Hunspell stemmer requires gigantic (for the task) amounts of memory to
> load dictionary/rules files.
> For example, loading a 4.5 MB Polish dictionary (with an empty index!) will
> cause the whole core to crash with various out-of-memory errors unless you
> set the max heap size close to 2GB or more.
> By comparison, Stempel using the same dictionary file works just fine with
> 1/8 of that (and possibly lower values as well).
> Sample error log entries:
> http://pastebin.com/fSrdd5W1
> http://pastebin.com/Lmi0re7Z
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]