[ 
https://issues.apache.org/jira/browse/LUCENE-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13917408#comment-13917408
 ] 

Robert Muir commented on LUCENE-5468:
-------------------------------------

{quote}
I fully understand YPOW. The question of responsibility is important. But if I 
consider that a workaround like lowercasing for an optional second pass could be 
easier than telling the user to set up a complicated analysis chain (or employ an 
external system), then I believe it might make sense to make a qualified exception.
{quote}

This responsibility really is important, though. Maybe you should break away 
from the Czech dictionary and look at the others before you decide that it's 
"easiest" here. For example, the German dictionary has lots of complex casing 
rules encoded in the affix file itself for decompounding purposes. This feature 
is already *plenty complicated*. If you can do *ANYTHING*, and I mean *ANYTHING*, 
outside of it in any way, we should keep it out of here.

{quote}
As you pointed out, there are command-line tools for this, but I simply did not 
want to learn them (I did not feel like a wizard). And the good question is 
whether Lucene should provide an API that could be used for this task. At the 
end of the day, Lucene is said to be an IR library with language analysis 
capabilities, so why not? But I am fine with leaving this feature out for now; I 
just wanted to explain some of my motivations for it.
{quote}

Because it's an IR library, not a tool for building lexical resources. We just 
don't have the resources to "compete" with that, we don't have people who need 
it, and why waste our time when there are perfectly good tools available? I 
don't know why you refuse to "learn" the hunspell tools; they are trivial to 
learn!

Besides the commandline tools, quick searches reveal GUI tools too, such as 
http://marcoagpinto.cidadevirtual.pt/proofingtoolgui.html. Quote from the page: 
"My tool is so intuitive that even a 6-year-old kid can use it."

I don't think such work should be duplicated inside the Apache Lucene project.

> Hunspell very high memory use when loading dictionary
> -----------------------------------------------------
>
>                 Key: LUCENE-5468
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5468
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Maciej Lisiewski
>            Priority: Minor
>             Fix For: 4.8, 5.0
>
>         Attachments: LUCENE-5468.patch, patch.txt
>
>
> The Hunspell stemmer requires gigantic (for the task) amounts of memory to load 
> dictionary/rules files. 
> For example, loading a 4.5 MB Polish dictionary (with an empty index!) will cause 
> the whole core to crash with various out-of-memory errors unless you set the max 
> heap size close to 2GB or more.
> By comparison, Stempel using the same dictionary file works just fine with 1/8 
> of that (and possibly lower values as well).
> Sample error log entries:
> http://pastebin.com/fSrdd5W1
> http://pastebin.com/Lmi0re7Z



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
