[
https://issues.apache.org/jira/browse/LUCENE-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915230#comment-13915230
]
Robert Muir commented on LUCENE-5468:
-------------------------------------
OK, I was a little confused. I thought perhaps you were referring to the previous
discussion above about removing things :)
I just want to make it clear that I kept all the additional options we already had!
{quote}
So what I am proposing is having an option to set recursionCap separately for
prefix and suffix. In the case of the Czech dictionary I would say: you can apply
only one prefix rule and only one suffix rule (meaning you can NEVER apply two
prefix rules or two suffix rules).
{quote}
+1, can you open an issue for this?
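To illustrate the proposal, here is a minimal sketch of affix expansion with separate prefix and suffix recursion caps. The rule representation and class names (AffixCaps, Rule, expand) are invented for illustration and are not the Lucene Hunspell API; real Hunspell rules also carry replacements and conditions.

```java
import java.util.ArrayList;
import java.util.List;

public class AffixCaps {
    // Hypothetical rule: strips a literal affix. Real Hunspell rules also
    // add a replacement string and check continuation conditions.
    record Rule(String affix, boolean isPrefix) {}

    private final List<Rule> rules;
    private final int prefixCap, suffixCap;

    public AffixCaps(List<Rule> rules, int prefixCap, int suffixCap) {
        this.rules = rules;
        this.prefixCap = prefixCap;
        this.suffixCap = suffixCap;
    }

    // Collect every form reachable without exceeding either cap.
    // With prefixCap = suffixCap = 1, at most one prefix rule and one
    // suffix rule can ever apply -- never two of the same kind.
    public List<String> expand(String word) {
        List<String> out = new ArrayList<>();
        expand(word, prefixCap, suffixCap, out);
        return out;
    }

    private void expand(String word, int pLeft, int sLeft, List<String> out) {
        out.add(word);
        for (Rule r : rules) {
            if (r.isPrefix() && pLeft > 0 && word.startsWith(r.affix())) {
                expand(word.substring(r.affix().length()), pLeft - 1, sLeft, out);
            } else if (!r.isPrefix() && sLeft > 0 && word.endsWith(r.affix())) {
                expand(word.substring(0, word.length() - r.affix().length()),
                       pLeft, sLeft - 1, out);
            }
        }
    }
}
```

With one prefix rule ("un") and one suffix rule ("s"), expanding "undogs" under caps (1, 1) reaches "dogs", "undog", and "dog"; setting the prefix cap to 0 blocks any prefix stripping while still allowing the suffix rule.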
{quote}
As for ignoreCase - how does it work if the dictionary contains terms like "Xx"
and "xx" and each is allowed to use a different set of rules? I need to
distinguish between them.
{quote}
Right, that's why it does nothing by default :)
{quote}
But on the other hand, if the dictionary contains only "yy" but I get "Yy" as
input (because it was the first word of the sentence), would it be able to
process it correctly and still distinguish between "Xx" and "xx"?
{quote}
In my opinion, this is not the responsibility of this filter (it simply has
ignoreCase on or off). This has more to do with your analysis chain. So if you
want to always put a lowercase filter first, that's one approach. If you want to
use some rule/heuristic for sentence tokenization or other fancy stuff, you can
selectively lowercase and get what you want. But this filter knows nothing
about that :)
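The selective-lowercasing idea above can be sketched as a case-sensitive lookup with a lowercase fallback applied only to sentence-initial tokens. This is an illustration of the analysis-chain responsibility being described, not Lucene code; the class and method names (SelectiveLowercase, lookupStem) and the stem strings are invented.

```java
import java.util.HashMap;
import java.util.Map;

public class SelectiveLowercase {
    private final Map<String, String> dict = new HashMap<>();

    public SelectiveLowercase() {
        // "Xx" and "xx" are distinct entries, each with its own rules.
        dict.put("Xx", "stem-of-Xx");
        dict.put("xx", "stem-of-xx");
        dict.put("yy", "stem-of-yy");
    }

    // Exact (case-sensitive) match first; only when the token started a
    // sentence do we retry its lowercased form. So sentence-initial "Yy"
    // falls back to "yy", while "Xx" and "xx" mid-sentence stay distinct.
    public String lookupStem(String token, boolean sentenceInitial) {
        String stem = dict.get(token);
        if (stem == null && sentenceInitial) {
            stem = dict.get(token.toLowerCase());
        }
        return stem;
    }
}
```

The filter itself stays case-sensitive; the decision about *when* to lowercase lives upstream in the chain, which is exactly why a blanket ignoreCase flag cannot solve the "Xx" vs "xx" problem on its own.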
{quote}
I think it would not be hard to expose such an API, and I believe users would
appreciate it when constructing custom dictionaries (I tried that and was
missing such a feature; sure, I can implement it myself, but I believe having it
in Solr and Elasticsearch would be great. This is definitely not useful for the
indexing process, but as part of tuning your dictionary it would be helpful).
{quote}
Why not just use the hunspell command-line tools like 'unmunch', 'analyze', etc.
for that?
> Hunspell very high memory use when loading dictionary
> -----------------------------------------------------
>
> Key: LUCENE-5468
> URL: https://issues.apache.org/jira/browse/LUCENE-5468
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 3.5
> Reporter: Maciej Lisiewski
> Priority: Minor
> Fix For: 4.8, 5.0
>
> Attachments: LUCENE-5468.patch, patch.txt
>
>
> Hunspell stemmer requires gigantic (for the task) amounts of memory to load
> dictionary/rules files.
> For example, loading a 4.5 MB Polish dictionary (with an empty index!) will
> cause the whole core to crash with various out-of-memory errors unless you set
> the max heap size close to 2 GB or more.
> By comparison, Stempel using the same dictionary file works just fine with 1/8
> of that (and possibly lower values as well).
> Sample error log entries:
> http://pastebin.com/fSrdd5W1
> http://pastebin.com/Lmi0re7Z
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]