[
https://issues.apache.org/jira/browse/LUCENE-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915230#comment-13915230
]
Robert Muir commented on LUCENE-5468:
-------------------------------------
OK, I was a little confused. I thought perhaps you were referring to the previous
discussion above about removing things :)
I just want to make it clear that I kept all the additional options we already had!
{quote}
So what I am proposing is having an option to set recursionCap separately for
prefix and suffix. In the case of the Czech dictionary I would say: you can apply
only one prefix rule and only one suffix rule (meaning you can NEVER apply two
prefix rules or two suffix rules).
{quote}
+1, can you open an issue for this?
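To illustrate the proposal, here is a minimal sketch of affix expansion with separate prefix and suffix recursion caps. The rule representation and class names (AffixCaps, Rule, expand) are invented for illustration and are not the Lucene Hunspell API; real Hunspell rules also carry replacements and conditions.

```java
import java.util.ArrayList;
import java.util.List;

public class AffixCaps {
    // Hypothetical rule: strips a literal affix. Real Hunspell rules also
    // add a replacement string and check continuation conditions.
    record Rule(String affix, boolean isPrefix) {}

    private final List<Rule> rules;
    private final int prefixCap, suffixCap;

    public AffixCaps(List<Rule> rules, int prefixCap, int suffixCap) {
        this.rules = rules;
        this.prefixCap = prefixCap;
        this.suffixCap = suffixCap;
    }

    // Collect every form reachable without exceeding either cap.
    // With prefixCap = suffixCap = 1, at most one prefix rule and one
    // suffix rule can ever apply -- never two of the same kind.
    public List<String> expand(String word) {
        List<String> out = new ArrayList<>();
        expand(word, prefixCap, suffixCap, out);
        return out;
    }

    private void expand(String word, int pLeft, int sLeft, List<String> out) {
        out.add(word);
        for (Rule r : rules) {
            if (r.isPrefix() && pLeft > 0 && word.startsWith(r.affix())) {
                expand(word.substring(r.affix().length()), pLeft - 1, sLeft, out);
            } else if (!r.isPrefix() && sLeft > 0 && word.endsWith(r.affix())) {
                expand(word.substring(0, word.length() - r.affix().length()),
                       pLeft, sLeft - 1, out);
            }
        }
    }
}
```

With one prefix rule ("un") and one suffix rule ("s"), expanding "undogs" under caps (1, 1) reaches "dogs", "undog", and "dog"; setting the prefix cap to 0 blocks any prefix stripping while still allowing the suffix rule.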
{quote}
As for ignoreCase - how does it work if the dictionary contains terms like "Xx"
and "xx" and each is allowed to use a different set of rules? I need to
distinguish between them.
{quote}
Right, that's why it does nothing by default :)
{quote}
But on the other hand, if the dictionary contains only "yy" but I get "Yy" as
input (because it was the first word of the sentence), would it be able to
process it correctly and still distinguish between "Xx" and "xx"?
{quote}
In my opinion, this is not the responsibility of this filter (it simply has
ignoreCase on or off). This has more to do with your analysis chain. So if you
want to always put a lowercase filter first, that's one approach. If you want to
use some rule/heuristic for sentence tokenization or other fancy stuff, you can
selectively lowercase and get what you want. But this filter knows nothing
about that :)
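The selective-lowercasing idea above can be sketched as a case-sensitive lookup with a lowercase fallback applied only to sentence-initial tokens. This is an illustration of the analysis-chain responsibility being described, not Lucene code; the class and method names (SelectiveLowercase, lookupStem) and the stem strings are invented.

```java
import java.util.HashMap;
import java.util.Map;

public class SelectiveLowercase {
    private final Map<String, String> dict = new HashMap<>();

    public SelectiveLowercase() {
        // "Xx" and "xx" are distinct entries, each with its own rules.
        dict.put("Xx", "stem-of-Xx");
        dict.put("xx", "stem-of-xx");
        dict.put("yy", "stem-of-yy");
    }

    // Exact (case-sensitive) match first; only when the token started a
    // sentence do we retry its lowercased form. So sentence-initial "Yy"
    // falls back to "yy", while "Xx" and "xx" mid-sentence stay distinct.
    public String lookupStem(String token, boolean sentenceInitial) {
        String stem = dict.get(token);
        if (stem == null && sentenceInitial) {
            stem = dict.get(token.toLowerCase());
        }
        return stem;
    }
}
```

The filter itself stays case-sensitive; the decision about *when* to lowercase lives upstream in the chain, which is exactly why a blanket ignoreCase flag cannot solve the "Xx" vs "xx" problem on its own.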
{quote}
I think it would not be hard to expose such an API, and I believe users would
appreciate it when constructing custom dictionaries (I tried that and was
missing such a feature; sure, I can implement it myself, but I believe having it
in Solr and Elasticsearch would be great. This is definitely not useful for the
indexing process, but as part of tuning your dictionary it would be helpful).
{quote}
Why not just use the hunspell command-line tools like 'unmunch', 'analyze', etc.
for that?
> Hunspell very high memory use when loading dictionary
> -----------------------------------------------------
>
> Key: LUCENE-5468
> URL: https://issues.apache.org/jira/browse/LUCENE-5468
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 3.5
> Reporter: Maciej Lisiewski
> Priority: Minor
> Fix For: 4.8, 5.0
>
> Attachments: LUCENE-5468.patch, patch.txt
>
>
> Hunspell stemmer requires gigantic (for the task) amounts of memory to load
> dictionary/rules files.
> For example, loading a 4.5 MB Polish dictionary (with an empty index!) will
> cause the whole core to crash with various out-of-memory errors unless you set
> the max heap size close to 2 GB or more.
> By comparison, Stempel using the same dictionary file works just fine with 1/8
> of that (and possibly lower values as well).
> Sample error log entries:
> http://pastebin.com/fSrdd5W1
> http://pastebin.com/Lmi0re7Z
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]