[
https://issues.apache.org/jira/browse/LUCENE-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915171#comment-13915171
]
Lukas Vlcek commented on LUCENE-5468:
-------------------------------------
Amazing improvement!
While we are on Hunspell, I would like to propose some additional
enhancements, but first I would like to ask whether you would be interested in
seeing such improvements in the code. If so, I would be happy to open a new
ticket for them.
1) AFAIR the Hunspell token filter has an option to set the level of
recursion, originally hardcoded to 2 if I am not mistaken. However, the
recursion level counts prefix and suffix rules together - meaning that if it
is set to 2 and one prefix rule is applied, only one suffix rule can still be
applied. What I would like to propose is an option to specify the recursion
level separately for prefix rules and for suffix rules. How useful this is
probably depends a lot on how the affix rules are constructed, but I can
clearly see it would help in the case of the Czech dictionary - hopefully it
might prove useful for other languages too.
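To illustrate what I mean, here is a toy sketch (not Lucene's actual implementation; the rule lists and method names are made up) contrasting a shared recursion budget with separate prefix/suffix budgets. With the shared budget of 2, a word needing one prefix strip plus two suffix strips can never reach its root; with split budgets it can:

```java
import java.util.*;

// Toy model of affix stripping: rules are plain string prefixes/suffixes.
// All names and rule sets here are hypothetical, for illustration only.
public class RecursionBudget {
    static final List<String> PREFIXES = List.of("un");
    static final List<String> SUFFIXES = List.of("ness", "li");

    // Current behavior (as I recall it): one counter shared by prefix and
    // suffix strips, so each strip of either kind consumes the same budget.
    static Set<String> stripShared(String w, int budget) {
        Set<String> out = new HashSet<>();
        out.add(w);
        if (budget == 0) return out;
        for (String p : PREFIXES)
            if (w.startsWith(p))
                out.addAll(stripShared(w.substring(p.length()), budget - 1));
        for (String s : SUFFIXES)
            if (w.endsWith(s))
                out.addAll(stripShared(w.substring(0, w.length() - s.length()), budget - 1));
        return out;
    }

    // Proposed behavior: prefixes and suffixes draw from independent budgets.
    static Set<String> stripSplit(String w, int pBudget, int sBudget) {
        Set<String> out = new HashSet<>();
        out.add(w);
        for (String p : PREFIXES)
            if (pBudget > 0 && w.startsWith(p))
                out.addAll(stripSplit(w.substring(p.length()), pBudget - 1, sBudget));
        for (String s : SUFFIXES)
            if (sBudget > 0 && w.endsWith(s))
                out.addAll(stripSplit(w.substring(0, w.length() - s.length()), pBudget, sBudget - 1));
        return out;
    }
}
```

For "unkindliness" (un + kind + li + ness), a shared budget of 2 stops at "unkind" or "kindli", while split budgets of 1 prefix / 2 suffixes recover "kind".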
2) Case sensitivity is a tricky part. The Czech dictionary is case sensitive
and can deliver very nice results, but users cannot always fully benefit from
this. The biggest problem I remember is tokens at the beginning of sentences:
they start with a capital letter and thus may not be found in a dictionary
where only the lowercased variant is recorded.
One useful solution could be an option to lowercase a given token when it has
not been found in the dictionary and to make a second pass through the filter
with the lowercased token (this is costly, but it would be optional, so the
user decides whether it is worth the indexing time).
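A minimal sketch of the fallback I have in mind (the dictionary lookup and option name are hypothetical, not an existing Lucene API):

```java
import java.util.*;

// Sketch of the proposed optional second pass: if the token is not found in
// the dictionary, retry once with a lowercased copy, so that sentence-initial
// capitalized tokens still match their lowercased dictionary entries.
public class LowercaseFallback {
    static List<String> stem(String token,
                             Map<String, List<String>> dict,
                             boolean lowercaseFallback) {
        List<String> stems = dict.get(token);
        if (stems == null && lowercaseFallback) {
            // Second, more expensive pass with the lowercased token.
            // Opt-in, so the user decides if it is worth the indexing time.
            stems = dict.get(token.toLowerCase(Locale.ROOT));
        }
        return stems == null ? List.of() : stems;
    }
}
```

With a dictionary containing only "kocka", the token "Kocka" would stem only when the fallback is enabled.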
3) It would also be really useful if the Hunspell token filter provided an
option to output all the terms that result from applying the relevant rules to
an input token - in essence, the opposite transformation to the one used
during stemming. Such functionality would help users who want to add custom
extensions to an existing dictionary (having an option to load several
dictionary files is really useful, IMO) and check that they have constructed
valid rules for specific words. Having Lucene support this directly via an
exposed API would be great, I think (especially with later applications in
Solr and Elasticsearch in mind).
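Roughly, the expansion direction would look like this (a simplified sketch, not Hunspell's real rule format or any existing Lucene API; the rule record is made up):

```java
import java.util.*;

// Sketch of the proposed "expansion" direction: apply suffix rules to a root
// to enumerate the surface forms a dictionary entry would accept, so users
// can sanity-check custom rules they wrote for specific words.
public class AffixExpand {
    // A suffix rule in simplified Hunspell style: strip `strip` from the end
    // of the root, then append `add`.
    record SuffixRule(String strip, String add) {}

    static List<String> expand(String root, List<SuffixRule> rules) {
        List<String> forms = new ArrayList<>();
        forms.add(root); // the bare root is itself a valid form
        for (SuffixRule r : rules) {
            if (root.endsWith(r.strip)) {
                String base = root.substring(0, root.length() - r.strip.length());
                forms.add(base + r.add);
            }
        }
        return forms;
    }
}
```

For example, the rule (strip "y", add "ies") applied to the root "pony" would emit "ponies", letting the user confirm the rule does what they intended before shipping a custom dictionary extension.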
> Hunspell very high memory use when loading dictionary
> -----------------------------------------------------
>
> Key: LUCENE-5468
> URL: https://issues.apache.org/jira/browse/LUCENE-5468
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 3.5
> Reporter: Maciej Lisiewski
> Priority: Minor
> Fix For: 4.8, 5.0
>
> Attachments: LUCENE-5468.patch, patch.txt
>
>
> The Hunspell stemmer requires gigantic (for the task) amounts of memory to
> load dictionary/rule files.
> For example, loading a 4.5 MB Polish dictionary (with an empty index!) will
> cause the whole core to crash with various out-of-memory errors unless the
> max heap size is set close to 2 GB or more.
> By comparison, Stempel using the same dictionary file works just fine with
> 1/8 of that (and possibly lower values as well).
> Sample error log entries:
> http://pastebin.com/fSrdd5W1
> http://pastebin.com/Lmi0re7Z
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)