[
https://issues.apache.org/jira/browse/LUCENE-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915281#comment-13915281
]
Robert Muir commented on LUCENE-5468:
-------------------------------------
{quote}
Lowercasing in Hunspell - Robert, when you think about it, there is really no
simple solution to this using the existing Lucene analysis flow, AFAIK. If you
apply lowercasing BEFORE Hunspell, you lose the option to correctly stem the
uppercased token (if there is a record for it in the dictionary). If you apply
it AFTER Hunspell, you have the problem with the first token of a sentence (in
most cases). The other option is (as you mentioned) to employ a more
sophisticated analysis chain (but is there any suitable one in Lucene out of
the box, or do I have to go down the road of setting up a complex language
library or framework?).
So the option to allow lowercasing for a second pass is IMO a nice compromise
that can help a lot with really minimal effort (and it is also easy to explain
to users what it does and when to use it). It is not a perfect solution, but it
may be good enough under the 80/20 principle.
{quote}
There may not be, but it's about where the responsibility should be. It's more
than just the first token of a sentence: named entities etc. are involved too.
If you want to get this right, yes, you need a more sophisticated analysis
chain! That being said, I'm not against your 80/20 heuristic, I'm just not sure
how 80/20 it is :)
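For readers following along, the "second pass" compromise under discussion can be sketched roughly as follows: look the token up as-is first, and only on a miss retry with a lowercased copy. This is a toy illustration only; the dictionary entries and the lookup function are hypothetical and do not reflect Hunspell's actual API.

```python
# Toy sketch of the proposed lowercase-second-pass heuristic.
# TOY_DICT stands in for a loaded Hunspell dictionary (hypothetical data).
TOY_DICT = {
    "Prague": ["Prague"],   # proper noun, recorded with its original case
    "walked": ["walk"],     # regular inflected form
    "trees": ["tree"],
}

def stem_with_lowercase_fallback(token):
    """Return stems for token; fall back to a lowercased second pass."""
    stems = TOY_DICT.get(token)          # first pass: original case
    if stems is None:
        stems = TOY_DICT.get(token.lower())  # second pass: lowercased
    return stems or [token]              # unknown tokens pass through

# "Trees" at the start of a sentence misses as-is but hits after
# lowercasing, while "Prague" is stemmed without being lowercased first.
print(stem_with_lowercase_fallback("Trees"))   # ['tree']
print(stem_with_lowercase_fallback("Prague"))  # ['Prague']
```

This also shows why the heuristic is only 80/20: a capitalized word that exists in the dictionary in lowercase only (e.g. a named entity coinciding with a common noun) is still resolved by the second pass, right or wrong.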
{quote}
Getting all inflections - yes, there are command-line tools for this. But this
is really more about user-experience comfort; again, it is easy to explain how
to use it and what it does, and users do not have to mess with command-line
tools and the like. I am not sure how hard it would be to implement this with
what is in Hunspell now.
Also, a command-line tool run against dictionary files is one thing; using
Lucene code on a dictionary loaded into memory by Lucene is another. If there
are issues in the code, these two approaches can give different results (yes,
they should be the same...)
{quote}
On this one I honestly do disagree. I don't mean to sound rude, but if you are
smart enough to make a custom dictionary, I don't think I need to baby such
users and make them comfortable by duplicating in Java command-line tools they
can install themselves :) The tools provided by hunspell are the best here, and
anyone making a custom dictionary already needs to be digging into those
tools/docs to know what they are doing. I don't see the value in duplicating
this stuff and providing morphological generation and other super-advanced,
esoteric features when there are more basic things needed (like decomposition).
As for cases where the results differ: those are bugs and should be fixed...
> Hunspell very high memory use when loading dictionary
> -----------------------------------------------------
>
> Key: LUCENE-5468
> URL: https://issues.apache.org/jira/browse/LUCENE-5468
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 3.5
> Reporter: Maciej Lisiewski
> Priority: Minor
> Fix For: 4.8, 5.0
>
> Attachments: LUCENE-5468.patch, patch.txt
>
>
> The Hunspell stemmer requires gigantic (for the task) amounts of memory to
> load its dictionary/rules files.
> For example, loading a 4.5 MB Polish dictionary (with an empty index!) will
> cause the whole core to crash with various out-of-memory errors unless you
> set the max heap size close to 2 GB or more.
> By comparison, Stempel using the same dictionary file works just fine with
> 1/8 of that (and possibly lower values as well).
> Sample error log entries:
> http://pastebin.com/fSrdd5W1
> http://pastebin.com/Lmi0re7Z
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)