[ https://issues.apache.org/jira/browse/LUCENE-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915281#comment-13915281 ]

Robert Muir commented on LUCENE-5468:
-------------------------------------

{quote}
Lowercasing in Hunspell - Robert, when you think about it, there is really no 
simple solution to this using the existing Lucene analysis flow AFAIK. If you 
apply lowercasing BEFORE Hunspell, you lose the option to correctly stem the 
uppercased token (if there is any record for it in the dictionary). If you 
apply it AFTER Hunspell, you have the problem with the first token in 
sentences (in most cases). The other option is (as you mentioned) to employ 
some more sophisticated analysis chain (but is there any suitable one in 
Lucene out of the box, or do I have to go down the road of setting up a 
complex language library or framework?).
So the option to allow lowercasing on a second pass is IMO a nice compromise 
that can help a lot with really minimal effort (and it is also easy to explain 
to users what it does and when to use it). It is not a perfect solution, but 
it may be good enough per the 80/20 principle.
{quote}

There may not be, but it's about where the responsibility should be. It's 
about more than the first token in sentences: named entities etc. are involved 
too. If you want to get this right, yes, you need a more sophisticated 
analysis chain! That being said, I'm not against your 80/20 heuristic, I'm 
just not sure how 80/20 it is :)
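
For readers following along, the "lowercase on second pass" heuristic being debated can be sketched roughly as below. This is a toy illustration, not Lucene's actual HunspellStemFilter API: the dictionary is a hard-coded map, and the Polish surface forms and stems are made-up examples. The idea is simply to look the token up as-is first, and only fall back to the lowercased form when the original is absent, so sentence-initial capitalized words still stem while dictionary entries for proper nouns keep priority.

```java
import java.util.List;
import java.util.Map;

public class TwoPassStem {
    // Toy stand-in for a loaded Hunspell dictionary: surface form -> stems.
    // Real dictionaries are loaded from .dic/.aff files; these entries are
    // illustrative only.
    static final Map<String, List<String>> DICT = Map.of(
        "Warszawa", List.of("Warszawa"), // proper noun, stored capitalized
        "domy",     List.of("dom")       // common noun, stored lowercase
    );

    static List<String> stem(String token) {
        List<String> stems = DICT.get(token);           // first pass: as-is
        if (stems == null) {
            stems = DICT.get(token.toLowerCase());      // second pass: lowercased
        }
        return stems == null ? List.of(token) : stems;  // unknown: pass through
    }

    public static void main(String[] args) {
        System.out.println(stem("Warszawa")); // found as-is
        System.out.println(stem("Domy"));     // found only after lowercasing
        System.out.println(stem("xyz"));      // not in the dictionary at all
    }
}
```

As the discussion notes, this heuristic still gets some cases wrong (e.g. a capitalized word that happens to collide with a lowercase dictionary entry), which is the 80/20 trade-off being weighed.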

{quote}
Getting all inflections - yes, there are CL tools for this. But this is really 
more about user-experience comfort, and again, it is easy to explain how to 
use it and what it does, and users do not have to mess with CL tools and 
things like that. Not sure how hard it would be to implement this with what is 
in Hunspell now.
Also, one thing is some CL tool used against some dictionary files; another 
thing is using Lucene code on a dictionary loaded into memory by Lucene. If 
there are issues in the code, these two approaches can give different results 
(yes, they should be the same...)
{quote}

On this one I honestly do disagree. I don't mean to sound rude, but if you are 
smart enough to make a custom dictionary, I don't think I need to baby such 
users and make them comfortable by duplicating, in Java, command-line tools 
they can install themselves :) The tools provided by hunspell are the best 
here, and if someone is making a custom dictionary, they already need to be 
digging into these tools/docs to know what they are doing. I don't see value 
in duplicating this stuff and providing morphological generation and other 
super-advanced esoteric features when there are more basic things needed 
(like decomposition). As for cases where the results differ, those are bugs 
that should be fixed...


> Hunspell very high memory use when loading dictionary
> -----------------------------------------------------
>
>                 Key: LUCENE-5468
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5468
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Maciej Lisiewski
>            Priority: Minor
>             Fix For: 4.8, 5.0
>
>         Attachments: LUCENE-5468.patch, patch.txt
>
>
> The Hunspell stemmer requires gigantic (for the task) amounts of memory to 
> load dictionary/rules files. 
> For example, loading a 4.5 MB Polish dictionary (with an empty index!) will 
> cause the whole core to crash with various out-of-memory errors unless you 
> set the max heap size close to 2GB or more.
> By comparison, Stempel using the same dictionary file works just fine with 
> 1/8 of that (and possibly lower values as well).
> Sample error log entries:
> http://pastebin.com/fSrdd5W1
> http://pastebin.com/Lmi0re7Z



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
