[ 
https://issues.apache.org/jira/browse/LUCENE-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011014#comment-14011014
 ] 

Nik Everett commented on LUCENE-4556:
-------------------------------------

I'm having GC trouble and I'm using the DirectCandidateGenerator.  It's 
obviously kind of hard to tell how much the automata are contributing in 
production, but when I try it locally, just generating the automata for two or 
three terms takes about 200KB of memory.  Napkin math (200KB * 
250 queries/second) says this makes about 50MB of garbage per second per index.  
Obviously it gets worse if you run this in a sharded context where each shard 
does the generating.  Well, not really worse, but the large up-front cost and 
memory consumption of this process are relatively static regardless of shard 
size, so this becomes a reason to use larger shards. 
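
For anyone who wants to redo the napkin math with their own numbers, here is a 
minimal sketch.  The 200KB and 250 queries/second figures are the rough ones 
from above; the class name, the method, and the shard count are hypothetical 
and only there for illustration.

```java
// Back-of-the-envelope estimate of garbage produced by automaton generation.
// All inputs are rough assumptions, not measurements from production.
final class GarbageEstimate {
    // ~200KB allocated per query (decimal KB, to match the napkin math).
    static final long BYTES_PER_QUERY = 200L * 1000;
    static final long QUERIES_PER_SECOND = 250;

    // Every shard builds its own automata, so the rate scales with shard count.
    static long garbageBytesPerSecond(long bytesPerQuery, long queriesPerSecond, int shards) {
        return bytesPerQuery * queriesPerSecond * shards;
    }

    public static void main(String[] args) {
        long oneShard = garbageBytesPerSecond(BYTES_PER_QUERY, QUERIES_PER_SECOND, 1);
        // Hypothetical: five shards of the same index on one node.
        long fiveShards = garbageBytesPerSecond(BYTES_PER_QUERY, QUERIES_PER_SECOND, 5);
        System.out.println(oneShard / 1_000_000 + " MB/s with one shard");
        System.out.println(fiveShards / 1_000_000 + " MB/s with five shards");
    }
}
```

Since the per-query cost barely changes with shard size, the only knob left in 
this estimate is the number of shards doing the work.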

I should propose that, in addition to Simon's patches, another option is to 
try to implement something like the stack-based automaton simulation that the 
Schulz-Mihov paper (the one that proposed the Lev automaton) describes in 
section 6.  It's not useful for stuff like intersecting the enums, but if you 
are willing to forgo that you could probably get away with much less memory 
consumption.
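
To make the trade-off concrete, here is a rough sketch.  To be clear, this is 
NOT the simulation from the paper; it is a plain bounded dynamic-programming 
edit distance check, swapped in only to show the shape of the idea: no 
automaton is built, two int arrays are reused for every candidate, and in 
exchange you have to test candidate terms one by one instead of intersecting 
with the terms enum.  The class and method names are hypothetical, and it 
compares UTF-16 chars rather than code points to stay short.

```java
// Bounded Levenshtein check that allocates only at construction time.
// Not the Schulz-Mihov simulation; a simple stand-in for illustration.
final class BoundedLevenshtein {
    private final char[] query;
    private final int maxEdits;
    private int[] prev; // previous DP row, reused across calls
    private int[] curr; // current DP row, reused across calls

    BoundedLevenshtein(String query, int maxEdits) {
        this.query = query.toCharArray();
        this.maxEdits = maxEdits;
        this.prev = new int[this.query.length + 1];
        this.curr = new int[this.query.length + 1];
    }

    /** Returns true if candidate is within maxEdits edits of the query. */
    boolean matches(CharSequence candidate) {
        int n = query.length;
        // A length difference larger than maxEdits can never be repaired.
        if (Math.abs(candidate.length() - n) > maxEdits) {
            return false;
        }
        for (int j = 0; j <= n; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= candidate.length(); i++) {
            char c = candidate.charAt(i - 1);
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= n; j++) {
                int cost = (query[j - 1] == c) ? 0 : 1;
                int v = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
                curr[j] = v;
                rowMin = Math.min(rowMin, v);
            }
            // Row minima never decrease, so we can give up early.
            if (rowMin > maxEdits) {
                return false;
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[n] <= maxEdits;
    }
}
```

Usage would be something like `new BoundedLevenshtein("lucene", 2)` created 
once per query term and then `matches(term)` called per candidate, so the 
garbage per query is two small arrays rather than a full set of states and 
transitions.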

> FuzzyTermsEnum creates tons of objects
> --------------------------------------
>
>                 Key: LUCENE-4556
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4556
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search, modules/spellchecker
>    Affects Versions: 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael McCandless
>            Priority: Critical
>             Fix For: 4.9, 5.0
>
>         Attachments: LUCENE-4556.patch, LUCENE-4556.patch
>
>
> I ran into this problem in production using the DirectSpellchecker. The 
> number of objects created by the spellchecker shoots through the roof very 
> very quickly. We ran about 130 queries and ended up with > 2M transitions / 
> states. We spend 50% of the time in GC just because of transitions. Other 
> parts of the system behave just fine here.
> I talked quickly to Robert and gave a POC a shot providing a 
> LevenshteinAutomaton#toRunAutomaton(prefix, n) method to optimize this case 
> and build an array-based structure converted into UTF-8 directly instead of 
> going through the object-based APIs. This involved quite a few changes but 
> they are all package private at this point. I have a patch that still has a 
> fair set of nocommits but it shows that it's possible and IMO worth the 
> trouble to make this really usable in production. All tests pass with the 
> patch - it's a start....



--
This message was sent by Atlassian JIRA
(v6.2#6252)
