[
https://issues.apache.org/jira/browse/LUCENE-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011014#comment-14011014
]
Nik Everett commented on LUCENE-4556:
-------------------------------------
I'm having GC trouble and I'm using the DirectCandidateGenerator. It's
obviously hard to tell how much the automata contribute in production, but
when I try it locally, just generating the automata for two or three terms
takes about 200KB of memory. Napkin math (200KB * 250 queries/second) says
this makes about 50MB of garbage per second per index. Obviously it gets worse
if you run this in a sharded context where each shard does the generating.
Well, not really worse, but the large up-front cost and memory consumption of
this process are roughly constant regardless of shard size, so this becomes a
reason to use larger shards.
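
The napkin math above can be written out as a small sketch. The class and
method names are hypothetical, decimal units (1KB = 1000 bytes) are assumed,
and the shard multiplier assumes every shard builds the same automata for
every query, which may not match a real deployment.

```java
// Hypothetical helper for the back-of-the-envelope estimate; not part of Lucene.
final class GarbageEstimate {
    /** Bytes of short-lived garbage per second, summed over all shards. */
    static long bytesPerSecond(long bytesPerQuery, long queriesPerSecond, int shards) {
        return bytesPerQuery * queriesPerSecond * shards;
    }

    public static void main(String[] args) {
        long perQuery = 200_000L; // ~200KB of automaton garbage per query (local measurement)
        long qps = 250L;          // queries per second
        // One index, one shard: 200KB * 250/s
        System.out.println(bytesPerSecond(perQuery, qps, 1) / 1_000_000 + " MB/s"); // prints "50 MB/s"
        // Assumed 5 shards, each generating the automata for every query
        System.out.println(bytesPerSecond(perQuery, qps, 5) / 1_000_000 + " MB/s"); // prints "250 MB/s"
    }
}
```

The shard count of 5 is only an illustration; the point is that the per-query
cost is paid once per shard, so fewer, larger shards pay it fewer times.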
I'd propose that, in addition to Simon's patches, another option is to try to
implement something like the stack-based automaton simulation that the
Schulz-Mihov paper (the one that proposed the Levenshtein automaton) describes
in section 6. It's not useful for things like intersecting the enums, but if
you are willing to forgo that you could probably get away with much less
memory consumption.
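
To illustrate the memory trade-off, here is a minimal sketch of matching a
candidate against a term without materializing any automaton. To be clear,
this is NOT the stack-based algorithm from the paper; it is the plain
dynamic-programming row simulation, where the current row plays the role of
the automaton state. The class name is hypothetical and nothing here is a
Lucene API. It handles insertions, deletions, and substitutions only (no
transpositions).

```java
// Hypothetical sketch, not part of Lucene: on-the-fly Levenshtein matching
// that keeps two reusable int rows per term instead of states/transitions.
final class LevenshteinRowSimulator {
    private final int[] term;   // code points of the query term
    private final int maxEdits; // maximum allowed edit distance
    private int[] prev;         // row for the candidate prefix read so far
    private int[] curr;         // scratch row, swapped with prev on each step

    LevenshteinRowSimulator(String term, int maxEdits) {
        this.term = term.codePoints().toArray();
        this.maxEdits = maxEdits;
        this.prev = new int[this.term.length + 1];
        this.curr = new int[this.term.length + 1];
    }

    /** True if candidate is within maxEdits edits of the term. */
    boolean accepts(String candidate) {
        // Start state: distance from the empty prefix to each term prefix.
        for (int j = 0; j <= term.length; j++) {
            prev[j] = j;
        }
        int consumed = 0;
        for (int i = 0; i < candidate.length(); ) {
            int cp = candidate.codePointAt(i);
            i += Character.charCount(cp);
            consumed++;
            curr[0] = consumed;
            int rowMin = curr[0];
            for (int j = 1; j <= term.length; j++) {
                int cost = (term[j - 1] == cp) ? 0 : 1;
                int best = Math.min(prev[j] + 1, curr[j - 1] + 1);
                best = Math.min(best, prev[j - 1] + cost);
                curr[j] = best;
                rowMin = Math.min(rowMin, best);
            }
            // Row minima never decrease, so this is a dead state: stop early.
            if (rowMin > maxEdits) {
                return false;
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[term.length] <= maxEdits;
    }
}
```

One simulator per query term can be reused across all candidates, for example
`new LevenshteinRowSimulator("lucene", 1).accepts("lucine")`. The cost is that
every candidate has to be tested individually, which is the same limitation
noted above: there is no automaton to intersect with the terms enum.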
> FuzzyTermsEnum creates tons of objects
> --------------------------------------
>
> Key: LUCENE-4556
> URL: https://issues.apache.org/jira/browse/LUCENE-4556
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search, modules/spellchecker
> Affects Versions: 4.0
> Reporter: Simon Willnauer
> Assignee: Michael McCandless
> Priority: Critical
> Fix For: 4.9, 5.0
>
> Attachments: LUCENE-4556.patch, LUCENE-4556.patch
>
>
> I ran into this problem in production using the DirectSpellchecker. The
> number of objects created by the spellchecker shoots through the roof very
> quickly. We ran about 130 queries and ended up with > 2M transitions /
> states. We spend 50% of the time in GC just because of transitions. Other
> parts of the system behave just fine here.
> I talked quickly to Robert and gave a POC a shot, providing a
> LevenshteinAutomaton#toRunAutomaton(prefix, n) method to optimize this case
> and build an array-based structure converted into UTF-8 directly instead of
> going through the object-based APIs. This involved quite a few changes, but
> they are all package private at this point. I have a patch that still has a
> fair set of nocommits, but it shows that it's possible and IMO worth the
> trouble to make this really usable in production. All tests pass with the
> patch - it's a start....