[ 
https://issues.apache.org/jira/browse/LUCENE-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496439#comment-13496439
 ] 

Michael McCandless commented on LUCENE-4556:
--------------------------------------------

What spooks me about this patch is this code (LevenshteinAutomaton) is already 
REALLY hairy ... and this change would add yet more hair to it (when really we 
need to be doing the reverse, so the code becomes approachable to new eyeballs).

Also: are we sure the objects created here are really such a heavy GC load...?

I ran a quick test, respelling (using DirectSpellChecker w/ its defaults) a 
set of 500 5-character terms against the full English Wikipedia (33.M docs) 
index, using the concurrent mark/sweep collector w/ a 2 GB heap on a 24 core 
box, and I couldn't see any difference in net throughput: with and without 
the patch, both runs got ~780 respells/sec.
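
For anyone who wants to check GC load directly rather than infer it from 
throughput, here is a minimal sketch using the JDK's GarbageCollectorMXBean. 
It is not the harness used for the numbers above. The class name 
GcOverheadProbe and the allocation loop are hypothetical stand-ins; a real 
run would replace the loop with the respell calls. The reported GC time is 
approximate and what it covers depends on the collector.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: compares wall-clock time of a workload with the
// collection time the JVM's collectors report over the same interval.
class GcOverheadProbe {

  /** Sums accumulated collection time (ms) over all collectors. */
  static long totalGcMillis() {
    long total = 0;
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      long t = gc.getCollectionTime(); // -1 if the collector does not report it
      if (t > 0) {
        total += t;
      }
    }
    return total;
  }

  /** Runs the workload once and returns {wallMillis, gcMillis}. */
  static long[] measure(Runnable workload) {
    long gcBefore = totalGcMillis();
    long start = System.nanoTime();
    workload.run();
    long wallMillis = (System.nanoTime() - start) / 1_000_000;
    long gcMillis = totalGcMillis() - gcBefore;
    return new long[] {wallMillis, gcMillis};
  }

  public static void main(String[] args) {
    // Stand-in workload (an assumption): allocate many short-lived arrays.
    long[] result = measure(() -> {
      long sink = 0;
      for (int i = 0; i < 5_000_000; i++) {
        int[] garbage = new int[16];
        garbage[0] = i;
        sink += garbage[0];
      }
      if (sink == 42) {
        System.out.println(sink); // keeps the loop from being optimized away
      }
    });
    System.out.println("wall ms=" + result[0] + " gc ms=" + result[1]);
  }
}
```

Running the same workload with and without a patch and comparing the two 
"gc ms" figures gives a rough idea of whether allocation is really the 
bottleneck.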

Simon, can you describe the use case where you're seeing GC cut throughput 
by 50%?
                
> FuzzyTermsEnum creates tons of objects
> --------------------------------------
>
>                 Key: LUCENE-4556
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4556
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search, modules/spellchecker
>    Affects Versions: 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>            Priority: Critical
>             Fix For: 4.1, 5.0
>
>         Attachments: LUCENE-4556.patch
>
>
> I ran into this problem in production using the DirectSpellChecker. The 
> number of objects created by the spellchecker shoots through the roof very 
> quickly. We ran about 130 queries and ended up with > 2M transitions / 
> states. We spend 50% of the time in GC just because of transitions. Other 
> parts of the system behave just fine here.
> I talked quickly to Robert and gave a POC a shot, providing a 
> LevenshteinAutomaton#toRunAutomaton(prefix, n) method to optimize this case 
> and build an array-based structure converted into UTF-8 directly instead of 
> going through the object-based APIs. This involved quite a few changes, but 
> they are all package private at this point. I have a patch that still has a 
> fair set of nocommits, but it shows that it's possible and IMO worth the 
> trouble to make this really usable in production. All tests pass with the 
> patch - it's a start....
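
To make the "array based" idea in the issue description concrete, here is a 
minimal, self-contained sketch of a DFA whose transitions live in one flat 
int[] instead of one object per state and per transition. It is not the code 
from the attached patch and not Lucene's API; the class ArrayDfa, its 
methods, and the two-symbol alphabet are all hypothetical, chosen only to 
show why a table-driven layout allocates so few objects.

```java
import java.util.Arrays;

// Hypothetical table-driven DFA: the whole automaton is two arrays, so
// building it allocates a constant number of objects regardless of how many
// states and transitions it has.
class ArrayDfa {
  private final int alphabetSize;
  private final int[] transitions; // transitions[state * alphabetSize + symbol] = next state, or -1
  private final boolean[] accept;

  ArrayDfa(int numStates, int alphabetSize) {
    this.alphabetSize = alphabetSize;
    this.transitions = new int[numStates * alphabetSize];
    Arrays.fill(this.transitions, -1); // no transition by default
    this.accept = new boolean[numStates];
  }

  void addTransition(int from, int symbol, int to) {
    transitions[from * alphabetSize + symbol] = to;
  }

  void setAccept(int state) {
    accept[state] = true;
  }

  /** Runs the DFA from state 0 over the input; true if it ends in an accept state. */
  boolean run(int[] symbols) {
    int state = 0;
    for (int symbol : symbols) {
      state = transitions[state * alphabetSize + symbol];
      if (state == -1) {
        return false; // dead end: no transition for this symbol
      }
    }
    return accept[state];
  }

  public static void main(String[] args) {
    // Alphabet: 0 = 'a', 1 = 'b'. Language accepted: "a" followed by any number of "b".
    ArrayDfa dfa = new ArrayDfa(2, 2);
    dfa.addTransition(0, 0, 1);
    dfa.addTransition(1, 1, 1);
    dfa.setAccept(1);
    System.out.println(dfa.run(new int[] {0, 1, 1}));
    System.out.println(dfa.run(new int[] {1}));
  }
}
```

The trade-off is the one raised in the comment above: a flat table is cheap 
for the collector but harder to read and modify than an object graph.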
