[
https://issues.apache.org/jira/browse/LUCENE-4481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless updated LUCENE-4481:
---------------------------------------
Attachment: LUCENE-4481.patch
OK, new patch, this time adding back some optimizations:
{noformat}
[junit4:junit4] Suite: org.apache.lucene.search.suggest.LookupBenchmarkTest
[junit4:junit4] 2> -- RAM consumption
[junit4:junit4] 2> JaspellLookup size[B]: 9,815,152
[junit4:junit4] 2> TSTLookup size[B]: 9,858,792
[junit4:junit4] 2> FSTCompletionLookup size[B]: 466,520
[junit4:junit4] 2> WFSTCompletionLookup size[B]: 507,640
[junit4:junit4] 2> AnalyzingSuggester size[B]: 889,138
[junit4:junit4] OK 1.67s | LookupBenchmarkTest.testStorageNeeds
[junit4:junit4] 2> -- prefixes: 6-9, num: 7, onlyMorePopular: true
[junit4:junit4] 2> JaspellLookup queries: 50001, time[ms]: 108 [+- 8.81], ~kQPS: 464
[junit4:junit4] 2> TSTLookup queries: 50001, time[ms]: 79 [+- 1.07], ~kQPS: 631
[junit4:junit4] 2> FSTCompletionLookup queries: 50001, time[ms]: 148 [+- 2.54], ~kQPS: 339
[junit4:junit4] 2> WFSTCompletionLookup queries: 50001, time[ms]: 67 [+- 2.78], ~kQPS: 745
[junit4:junit4] 2> AnalyzingSuggester queries: 50001, time[ms]: 260 [+- 3.92], ~kQPS: 192
[junit4:junit4] OK 14.6s | LookupBenchmarkTest.testPerformanceOnPrefixes6_9
[junit4:junit4] 2> -- prefixes: 2-4, num: 7, onlyMorePopular: true
[junit4:junit4] 2> JaspellLookup queries: 50001, time[ms]: 262 [+- 5.16], ~kQPS: 191
[junit4:junit4] 2> TSTLookup queries: 50001, time[ms]: 641 [+- 6.46], ~kQPS: 78
[junit4:junit4] 2> FSTCompletionLookup queries: 50001, time[ms]: 118 [+- 2.95], ~kQPS: 424
[junit4:junit4] 2> WFSTCompletionLookup queries: 50001, time[ms]: 239 [+- 4.84], ~kQPS: 210
[junit4:junit4] 2> AnalyzingSuggester queries: 50001, time[ms]: 660 [+- 7.39], ~kQPS: 76
[junit4:junit4] OK 39.0s | LookupBenchmarkTest.testPerformanceOnPrefixes2_4
[junit4:junit4] 2> -- construction time
[junit4:junit4] 2> JaspellLookup input: 50001, time[ms]: 23 [+- 4.20]
[junit4:junit4] 2> TSTLookup input: 50001, time[ms]: 64 [+- 2.06]
[junit4:junit4] 2> FSTCompletionLookup input: 50001, time[ms]: 120 [+- 2.11]
[junit4:junit4] 2> WFSTCompletionLookup input: 50001, time[ms]: 88 [+- 1.09]
[junit4:junit4] 2> AnalyzingSuggester input: 50001, time[ms]: 245 [+- 27.85]
[junit4:junit4] OK 10.9s | LookupBenchmarkTest.testConstructionTime
[junit4:junit4] 2> -- prefixes: 100-200, num: 7, onlyMorePopular: true
[junit4:junit4] 2> JaspellLookup queries: 50001, time[ms]: 68 [+- 1.17], ~kQPS: 731
[junit4:junit4] 2> TSTLookup queries: 50001, time[ms]: 31 [+- 2.82], ~kQPS: 1617
[junit4:junit4] 2> FSTCompletionLookup queries: 50001, time[ms]: 141 [+- 1.97], ~kQPS: 354
[junit4:junit4] 2> WFSTCompletionLookup queries: 50001, time[ms]: 45 [+- 3.37], ~kQPS: 1099
[junit4:junit4] 2> AnalyzingSuggester queries: 50001, time[ms]: 233 [+- 4.02], ~kQPS: 215
[junit4:junit4] OK 11.1s | LookupBenchmarkTest.testPerformanceOnFullHits
[junit4:junit4] Completed in 77.54s, 5 tests
{noformat}
I added a second parameter (maxQueueDepth) to TopNSearcher, and fixed
WFSTSuggester to pass topN for it (this should recover most of its
performance). I also fixed AnalyzingSuggester: we can bound how big a
queue we need by the worst-case number of analyzed forms for a single
surface form. This is nice because if the analyzer doesn't create a
graph then we should get close to the same performance as before.
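
To make the bound concrete, here is a rough sketch of the reasoning. It is a
simplified model, not the code in the patch: the class QueueDepthSketch, its
Path type, and both methods are hypothetical names, and sorting plus
truncation stands in for the real bounded queue inside TopNSearcher. The
assumption is that each surface form is reachable through at most
maxAnalyzedFormsPerSurface analyzed paths, so a queue of
topN * maxAnalyzedFormsPerSurface paths always holds at least topN distinct
surface forms.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Lucene.
final class QueueDepthSketch {

  /** One completed path: the surface form it yields and its cost (lower is better). */
  static final class Path {
    final String surface;
    final int cost;

    Path(String surface, int cost) {
      this.surface = surface;
      this.cost = cost;
    }
  }

  /** Queue depth that is sufficient when each surface form has at most this many analyzed forms. */
  static int boundedQueueDepth(int topN, int maxAnalyzedFormsPerSurface) {
    return Math.multiplyExact(topN, maxAnalyzedFormsPerSurface);
  }

  /**
   * Models a bounded queue followed by surface-form dedup: keep only the
   * cheapest maxQueueDepth paths, then return up to topN distinct surface
   * forms in cost order.
   */
  static List<String> topSurfaces(List<Path> paths, int topN, int maxQueueDepth) {
    List<Path> sorted = new ArrayList<>(paths);
    sorted.sort(Comparator.comparingInt((Path p) -> p.cost));
    // Anything beyond the queue depth is pruned and can never be returned.
    List<Path> kept = sorted.subList(0, Math.min(maxQueueDepth, sorted.size()));
    Set<String> surfaces = new LinkedHashSet<>();
    for (Path p : kept) {
      if (surfaces.size() == topN) {
        break;
      }
      surfaces.add(p.surface); // duplicates of an already-seen surface form are dropped
    }
    return new ArrayList<>(surfaces);
  }
}
```

With paths (wifi, 1), (wifi, 2), (wireless, 3), (wiki, 4) and topN = 2, a
queue depth of 2 keeps only the two "wifi" paths and returns a single
suggestion, while boundedQueueDepth(2, 2) = 4 returns "wifi" and "wireless".
When no surface form has more than one analyzed form, the bound collapses to
topN, which is why the non-graph case should cost about the same as before.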
> AnalyzingSuggester may fail to return correct topN suggestions
> --------------------------------------------------------------
>
> Key: LUCENE-4481
> URL: https://issues.apache.org/jira/browse/LUCENE-4481
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 4.1, 5.0
>
> Attachments: LUCENE-4481.patch, LUCENE-4481.patch, LUCENE-4481.patch,
> LUCENE-4481.patch
>
>
> I hit this when working on LUCENE-4480.
> Because AnalyzingSuggester may prune some of the topN paths found by FST's
> Util.TopNSearcher, the queue size limit of topN makes the overall search
> inadmissible, i.e. it may incorrectly prune paths that would have led
> to a competitive path.
> However, such pruning is rare: it happens only for graph token streams, and
> even then only when competitive analyzed forms share the same surface forms.
> The simplest way to fix this is to make the queue unbounded, but this is
> likely a sizable performance hit ... I haven't tested yet. It's even
> possible the way the dups happen (always at the "end" of the suggestion,
> because we tack on a 0 byte followed by an ord dedup byte) prevents this bug
> from even occurring and so this could all be a false alarm! I have to try to
> make a test case showing it ...
> A cop-out solution would be to expose a separate queueSize or queueMultiplier
> (over the topN) so that if users are affected by this they could crank up the
> queue size or multiplier.