[ https://issues.apache.org/jira/browse/LUCENE-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-3729:
---------------------------------------
Attachment: LUCENE-3729.patch
Prototype patch just for testing...
As a quick viability test, I hacked FieldCacheImpl.DocTermsIndexImpl
to build an FST mapping term <-> ord, and changed the lookup method
to use the new Util.getByOutput method.
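
To make the term <-> ord idea concrete, here is a minimal plain-Java sketch. It is not Lucene's FST and not the code in the patch: it is an unminimized trie with a numeric output on each arc, and the class and method names (TermOrdTrie, add, getOrd) are hypothetical. Summing outputs along a term's path gives its ord, and the reverse lookup (named getByOutput here only to echo the method mentioned above) walks down by picking, at each node, the last arc whose output still fits in the remaining value. The sketch assumes terms are non-empty, unique, and added in sorted order.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy term <-> ord map: a trie with outputs on arcs (illustrative only, not Lucene's FST). */
final class TermOrdTrie {
    private static final class Arc {
        final long output;              // added to the running sum when this arc is followed
        final Node target = new Node();
        Arc(long output) { this.output = output; }
    }

    private static final class Node {
        final TreeMap<Character, Arc> arcs = new TreeMap<>(); // arcs kept in label order
        boolean isFinal;                // true if some term ends at this node
    }

    private final Node root = new Node();
    private long nextOrd = 0;

    /** Adds a term; its ord is its position in the (assumed sorted, unique) input. */
    void add(String term) {
        long ord = nextOrd++;
        long sum = 0;
        Node node = root;
        for (char c : term.toCharArray()) {
            Arc arc = node.arcs.get(c);
            if (arc == null) {
                // The first new arc carries whatever is still missing from the ord;
                // later new arcs for the same term then get output 0.
                arc = new Arc(ord - sum);
                node.arcs.put(c, arc);
            }
            sum += arc.output;
            node = arc.target;
        }
        node.isFinal = true;
    }

    /** term -> ord: sum the outputs along the term's path; -1 if the term is absent. */
    long getOrd(String term) {
        long sum = 0;
        Node node = root;
        for (char c : term.toCharArray()) {
            Arc arc = node.arcs.get(c);
            if (arc == null) return -1;
            sum += arc.output;
            node = arc.target;
        }
        return node.isFinal ? sum : -1;
    }

    /** ord -> term: at each node follow the last arc whose output is <= the remainder. */
    String getByOutput(long ord) {
        StringBuilder term = new StringBuilder();
        Node node = root;
        long remaining = ord;
        while (true) {
            if (node.isFinal && remaining == 0) return term.toString();
            Map.Entry<Character, Arc> best = null;
            for (Map.Entry<Character, Arc> e : node.arcs.entrySet()) {
                if (e.getValue().output > remaining) break; // outputs grow with the label
                best = e;
            }
            if (best == null) return null; // no term has this ord
            term.append(best.getKey());
            remaining -= best.getValue().output;
            node = best.getValue().target;
        }
    }
}
```

With the terms "ab", "abc", "b" added in that order, getOrd("abc") sums to 1 and getByOutput(2) walks back to "b". A real FST additionally shares suffixes and packs arcs into a byte array, which is where the RAM savings come from; the sketch only shows the lookup logic in both directions.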
Then I tested perf on 10M docs from Wikipedia:
{noformat}
Task            QPS base  StdDev base  QPS fstfc  StdDev fstfc   Pct diff
TermGroup1M        47.75         1.59      25.75          0.36   -48% - -43%
TermBGroup1M       17.10         0.58      14.20          0.37   -21% - -11%
PKLookup          158.73         6.07     155.84          3.00    -7% -   4%
TermTitleSort      43.49         2.54      42.73          1.84   -11% -   8%
Respell            81.13         3.24      80.67          3.83    -8% -   8%
Term              106.13         3.59     106.03          1.28    -4% -   4%
TermBGroup1M1P     25.31         0.44      25.37          0.54    -3% -   4%
Fuzzy2             55.32         1.21      55.76          2.55    -5% -   7%
Fuzzy1             74.06         1.21      74.88          2.80    -4% -   6%
SloppyPhrase        9.82         0.61       9.95          0.42    -8% -  12%
SpanNear            3.39         0.16       3.47          0.15    -6% -  12%
Phrase              9.29         0.69       9.66          0.69   -10% -  20%
Wildcard           20.15         0.66      21.23          0.46     0% -  11%
AndHighHigh        13.43         0.55      14.24          0.70    -3% -  15%
Prefix3            10.05         0.53      10.70          0.19     0% -  14%
AndHighMed         56.62         3.36      60.54          4.28    -6% -  21%
OrHighMed          25.78         0.98      27.75          1.51    -1% -  17%
OrHighHigh         10.97         0.41      11.82          0.63    -1% -  17%
IntNRQ              9.74         0.81      10.83          0.26     0% -  24%
{noformat}
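
For reading the Pct diff column: the ranges in the table are consistent with comparing the two QPS values at one StdDev apart in each direction and truncating toward zero. The sketch below reproduces them under that assumption. The class and method names are hypothetical, and the formula is inferred from the numbers above, not taken from the benchmark tool's source.

```java
/** Reproduces a "Pct diff" range from QPS and StdDev columns (formula inferred from the table). */
final class PctDiffRange {
    /** Returns {low, high}: percent change of the comparison vs. the baseline. */
    static long[] range(double baseQps, double baseStdDev, double compQps, double compStdDev) {
        // Pessimistic case: comparison one StdDev slow, baseline one StdDev fast.
        double worst = (compQps - compStdDev) / (baseQps + baseStdDev) - 1.0;
        // Optimistic case: comparison one StdDev fast, baseline one StdDev slow.
        double best = (compQps + compStdDev) / (baseQps - baseStdDev) - 1.0;
        // The cast truncates toward zero, which is what matches the table
        // (e.g. -48.5% shows up as -48%, not -49%).
        return new long[] { (long) (100.0 * worst), (long) (100.0 * best) };
    }
}
```

For the TermGroup1M row, range(47.75, 1.59, 25.75, 0.36) gives -48 and -43. A range that straddles zero (PKLookup, Term, Fuzzy1, ...) means the difference is within the measured noise.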
Two-pass grouping took a big hit... and single-pass grouping a moderate
hit... but TermTitleSort was a minor slowdown, which is good news.
The net RAM required across all segs for the title field FST was 30.2
MB, vs 46.5 MB for the current FieldCache terms storage (PagedBytes +
PackedInts), which is ~35% less.
The FST for the group-by fields used quite a bit more RAM (~60% more)
than PagedBytes + PackedInts, because these fields are actually
randomly generated unicode strings...
I didn't make the change to use the FST for term -> ord lookup (have
to fix the binarySearchLookup method), but we really should do this
"for real" because right now it's doing an unnecessary binary search
(repeated ord -> term lookups). I.e., perf should be better than the
numbers above... grouping is a heavier user of binarySearchLookup than
sorting, so it should help recover some of that slowdown.
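
To illustrate why the binary search is costly when ord -> term is the only primitive available, here is a minimal sketch. It is not Lucene's binarySearchLookup; the class name and the IntFunction standing in for the ord -> term lookup are assumptions. Each probe resolves one ord back to its term before comparing, so a lookup costs about log2(numTerms) ord -> term resolutions, whereas a direct term -> ord walk of the FST would need none.

```java
import java.util.function.IntFunction;

/** Binary search over ords where each probe needs an ord -> term lookup (illustrative only). */
final class OrdBinarySearch {
    /**
     * Returns the ord of key among numTerms sorted terms,
     * or -(insertionPoint) - 1 if the key is absent.
     */
    static int lookup(IntFunction<String> ordToTerm, int numTerms, String key) {
        int low = 0;
        int high = numTerms - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            // One ord -> term resolution per probe; with an FST behind ordToTerm,
            // this is a full by-output walk each time.
            int cmp = ordToTerm.apply(mid).compareTo(key);
            if (cmp < 0) {
                low = mid + 1;
            } else if (cmp > 0) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -(low + 1);
    }
}
```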
Also, Util.getByOutput currently doesn't optimize for array'd
arcs... so if we fix that we should get some small perf gain.
To do this "for real" I think we should do it only with DocValues,
because the FST build time is relatively costly.
> Allow using FST to hold terms data in DocValues.BYTES_*_SORTED
> --------------------------------------------------------------
>
> Key: LUCENE-3729
> URL: https://issues.apache.org/jira/browse/LUCENE-3729
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Michael McCandless
> Attachments: LUCENE-3729.patch
>
>