[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13724955#comment-13724955 ]
Han Jiang commented on LUCENE-3069:
-----------------------------------
Performance results after applying the latest patch (intersect).
On wiki 33M data, comparing TempFST (with intersect) against TempFSTOrd (with
intersect):
{noformat}
Task       QPS base  StdDev  QPS comp  StdDev  Pct diff
PKLookup     232.47  (1.0%)    205.28  (2.0%)    -11.7% ( -14% -   -8%)
Prefix3       26.93  (1.2%)     28.40  (1.4%)      5.5% (   2% -    8%)
Wildcard       6.75  (2.1%)      7.37  (1.5%)      9.2% (   5% -   13%)
Fuzzy1        29.86  (1.8%)     51.87  (3.7%)     73.7% (  67% -   80%)
Fuzzy2        30.82  (1.6%)     53.82  (2.7%)     74.7% (  69% -   80%)
Respell       27.30  (1.2%)     49.55  (2.6%)     81.5% (  76% -   86%)
{noformat}
So decoding the outputs is really the main cost.
Now we should start comparing it with trunk (base=Lucene41,
comp=TempFSTOrd).
Hmm, I must have done something wrong with the wildcard query here:
{noformat}
Task       QPS base  StdDev  QPS comp  StdDev  Pct diff
Wildcard      19.21  (2.1%)      7.30  (0.3%)    -62.0% ( -63% -  -60%)
Prefix3       33.69  (1.2%)     28.18  (0.9%)    -16.4% ( -18% -  -14%)
Fuzzy1        61.59  (2.1%)     52.36  (0.8%)    -15.0% ( -17% -  -12%)
Fuzzy2        60.94  (1.0%)     54.15  (1.3%)    -11.1% ( -13% -   -8%)
Respell       54.21  (2.8%)     49.54  (1.2%)     -8.6% ( -12% -   -4%)
PKLookup     148.40  (1.0%)    208.07  (3.6%)     40.2% (  35% -   45%)
{noformat}
I'll commit the current version so we can iterate on it.
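For anyone reading the tables above: the "Pct diff" column appears to be the relative change of the comp QPS against the base QPS. Here is a minimal sketch that recomputes it for the trunk comparison, assuming that definition. The helper name `pct_diff` is only illustrative and is not taken from the benchmark tooling, and since the QPS values in the tables are already rounded, the recomputed numbers can differ from the reported ones in the last digit.

```python
def pct_diff(base_qps: float, comp_qps: float) -> float:
    """Relative change of comp versus base, in percent (assumed definition)."""
    return 100.0 * (comp_qps - base_qps) / base_qps


# (task, base QPS, comp QPS, reported Pct diff), copied from the
# base=Lucene41 / comp=TempFSTOrd table.
ROWS = [
    ("Wildcard", 19.21, 7.30, -62.0),
    ("Prefix3", 33.69, 28.18, -16.4),
    ("Fuzzy1", 61.59, 52.36, -15.0),
    ("Fuzzy2", 60.94, 54.15, -11.1),
    ("Respell", 54.21, 49.54, -8.6),
    ("PKLookup", 148.40, 208.07, 40.2),
]

for task, base, comp, reported in ROWS:
    computed = pct_diff(base, comp)
    print(f"{task:<9} computed {computed:+6.1f}%  reported {reported:+6.1f}%")
```

A negative value means comp is slower than base for that task. The range in parentheses also depends on the StdDev columns and is not reproduced here.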
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index, core/search
> Affects Versions: 4.0-ALPHA
> Reporter: Simon Willnauer
> Assignee: Han Jiang
> Labels: gsoc2013
> Fix For: 5.0, 4.5
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch,
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch,
> LUCENE-3069.patch, LUCENE-3069.patch
>
>
> FST-based TermDictionary has been a great improvement, yet it still uses a
> delta codec file for scanning to terms. Some environments have enough memory
> available to keep the entire FST-based term dict in memory. We should add a
> TermDictionary implementation that encodes all needed information for each
> term into the FST (custom fst.Output) and builds an FST from the entire term,
> not just the delta.