[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

Han Jiang (JIRA) Wed, 31 Jul 2013 00:14:45 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13724955#comment-13724955 ]


Han Jiang commented on LUCENE-3069:
-----------------------------------

Performance results after the latest patch (intersect) is applied.

On the wiki 33M data set, comparing TempFST (with intersect) as base against 
TempFSTOrd (with intersect) as comp:
{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                PKLookup      232.47      (1.0%)      205.28      (2.0%)  -11.7% ( -14% -   -8%)
                 Prefix3       26.93      (1.2%)       28.40      (1.4%)    5.5% (   2% -    8%)
                Wildcard        6.75      (2.1%)        7.37      (1.5%)    9.2% (   5% -   13%)
                  Fuzzy1       29.86      (1.8%)       51.87      (3.7%)   73.7% (  67% -   80%)
                  Fuzzy2       30.82      (1.6%)       53.82      (2.7%)   74.7% (  69% -   80%)
                 Respell       27.30      (1.2%)       49.55      (2.6%)   81.5% (  76% -   86%)
{noformat}
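
For anyone reading the table, here is a minimal sketch of how the "Pct diff" column relates to the two QPS columns, assuming it is simply the relative change of comp against base. The class and method names (PctDiff, pctDiff) are made up for illustration and are not part of the benchmark tooling; the sketch does not try to reproduce the range shown in parentheses.

```java
import java.util.Locale;

public class PctDiff {
    // Relative change of comp versus base, in percent.
    // Assumption: this is how the "Pct diff" column is derived.
    static double pctDiff(double qpsBase, double qpsComp) {
        return 100.0 * (qpsComp - qpsBase) / qpsBase;
    }

    public static void main(String[] args) {
        // QPS values copied from the TempFST vs TempFSTOrd table.
        String[] tasks = {"PKLookup", "Fuzzy1"};
        double[] base = {232.47, 29.86};
        double[] comp = {205.28, 51.87};
        for (int i = 0; i < tasks.length; i++) {
            System.out.println(String.format(Locale.ROOT, "%-10s %6.1f%%",
                    tasks[i], pctDiff(base[i], comp[i])));
        }
    }
}
```

A negative value means comp is slower than base, a positive value means it is faster.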

So decoding the outputs is really the main bottleneck.

And now we should start comparing against trunk (base=Lucene41, 
comp=TempFSTOrd).
Hmm, I must have done something wrong with the wildcard query here:

{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                Wildcard       19.21      (2.1%)        7.30      (0.3%)  -62.0% ( -63% -  -60%)
                 Prefix3       33.69      (1.2%)       28.18      (0.9%)  -16.4% ( -18% -  -14%)
                  Fuzzy1       61.59      (2.1%)       52.36      (0.8%)  -15.0% ( -17% -  -12%)
                  Fuzzy2       60.94      (1.0%)       54.15      (1.3%)  -11.1% ( -13% -   -8%)
                 Respell       54.21      (2.8%)       49.54      (1.2%)   -8.6% ( -12% -   -4%)
                PKLookup      148.40      (1.0%)      208.07      (3.6%)   40.2% (  35% -   45%)
{noformat}

I'll commit the current version so we can iterate on it.
                
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 5.0, 4.5
>
>         Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.
