[jira] [Commented] (LUCENE-4198) Allow codecs to index term impacts

Adrien Grand (JIRA) Fri, 12 Jan 2018 07:15:13 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324101#comment-16324101 ]


Adrien Grand commented on LUCENE-4198:
--------------------------------------

I also tested wikibigall, which, unlike wikimedium, does not have artificially 
truncated document lengths:

{noformat}
                    Task  QPS baseline      StdDev   QPS patch      StdDev                Pct diff
              AndHighLow       1440.24      (3.0%)      794.43      (2.9%)  -44.8% ( -49% -  -40%)
              AndHighMed        121.80      (1.4%)       94.75      (1.5%)  -22.2% ( -24% -  -19%)
             AndHighHigh         56.62      (1.2%)       45.26      (1.4%)  -20.1% ( -22% -  -17%)
               OrHighMed         93.16      (3.3%)       78.18      (3.1%)  -16.1% ( -21% -   -9%)
               OrHighLow        827.62      (2.6%)      748.49      (3.5%)   -9.6% ( -15% -   -3%)
              OrHighHigh         35.14      (4.4%)       32.25      (4.6%)   -8.2% ( -16% -    0%)
                  Fuzzy1        265.67      (4.7%)      246.12      (5.0%)   -7.4% ( -16% -    2%)
               LowPhrase        166.32      (1.3%)      157.61      (1.6%)   -5.2% (  -8% -   -2%)
                  Fuzzy2        184.41      (4.3%)      176.40      (3.5%)   -4.3% ( -11% -    3%)
             LowSpanNear        749.77      (2.1%)      726.14      (2.2%)   -3.2% (  -7% -    1%)
               MedPhrase         23.77      (2.0%)       23.14      (1.9%)   -2.6% (  -6% -    1%)
              HighPhrase         18.73      (3.0%)       18.24      (3.0%)   -2.6% (  -8% -    3%)
             MedSpanNear        113.11      (2.3%)      110.17      (2.0%)   -2.6% (  -6% -    1%)
         MedSloppyPhrase         10.28      (6.5%)       10.07      (6.9%)   -2.0% ( -14% -   12%)
         LowSloppyPhrase         12.68      (6.6%)       12.43      (7.1%)   -2.0% ( -14% -   12%)
        HighSloppyPhrase          9.47      (7.0%)        9.29      (7.5%)   -1.9% ( -15% -   13%)
                  IntNRQ         27.89      (7.0%)       27.58      (8.7%)   -1.1% ( -15% -   15%)
            HighSpanNear          9.05      (4.9%)        8.98      (4.7%)   -0.8% (  -9% -    9%)
                 Respell        273.80      (2.3%)      273.79      (2.2%)   -0.0% (  -4% -    4%)
       HighTermMonthSort         68.77      (7.1%)       69.60      (7.8%)    1.2% ( -12% -   17%)
                Wildcard         92.81      (5.8%)       94.67      (6.2%)    2.0% (  -9% -   14%)
   HighTermDayOfYearSort         61.99     (10.3%)       64.18     (10.9%)    3.5% ( -16% -   27%)
                 Prefix3         41.42      (8.3%)       42.96      (8.2%)    3.7% ( -11% -   22%)
                 LowTerm        694.99      (2.5%)     3126.69     (17.7%)  349.9% ( 321% -  379%)
                HighTerm         58.04      (2.7%)      490.60     (58.6%)  745.3% ( 666% -  828%)
                 MedTerm        120.80      (2.6%)     1053.44     (55.1%)  772.1% ( 695% -  852%)
{noformat}
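
For readers checking the numbers, the central value in the Pct diff column appears to be the relative change in QPS from baseline to patch. The short Python sketch below recomputes it for a few rows copied from the table. The function name {{pct_diff}} is only illustrative, and the sketch does not try to reproduce the range shown in parentheses.

```python
def pct_diff(baseline_qps: float, patch_qps: float) -> float:
    """Relative QPS change from baseline to patch, in percent."""
    return 100.0 * (patch_qps - baseline_qps) / baseline_qps


# (QPS baseline, QPS patch) pairs copied from the table above.
rows = {
    "AndHighLow": (1440.24, 794.43),
    "LowTerm": (694.99, 3126.69),
    "HighTerm": (58.04, 490.60),
    "MedTerm": (120.80, 1053.44),
}

for task, (baseline, patch) in rows.items():
    print(f"{task:>12} {pct_diff(baseline, patch):+7.1f}%")
```

A negative value means the patch is slower than the baseline for that task, a positive value means it is faster.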

The {{.doc}} file is 5.2% larger, and the index is 1.5% larger overall.

> Allow codecs to index term impacts
> ----------------------------------
>
>                 Key: LUCENE-4198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4198
>             Project: Lucene - Core
>          Issue Type: Sub-task
>          Components: core/index
>            Reporter: Robert Muir
>         Attachments: LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, 
> LUCENE-4198.patch, LUCENE-4198_flush.patch
>
>
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (though his 
> implementation currently stores a max for the entire term, the problem is the 
> same).
> We can imagine other similar algorithms too: I think the codec API should be 
> able to support these.
> Currently it really doesn't: Stefan worked around the problem by providing a 
> tool to 'rewrite' your index, he passes the IndexReader and Similarity to it. 
> But it would be better if we fixed the codec API.
> One problem is that the Postings writer needs to have access to the 
> Similarity. Another problem is that it needs access to the term and 
> collection statistics up front, rather than after the fact.
> This might have some cost (hopefully minimal), so I'm thinking to experiment 
> in a branch with these changes and see if we can make it work well.
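
To illustrate why the description says the postings writer would need the Similarity and the statistics up front, here is a minimal Python sketch. It is not the code from the attached patches and not Lucene's API: {{Block}}, {{make_scorer}}, {{write_blocks}} and {{top_hit}} are hypothetical names, and the scoring formula is a toy stand-in for a Similarity. The writer stores a maximum score per block of postings, which it can only do if the scoring function (and the average field length it depends on) is known at write time. The reader then skips blocks whose stored maximum cannot beat the best hit so far.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Posting = Tuple[int, int, int]       # (doc_id, term_freq, field_length)
Scorer = Callable[[int, int], float]  # (term_freq, field_length) -> score


@dataclass
class Block:
    postings: List[Posting]
    max_score: float  # upper bound on any score in this block, set at write time


def make_scorer(avg_field_length: float) -> Scorer:
    """Toy scorer (an assumption, not a real Similarity); needs a collection statistic."""
    def score(term_freq: int, field_length: int) -> float:
        return term_freq / (term_freq + field_length / avg_field_length)
    return score


def write_blocks(postings: List[Posting], score: Scorer, block_size: int = 2) -> List[Block]:
    """Write-time step: group postings into blocks and record each block's max score."""
    blocks = []
    for start in range(0, len(postings), block_size):
        chunk = postings[start:start + block_size]
        blocks.append(Block(chunk, max(score(tf, length) for _, tf, length in chunk)))
    return blocks


def top_hit(blocks: List[Block], score: Scorer) -> Tuple[Optional[Tuple[int, float]], int]:
    """Search-time step: find the best document, skipping non-competitive blocks."""
    best: Optional[Tuple[int, float]] = None
    scored = 0  # number of postings actually scored
    for block in blocks:
        if best is not None and block.max_score <= best[1]:
            continue  # nothing in this block can beat the current best hit
        for doc_id, tf, length in block.postings:
            scored += 1
            s = score(tf, length)
            if best is None or s > best[1]:
                best = (doc_id, s)
    return best, scored


postings = [(1, 5, 10), (2, 1, 20), (3, 1, 40), (4, 2, 40), (5, 1, 10), (6, 1, 30)]
score = make_scorer(avg_field_length=25.0)
blocks = write_blocks(postings, score)
best, scored = top_hit(blocks, score)
print(best, scored)
```

In this toy run only the first block is scored, because the stored maxima of the remaining blocks are lower than the score of the first hit.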



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

