[jira] [Commented] (LUCENE-2962) Skip data should be inlined into the postings lists

Han Jiang (JIRA) Mon, 22 Apr 2013 06:01:21 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637983#comment-13637983 ]


Han Jiang commented on LUCENE-2962:
-----------------------------------

Here is a full summary of skip frequency and skip length for 
wikimedium.10M.nostopwords.tasks and for part of 
crazyRandomMinShouldMatch.tasks. The latter one is really crazy :)
'skip_len' is counted as (newDocUpto-docUpto) in 
Lucene41PostingsReader.*Enum.advance(target): it is 0 when the skip doesn't 
move, otherwise the number of docs skipped. I changed the code in luceneutil 
so that every query line is taken into account:

#query_category      #num_query  #num_called  #max_skip_len  #tot_skip_len  #avg_skip_len  #std_dev_skip_len
and_high_high:       500         18021935     14633          110997027      6.158996       25.283280
and_high_med:        500         9145928      22730          233710779      25.553534      61.885853
and_high_low:        500         1385125      215533         1073755035     775.204429     1606.345686
high_phrase:         42          253569       3284           5113544        20.166282      56.904256
high_sloppy_phrase:  42          2441007      3284           11993572       4.913371       23.253660
high_span_near:      42          2362258      3284           11846707       5.014993       23.604965
low_phrase:          500         6936508      21180          247018573      35.611373      103.751734
low_sloppy_phrase:   500         18170618     21180          298025713      16.401518      66.808480
low_span_near:       500         18100056     21180          296733920      16.394089      66.895263
med_phrase:          500         4513849      26367          144556764      32.025166      83.814376
med_sloppy_phrase:   500         17683175     26367          197756027      11.183287      45.764898
med_span_near:       500         17503372     26367          196409780      11.221254      45.958612
10terms_0high_2msm:  22          10875        32768          2502731        230.136184     1319.894640
10terms_0high_3msm:  17          17127        15743          440149         25.699130      209.870841
10terms_0high_4msm:  27          27144        24192          2156919        79.462091      640.948445
10terms_0high_5msm:  21          19564        26479          1829846        93.531282      773.820054
10terms_0high_6msm:  27          17555        31232          1615071        92.000627      745.978516
10terms_0high_7msm:  27          16618        18688          1208893        72.745998      505.996915
10terms_0high_8msm:  25          10722        17024          817872         76.279799      451.833907
10terms_0high_9msm:  16          5371         11008          411776         76.666543      353.098379
10terms_0high_10msm: 21          10403        32768          7325395        704.161780     2504.260576
10terms_5high_2msm:  24          650096       2163           1832245        2.818422       18.513591
10terms_5high_5msm:  24          1123877      276224         128339073      114.193166     936.887693
10terms_5high_10msm: 24          14211        1663232        322730000      22709.872634   115150.031194
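
For readers who want to reproduce the columns above, here is a minimal sketch of how such per-category statistics could be accumulated. It is not the actual luceneutil change: the class name SkipLenStats and its methods are hypothetical, and it assumes a population standard deviation, which the comment does not specify. The only part taken from the comment is that skip_len is (newDocUpto - docUpto) per advance(target) call, with 0 when the skip doesn't move.

```java
/**
 * Hypothetical accumulator for skip-length statistics (not part of Lucene
 * or luceneutil). One instance would be kept per query category.
 */
final class SkipLenStats {
    private long numCalled;    // number of advance() calls recorded
    private long maxSkipLen;   // largest single skip seen
    private long totSkipLen;   // sum of all skip lengths
    private double sumSquares; // sum of squared skip lengths, for the std dev

    /**
     * Records one advance() call. docUpto is the position before skipping,
     * newDocUpto the position after; equal values mean the skip didn't move.
     */
    void record(int docUpto, int newDocUpto) {
        long skipLen = (long) newDocUpto - docUpto;
        numCalled++;
        totSkipLen += skipLen;
        sumSquares += (double) skipLen * skipLen;
        if (skipLen > maxSkipLen) {
            maxSkipLen = skipLen;
        }
    }

    long numCalled() { return numCalled; }
    long maxSkipLen() { return maxSkipLen; }
    long totSkipLen() { return totSkipLen; }

    /** Average skip length: tot_skip_len / num_called (0 if nothing recorded). */
    double avg() {
        return numCalled == 0 ? 0.0 : (double) totSkipLen / numCalled;
    }

    /** Population standard deviation (an assumption; see the note above). */
    double stdDev() {
        if (numCalled == 0) {
            return 0.0;
        }
        double mean = avg();
        double variance = sumSquares / numCalled - mean * mean;
        return Math.sqrt(Math.max(0.0, variance)); // clamp rounding noise
    }
}
```

As a consistency check against the table, and_high_high has tot_skip_len / num_called = 110997027 / 18021935, which is about 6.159 and matches the avg_skip_len column.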

This led me to test whether a multi-level skip structure is really necessary 
for simpler queries like AndQuery & PhraseQuery.
So I set skipMultiplier=8000000 to make sure that Lucene41SkipWriter won't 
create a skip list with more than one level; this run is marked as 'comp'. 
A clean trunk (skipMultiplier=8) is used as 'base':

                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
               LowPhrase       34.86      (2.7%)       32.23      (1.6%)   -7.5% ( -11% -   -3%)
                 LowTerm      335.88      (8.8%)      326.09      (7.8%)   -2.9% ( -17% -   14%)
            HighSpanNear        7.05      (2.2%)        6.97      (0.7%)   -1.1% (  -3% -    1%)
              AndHighMed       52.22      (1.3%)       51.72      (1.0%)   -1.0% (  -3% -    1%)
             MedSpanNear        4.30      (2.1%)        4.26      (0.7%)   -0.8% (  -3% -    2%)
             LowSpanNear       42.46      (1.7%)       42.28      (0.6%)   -0.4% (  -2% -    1%)
                  Fuzzy2       59.56      (5.4%)       59.31      (4.7%)   -0.4% (  -9% -   10%)
         LowSloppyPhrase       10.33      (2.6%)       10.30      (2.5%)   -0.3% (  -5% -    4%)
             AndHighHigh       18.37      (0.6%)       18.33      (0.3%)   -0.2% (  -1% -    0%)
                  Fuzzy1       53.70      (5.3%)       53.59      (5.2%)   -0.2% ( -10% -   10%)
              HighPhrase        2.56      (6.5%)        2.56      (5.6%)   -0.2% ( -11% -   12%)
                HighTerm       57.36     (15.2%)       57.34     (15.0%)   -0.0% ( -26% -   35%)
                 MedTerm       90.08     (13.9%)       90.30     (13.6%)    0.2% ( -23% -   32%)
                  IntNRQ        2.82     (13.3%)        2.83     (11.7%)    0.3% ( -21% -   29%)
               MedPhrase       15.18      (8.6%)       15.23      (8.4%)    0.3% ( -15% -   19%)
         MedSloppyPhrase        2.17      (4.2%)        2.18      (3.7%)    0.6% (  -6% -    8%)
               OrHighMed       20.30     (14.5%)       20.47     (14.4%)    0.8% ( -24% -   34%)
                Wildcard       21.53      (5.6%)       21.71      (4.9%)    0.8% (  -9% -   12%)
               OrHighLow       17.26     (15.0%)       17.43     (15.0%)    1.0% ( -25% -   36%)
        HighSloppyPhrase        8.31      (4.2%)        8.39      (4.5%)    1.0% (  -7% -   10%)
                 Prefix3       22.70      (5.8%)       22.93      (5.1%)    1.0% (  -9% -   12%)
              OrHighHigh       15.51     (14.7%)       15.69     (14.8%)    1.2% ( -24% -   35%)
                 Respell       41.39      (3.4%)       42.01      (3.3%)    1.5% (  -5% -    8%)
              AndHighLow      459.48      (2.3%)      468.22      (2.1%)    1.9% (  -2% -    6%)
                PKLookup      251.05      (4.4%)      259.80      (2.8%)    3.5% (  -3% -   11%)
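
To illustrate why skipMultiplier=8000000 collapses the skip list to a single level, here is a rough sketch of how the number of levels could be derived from the document frequency. It is modeled on my recollection of the level computation in Lucene's multi-level skip list writer, so treat the formula as an assumption rather than the exact trunk code. The class SkipLevels and its methods are hypothetical names, and the skip interval of 128 and the cap of 10 levels used in the usage note are likewise assumed values.

```java
/** Hypothetical helper; not an actual Lucene class. */
final class SkipLevels {

    /** Returns floor(log_base(x)) for x >= 1, using integer division only. */
    static int floorLog(long x, int base) {
        if (base < 2) {
            throw new IllegalArgumentException("base must be >= 2");
        }
        int result = 0;
        while (x >= base) {
            x /= base;
            result++;
        }
        return result;
    }

    /**
     * Assumed rule: level 0 holds one entry every skipInterval docs, and each
     * higher level holds one entry for every skipMultiplier entries below it.
     */
    static int numberOfSkipLevels(int skipInterval, int skipMultiplier,
                                  int maxSkipLevels, int docFreq) {
        int levels;
        if (skipInterval >= docFreq) {
            levels = 1; // too few docs for more than one level
        } else {
            levels = 1 + floorLog(docFreq / skipInterval, skipMultiplier);
        }
        return Math.min(levels, maxSkipLevels);
    }
}
```

Under these assumptions, a term with docFreq=10000000 gets numberOfSkipLevels(128, 8, 10, 10000000) = 6 levels with the default multiplier, but only 1 level with skipMultiplier=8000000, since 10000000/128 = 78125 is already smaller than the multiplier.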
                
> Skip data should be inlined into the postings lists
> ---------------------------------------------------
>
>                 Key: LUCENE-2962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>              Labels: gsoc2013
>         Attachments: proposal.txt
>
>
> Today, we store all skip data as a separate blob at the end of a given term's 
> postings (if that term occurs in enough docs to warrant skip data).
> But this adds overhead during decoding -- we have to seek to a different 
> place for the initial load, we have to init separate readers, we have to seek 
> again while using the lower levels of the skip data, etc.  Also, we have to 
> fully decode all skip information even if we are not going to use it (eg if I 
> only want docIDs, I still must decode position offset and lastPayloadLength).
> If instead we interleaved skip data into the postings file, we could keep it 
> local, and "private" to each file that needs skipping.  This should make it 
> less costly to init and then use the skip data, which'd be a good perf gain 
> for eg PhraseQuery, AndQuery.

