[jira] [Commented] (LUCENE-6030) Add norms patched compression which uses table for most common values

Ryan Ernst (JIRA) Tue, 28 Oct 2014 11:46:49 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-6030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187265#comment-14187265
 ]


Ryan Ernst commented on LUCENE-6030:
------------------------------------

I've done some performance tests with luceneutil and the numbers are ok, but 
not great.  HotSpot seems to get confused sometimes, leading to a QPS decline.

On Java 7, using wikimedium10m:
{noformat}
                   Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff
            OrNotHighMed       57.06      (8.9%)       49.56      (5.4%)  -13.2% ( -25% -    1%)
            OrNotHighLow      121.53      (8.8%)      106.02      (5.7%)  -12.8% ( -25% -    1%)
           OrNotHighHigh       58.92      (8.9%)       51.42      (5.4%)  -12.7% ( -24% -    1%)
           OrHighNotHigh       68.08      (8.9%)       59.50      (5.5%)  -12.6% ( -24% -    1%)
              OrHighHigh       25.97      (8.7%)       22.73      (5.3%)  -12.5% ( -24% -    1%)
            OrHighNotLow       90.21      (8.8%)       80.24      (6.2%)  -11.1% ( -23% -    4%)
                HighTerm      126.83      (1.8%)      112.85      (1.9%)  -11.0% ( -14% -   -7%)
               OrHighLow      104.86      (8.8%)       93.32      (5.9%)  -11.0% ( -23% -    4%)
            OrHighNotMed      109.46      (8.3%)      100.87      (6.0%)   -7.8% ( -20% -    7%)
                 MedTerm      200.05      (1.7%)      187.49      (1.8%)   -6.3% (  -9% -   -2%)
               OrHighMed      118.77      (8.0%)      113.79      (6.3%)   -4.2% ( -17% -   10%)
                 Prefix3       82.16      (3.1%)       81.47      (4.4%)   -0.8% (  -8% -    6%)
            HighSpanNear       14.16      (3.8%)       14.05      (4.1%)   -0.8% (  -8% -    7%)
                  IntNRQ       11.53      (4.9%)       11.44      (6.4%)   -0.8% ( -11% -   11%)
              HighPhrase        3.70     (14.2%)        3.67     (14.2%)   -0.7% ( -25% -   32%)
        HighSloppyPhrase        4.46      (6.7%)        4.43      (6.1%)   -0.7% ( -12% -   12%)
                  Fuzzy2       81.39      (2.5%)       81.43      (2.4%)    0.0% (  -4% -    5%)
              AndHighLow     1104.54      (1.7%)     1105.90      (3.0%)    0.1% (  -4% -    4%)
                Wildcard       42.71      (3.9%)       42.76      (3.6%)    0.1% (  -7% -    7%)
                 Respell       74.16      (2.4%)       74.33      (1.9%)    0.2% (  -3% -    4%)
             MedSpanNear       24.58      (3.3%)       24.69      (3.3%)    0.5% (  -5% -    7%)
               LowPhrase       44.89      (2.1%)       45.17      (2.3%)    0.6% (  -3% -    5%)
                  Fuzzy1       98.83      (2.5%)       99.49      (2.5%)    0.7% (  -4% -    5%)
               MedPhrase      107.99      (6.0%)      109.06      (6.0%)    1.0% ( -10% -   13%)
         MedSloppyPhrase       19.96      (3.0%)       20.24      (3.3%)    1.4% (  -4% -    8%)
             LowSpanNear       37.75      (3.4%)       38.38      (3.5%)    1.7% (  -5% -    8%)
         LowSloppyPhrase       31.39      (2.8%)       31.98      (3.2%)    1.9% (  -4% -    8%)
             AndHighHigh       62.62      (1.0%)       64.48      (1.6%)    3.0% (   0% -    5%)
              AndHighMed      187.48      (1.0%)      193.88      (1.6%)    3.4% (   0% -    6%)
                 LowTerm      772.23      (2.9%)      970.78      (6.8%)   25.7% (  15% -   36%)
{noformat}

On Java 8, the decline is less pronounced:
{noformat}
                   Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff
                HighTerm      107.28      (4.2%)       92.63      (3.0%)  -13.7% ( -19% -   -6%)
            OrNotHighLow      103.14     (10.2%)       94.37      (4.9%)   -8.5% ( -21% -    7%)
            OrNotHighMed      103.75     (10.8%)       95.47      (5.3%)   -8.0% ( -21% -    9%)
           OrNotHighHigh       39.62     (11.9%)       36.56      (6.3%)   -7.7% ( -23% -   11%)
           OrHighNotHigh       31.88     (12.9%)       29.51      (7.1%)   -7.4% ( -24% -   14%)
              OrHighHigh       26.44     (13.6%)       24.59      (7.9%)   -7.0% ( -25% -   16%)
               OrHighLow       74.93     (14.5%)       70.41      (8.7%)   -6.0% ( -25% -   20%)
            OrHighNotLow      106.31     (14.0%)      101.20      (8.7%)   -4.8% ( -24% -   20%)
            OrHighNotMed       59.98     (13.5%)       57.84      (8.5%)   -3.6% ( -22% -   21%)
              HighPhrase       78.65      (5.1%)       76.22      (4.5%)   -3.1% ( -12% -    6%)
        HighSloppyPhrase       18.62      (6.5%)       18.32      (4.7%)   -1.6% ( -12% -   10%)
               OrHighMed       79.70     (13.3%)       78.73      (9.0%)   -1.2% ( -20% -   24%)
               MedPhrase       26.06      (3.4%)       25.94      (3.1%)   -0.5% (  -6% -    6%)
                  Fuzzy2      114.17      (3.4%)      113.86      (3.5%)   -0.3% (  -6% -    6%)
            HighSpanNear       27.20      (6.2%)       27.21      (5.0%)    0.0% ( -10% -   11%)
               LowPhrase       36.88      (2.1%)       36.95      (2.1%)    0.2% (  -4% -    4%)
                  Fuzzy1      136.96      (3.2%)      137.26      (3.5%)    0.2% (  -6% -    7%)
              AndHighLow     1517.11      (4.2%)     1523.95      (4.1%)    0.5% (  -7% -    9%)
                 Respell       87.37      (2.8%)       87.85      (2.6%)    0.5% (  -4% -    6%)
         LowSloppyPhrase       63.60      (4.2%)       64.10      (3.5%)    0.8% (  -6% -    8%)
                Wildcard       20.92      (4.7%)       21.09      (3.2%)    0.8% (  -6% -    9%)
                 MedTerm      359.22      (3.1%)      362.24      (3.0%)    0.8% (  -5% -    7%)
             MedSpanNear       14.74      (4.5%)       14.90      (4.3%)    1.0% (  -7% -   10%)
                 Prefix3       51.84      (6.8%)       52.41      (5.0%)    1.1% (  -9% -   13%)
                  IntNRQ       12.60      (8.0%)       12.79      (5.8%)    1.5% ( -11% -   16%)
              AndHighMed      338.81      (1.5%)      345.34      (1.5%)    1.9% (  -1% -    5%)
         MedSloppyPhrase       60.72      (6.1%)       61.97      (5.1%)    2.1% (  -8% -   14%)
             AndHighHigh       77.59      (1.4%)       80.17      (1.4%)    3.3% (   0% -    6%)
             LowSpanNear      215.18      (5.4%)      223.41      (4.4%)    3.8% (  -5% -   14%)
                 LowTerm     1043.18      (5.0%)     1123.42      (5.9%)    7.7% (  -2% -   19%)
{noformat}

However, the patch has a large impact on index size.  For wikimedium10m, the 
total size of the norms data (.nvd files) dropped by about half, from 9.5M to 
4.9M:
{noformat}
rjernst@codex:~/code/ls-util$ du -cksh indices/wikimedium10m.trunk.Lucene50.nd10M/index/*.nvd
1.8M    indices/wikimedium10m.trunk.Lucene50.nd10M/index/_32.nvd
1.8M    indices/wikimedium10m.trunk.Lucene50.nd10M/index/_65.nvd
1.8M    indices/wikimedium10m.trunk.Lucene50.nd10M/index/_98.nvd
1.8M    indices/wikimedium10m.trunk.Lucene50.nd10M/index/_cb.nvd
1.8M    indices/wikimedium10m.trunk.Lucene50.nd10M/index/_fe.nvd
180K    indices/wikimedium10m.trunk.Lucene50.nd10M/index/_fp.nvd
180K    indices/wikimedium10m.trunk.Lucene50.nd10M/index/_g0.nvd
180K    indices/wikimedium10m.trunk.Lucene50.nd10M/index/_gb.nvd
92K     indices/wikimedium10m.trunk.Lucene50.nd10M/index/_gm.nvd
180K    indices/wikimedium10m.trunk.Lucene50.nd10M/index/_gx.nvd
20K     indices/wikimedium10m.trunk.Lucene50.nd10M/index/_gy.nvd
12K     indices/wikimedium10m.trunk.Lucene50.nd10M/index/_gz.nvd
12K     indices/wikimedium10m.trunk.Lucene50.nd10M/index/_h0.nvd
12K     indices/wikimedium10m.trunk.Lucene50.nd10M/index/_h1.nvd
12K     indices/wikimedium10m.trunk.Lucene50.nd10M/index/_h2.nvd
4.0K    indices/wikimedium10m.trunk.Lucene50.nd10M/index/_h3.nvd
9.5M    total

 du -cksh indices/wikimedium10m.patch.Lucene50.nd10M/index/*.nvd
880K    indices/wikimedium10m.patch.Lucene50.nd10M/index/_32.nvd
880K    indices/wikimedium10m.patch.Lucene50.nd10M/index/_65.nvd
880K    indices/wikimedium10m.patch.Lucene50.nd10M/index/_98.nvd
880K    indices/wikimedium10m.patch.Lucene50.nd10M/index/_cb.nvd
880K    indices/wikimedium10m.patch.Lucene50.nd10M/index/_fe.nvd
92K     indices/wikimedium10m.patch.Lucene50.nd10M/index/_fp.nvd
92K     indices/wikimedium10m.patch.Lucene50.nd10M/index/_g0.nvd
92K     indices/wikimedium10m.patch.Lucene50.nd10M/index/_gb.nvd
92K     indices/wikimedium10m.patch.Lucene50.nd10M/index/_gm.nvd
92K     indices/wikimedium10m.patch.Lucene50.nd10M/index/_gx.nvd
12K     indices/wikimedium10m.patch.Lucene50.nd10M/index/_gy.nvd
12K     indices/wikimedium10m.patch.Lucene50.nd10M/index/_gz.nvd
12K     indices/wikimedium10m.patch.Lucene50.nd10M/index/_h0.nvd
12K     indices/wikimedium10m.patch.Lucene50.nd10M/index/_h1.nvd
12K     indices/wikimedium10m.patch.Lucene50.nd10M/index/_h2.nvd
4.0K    indices/wikimedium10m.patch.Lucene50.nd10M/index/_h3.nvd
4.9M    total
{noformat}

> Add norms patched compression which uses table for most common values
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-6030
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6030
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ryan Ernst
>         Attachments: LUCENE-6030.patch
>
>
> We have added the PATCHED norms sub format in lucene 50, which uses a bitset 
> to mark documents that have the most common value (when >97% of the documents 
> have that value).  This works well for fields that have a predominant value 
> length, and then a small number of docs with some other random values.  But 
> another common case is having a handful of very common value lengths, like 
> with a title field.
> We can use a table (see TABLE_COMPRESSION) to store the most common values, 
> and save an ordinal for the "other" case, at which point we can look up the 
> value in the secondary patch table.
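
To make the idea in the description concrete, here is a minimal sketch of a 
table plus patch lookup.  It is not the code from LUCENE-6030.patch: the class 
name PatchedTableNormsSketch, its methods, and the use of a plain byte[] and 
HashMap are all assumptions made for illustration.  A real norms format would 
bit-pack the ordinals and store the patch entries in an on-disk structure.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Lucene's actual norms implementation.
final class PatchedTableNormsSketch {
    private final long[] table;               // the most common values, indexed by ordinal
    private final byte[] ordinals;            // one ordinal per doc; table.length means "other"
    private final Map<Integer, Long> patch;   // docId -> value, for docs in the "other" case

    // tableSize must stay below 127 here because ordinals are kept in a byte[].
    PatchedTableNormsSketch(long[] norms, int tableSize) {
        // Count how often each value occurs.
        final Map<Long, Integer> freq = new HashMap<>();
        for (long v : norms) {
            freq.merge(v, 1, Integer::sum);
        }
        // Order values by descending frequency (ties broken by value for determinism).
        List<Long> byFreq = new ArrayList<>(freq.keySet());
        byFreq.sort((a, b) -> {
            int c = Integer.compare(freq.get(b), freq.get(a));
            return c != 0 ? c : Long.compare(a, b);
        });
        // Keep the top tableSize values in the table.
        int n = Math.min(tableSize, byFreq.size());
        table = new long[n];
        Map<Long, Integer> ordOf = new HashMap<>();
        for (int i = 0; i < n; i++) {
            table[i] = byFreq.get(i);
            ordOf.put(table[i], i);
        }
        // Encode each doc as a table ordinal, or as "other" plus a patch entry.
        ordinals = new byte[norms.length];
        patch = new HashMap<>();
        for (int doc = 0; doc < norms.length; doc++) {
            Integer ord = ordOf.get(norms[doc]);
            if (ord != null) {
                ordinals[doc] = (byte) ord.intValue();
            } else {
                ordinals[doc] = (byte) n;
                patch.put(doc, norms[doc]);
            }
        }
    }

    // Common case: one table lookup. Rare case: fall through to the patch table.
    long get(int doc) {
        int ord = ordinals[doc];
        return ord < table.length ? table[ord] : patch.get(doc);
    }

    int patchSize() {
        return patch.size();
    }
}
```

With a table of three values plus the "other" ordinal, each document would 
need only two bits once the ordinals are bit-packed, which is the kind of 
saving a field with a handful of very common lengths could hope for.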



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

