[jira] [Commented] (LUCENE-4702) Terms dictionary compression

Adrien Grand (JIRA) Mon, 28 Jan 2013 05:53:15 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564286#comment-13564286
 ]


Adrien Grand commented on LUCENE-4702:
--------------------------------------

bq. maybe for fun try varying the blocksize params (25,48) to get bigger 
blocks...

With minBlockSize=50 (instead of 25) and maxBlockSize=98 (instead of 48), the 
.tim files totalled 126420 bytes (a 16% reduction compared to Lucene trunk and 
a 2% reduction compared to min/maxBlockSize=(25,48)), but performance was 
worse, especially for fuzzy queries. The baseline is Lucene trunk with the 
default values for min/maxBlockSize:

{noformat}
                    Task    QPS baseline    StdDev    QPS compressed    StdDev  Pct diff
                  Fuzzy1          100.78    (3.0%)             57.61    (2.7%)    -42.8% ( -47% -  -38%)
                  Fuzzy2           76.72    (3.8%)             46.24    (2.9%)    -39.7% ( -44% -  -34%)
                 Respell           93.17    (3.5%)             65.20    (2.8%)    -30.0% ( -35% -  -24%)
                Wildcard          222.57    (2.8%)            186.22    (3.6%)    -16.3% ( -22% -  -10%)
              AndHighLow         1731.64    (3.7%)           1605.87    (4.7%)     -7.3% ( -15% -    1%)
                 LowTerm         1860.60    (3.1%)           1743.77    (4.1%)     -6.3% ( -13% -    0%)
              AndHighMed          816.71    (1.8%)            785.71    (2.3%)     -3.8% (  -7% -    0%)
                 MedTerm          923.83    (3.3%)            896.61    (3.3%)     -2.9% (  -9% -    3%)
               MedPhrase           49.85    (7.4%)             48.72    (7.7%)     -2.3% ( -16% -   13%)
        HighSloppyPhrase           92.85    (4.6%)             90.92    (6.2%)     -2.1% ( -12% -    9%)
               LowPhrase          183.01    (3.1%)            179.72    (4.1%)     -1.8% (  -8% -    5%)
             LowSpanNear          115.03    (4.6%)            113.17    (3.9%)     -1.6% (  -9% -    7%)
                HighTerm          352.00    (3.4%)            346.67    (3.6%)     -1.5% (  -8% -    5%)
             MedSpanNear          193.22    (4.4%)            190.46    (4.1%)     -1.4% (  -9% -    7%)
         MedSloppyPhrase          160.66    (4.2%)            158.38    (4.6%)     -1.4% (  -9% -    7%)
               OrHighMed          177.29    (6.4%)            174.81    (6.7%)     -1.4% ( -13% -   12%)
            HighSpanNear           42.47    (4.4%)             41.90    (4.4%)     -1.3% (  -9% -    7%)
         LowSloppyPhrase          203.18    (2.5%)            200.57    (3.5%)     -1.3% (  -7% -    4%)
               OrHighLow          149.20    (7.5%)            147.33    (7.7%)     -1.3% ( -15% -   15%)
             AndHighHigh          216.43    (1.6%)            213.73    (1.9%)     -1.2% (  -4% -    2%)
              HighPhrase           35.43    (8.6%)             35.06    (8.4%)     -1.0% ( -16% -   17%)
                 Prefix3          455.95    (4.2%)            451.71    (4.0%)     -0.9% (  -8% -    7%)
              OrHighHigh          100.72    (7.8%)            100.51    (7.6%)     -0.2% ( -14% -   16%)
                  IntNRQ           62.50    (7.7%)             62.75    (8.2%)      0.4% ( -14% -   17%)
                PKLookup          238.55    (4.9%)            241.72    (4.3%)      1.3% (  -7% -   11%)
{noformat}
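
The size figures in this thread can be cross-checked with a little arithmetic. The sketch below is only an illustration: the byte counts are the ones reported in this comment and in the issue description (150432 for trunk, 129516 with compression at the default block sizes, 126420 with min/maxBlockSize=(50,98)), while the helper name reduction_pct and the constant names are made up for the example.

```python
def reduction_pct(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


# Total .tim sizes in bytes, as reported in the thread.
TRUNK = 150432             # Lucene trunk, no compression
COMPRESSED_25_48 = 129516  # compressed, default min/maxBlockSize=(25,48)
COMPRESSED_50_98 = 126420  # compressed, min/maxBlockSize=(50,98)

# Compression alone vs. trunk
print(f"{reduction_pct(TRUNK, COMPRESSED_25_48):.1f}%")             # 13.9%
# Compression plus bigger blocks vs. trunk
print(f"{reduction_pct(TRUNK, COMPRESSED_50_98):.1f}%")             # 16.0%
# Bigger blocks vs. default blocks, both compressed
print(f"{reduction_pct(COMPRESSED_25_48, COMPRESSED_50_98):.1f}%")  # 2.4%
```

These match the rounded 14%, 16% and 2% quoted in the text.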
                
> Terms dictionary compression
> ----------------------------
>
>                 Key: LUCENE-4702
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4702
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Trivial
>         Attachments: LUCENE-4702.patch
>
>
> I've done a quick test with the block tree terms dictionary by replacing a 
> call to IndexOutput.writeBytes to write suffix bytes with a call to 
> LZ4.compressHC to test the performance hit. Interestingly, search performance 
> was very good (see comparison table below) and the .tim files were 14% 
> smaller (from 150432 bytes overall to 129516).
> {noformat}
>                     Task    QPS baseline    StdDev    QPS compressed    StdDev  Pct diff
>                   Fuzzy1          111.50    (2.0%)             78.78    (1.5%)    -29.4% ( -32% -  -26%)
>                   Fuzzy2           36.99    (2.7%)             28.59    (1.5%)    -22.7% ( -26% -  -18%)
>                  Respell          122.86    (2.1%)            103.89    (1.7%)    -15.4% ( -18% -  -11%)
>                 Wildcard          100.58    (4.3%)             94.42    (3.2%)     -6.1% ( -13% -    1%)
>                  Prefix3          124.90    (5.7%)            122.67    (4.7%)     -1.8% ( -11% -    9%)
>                OrHighLow          169.87    (6.8%)            167.77    (8.0%)     -1.2% ( -15% -   14%)
>                  LowTerm         1949.85    (4.5%)           1929.02    (3.4%)     -1.1% (  -8% -    7%)
>               AndHighLow         2011.95    (3.5%)           1991.85    (3.3%)     -1.0% (  -7% -    5%)
>               OrHighHigh          155.63    (6.7%)            154.12    (7.9%)     -1.0% ( -14% -   14%)
>              AndHighHigh          341.82    (1.2%)            339.49    (1.7%)     -0.7% (  -3% -    2%)
>                OrHighMed          217.55    (6.3%)            216.16    (7.1%)     -0.6% ( -13% -   13%)
>                   IntNRQ           53.10   (10.9%)             52.90    (8.6%)     -0.4% ( -17% -   21%)
>                  MedTerm          998.11    (3.8%)            994.82    (5.6%)     -0.3% (  -9% -    9%)
>              MedSpanNear           60.50    (3.7%)             60.36    (4.8%)     -0.2% (  -8% -    8%)
>             HighSpanNear           19.74    (4.5%)             19.72    (5.1%)     -0.1% (  -9% -    9%)
>              LowSpanNear          101.93    (3.2%)            101.82    (4.4%)     -0.1% (  -7% -    7%)
>               AndHighMed          366.18    (1.7%)            366.93    (1.7%)      0.2% (  -3% -    3%)
>                 PKLookup          237.28    (4.0%)            237.96    (4.2%)      0.3% (  -7% -    8%)
>                MedPhrase          173.17    (4.7%)            174.69    (4.7%)      0.9% (  -8% -   10%)
>          LowSloppyPhrase          180.91    (2.6%)            182.79    (2.7%)      1.0% (  -4% -    6%)
>                LowPhrase          374.64    (5.5%)            379.11    (5.8%)      1.2% (  -9% -   13%)
>                 HighTerm          253.14    (7.9%)            256.97   (11.4%)      1.5% ( -16% -   22%)
>               HighPhrase           19.52   (10.6%)             19.83   (11.0%)      1.6% ( -18% -   25%)
>          MedSloppyPhrase          141.90    (2.6%)            144.11    (2.5%)      1.6% (  -3% -    6%)
>         HighSloppyPhrase           25.26    (4.8%)             25.97    (5.0%)      2.8% (  -6% -   13%)
> {noformat}
> Only queries that are very terms-dictionary-intensive took a performance hit 
> (Fuzzy1, Fuzzy2, Respell, Wildcard); other queries, including Prefix3, 
> behaved (surprisingly) well.
> Do you think this is worth exploring?
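
To make the experiment in the description concrete, here is a rough sketch of the idea of compressing a block's suffix bytes instead of writing them raw. It is not the attached patch and not Lucene code. The length-prefixed suffix layout and the function names (encode_suffixes, write_block, read_block) are assumptions made for illustration. zlib stands in for LZ4 HC only because it ships with Python.

```python
import zlib


def encode_suffixes(terms, prefix):
    """Serialize each term's suffix (the part after the block's shared prefix)
    as one length byte followed by the UTF-8 suffix bytes (hypothetical layout)."""
    out = bytearray()
    for term in terms:
        if not term.startswith(prefix):
            raise ValueError(f"{term!r} does not start with {prefix!r}")
        suffix = term[len(prefix):].encode("utf-8")
        if len(suffix) > 255:
            raise ValueError("suffix too long for a one-byte length")
        out.append(len(suffix))
        out += suffix
    return bytes(out)


def decode_suffixes(data, prefix):
    """Inverse of encode_suffixes: rebuild the full terms of a block."""
    terms, pos = [], 0
    while pos < len(data):
        n = data[pos]
        pos += 1
        terms.append(prefix + data[pos:pos + n].decode("utf-8"))
        pos += n
    return terms


def write_block(terms, prefix, compress):
    """Return the bytes stored for one block, raw or compressed."""
    raw = encode_suffixes(terms, prefix)
    # zlib is a stand-in for LZ4 HC here, not what the experiment used.
    return zlib.compress(raw, 9) if compress else raw


def read_block(payload, prefix, compressed):
    """A reader has to decompress the whole block before it can scan any term."""
    raw = zlib.decompress(payload) if compressed else payload
    return decode_suffixes(raw, prefix)


block = ["compress", "compressed", "compresses",
         "compressing", "compression", "compressor"]
plain = write_block(block, "compress", compress=False)
packed = write_block(block, "compress", compress=True)
print(len(plain), len(packed))
```

For a block this small the compressed form can be larger than the raw one because of the compressor's fixed overhead, so the sizes printed here say nothing about real .tim files. The point is the read path: every block visit pays for a full decompression, which is one plausible reason the queries that touch many blocks are the ones that slow down.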

