Adrien Grand created LUCENE-4702:
------------------------------------

             Summary: Terms dictionary compression
                 Key: LUCENE-4702
                 URL: https://issues.apache.org/jira/browse/LUCENE-4702
             Project: Lucene - Core
          Issue Type: Wish
            Reporter: Adrien Grand
            Assignee: Adrien Grand
            Priority: Trivial


I've done a quick test with the block tree terms dictionary, replacing the 
call to IndexOutput.writeBytes that writes suffix bytes with a call to 
LZ4.compressHC, to measure the performance hit. Interestingly, search 
performance was very good (see the comparison table below) and the .tim files 
were 14% smaller (from 150432 bytes overall to 129516).
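
For illustration only (this is not the patch tested above): a minimal Python sketch of the idea of compressing a block's suffix bytes instead of writing them raw. It assumes a simplified block layout in which the shared prefix is stripped and the remaining suffixes are concatenated. zlib is used as a stand-in because LZ4 is not in the Python standard library, and the term list and helper names are hypothetical.

```python
import zlib

def shared_prefix_len(terms):
    """Length of the longest byte prefix shared by every term in the block."""
    first, last = min(terms), max(terms)
    n = 0
    for a, b in zip(first, last):
        if a != b:
            break
        n += 1
    return n

def suffix_bytes(terms):
    """Concatenate the suffixes left after stripping the shared prefix."""
    n = shared_prefix_len(terms)
    return b"".join(t[n:] for t in terms)

def size_reduction(before, after):
    """Fractional size reduction going from `before` bytes to `after` bytes."""
    return (before - after) / before

# Hypothetical block of sorted terms sharing the prefix b"compress".
block = [b"compress", b"compressed", b"compresses", b"compressing",
         b"compression", b"compressions", b"compressor", b"compressors"]

raw = suffix_bytes(block)
packed = zlib.compress(raw, 9)         # stand-in for LZ4.compressHC (assumption)
assert zlib.decompress(packed) == raw  # the compression is lossless

print(len(raw), len(packed))
# Reduction implied by the .tim sizes quoted above.
print(f"{size_reduction(150432, 129516):.1%}")
```

On an input this small the compressed form may well be larger than the raw bytes. The sketch only shows where compression would slot in, not the ratio a real index would see.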

{noformat}
            Task  QPS baseline   StdDev  QPS compressed   StdDev  Pct diff
          Fuzzy1        111.50   (2.0%)           78.78   (1.5%)    -29.4% ( -32% -  -26%)
          Fuzzy2         36.99   (2.7%)           28.59   (1.5%)    -22.7% ( -26% -  -18%)
         Respell        122.86   (2.1%)          103.89   (1.7%)    -15.4% ( -18% -  -11%)
        Wildcard        100.58   (4.3%)           94.42   (3.2%)     -6.1% ( -13% -    1%)
         Prefix3        124.90   (5.7%)          122.67   (4.7%)     -1.8% ( -11% -    9%)
       OrHighLow        169.87   (6.8%)          167.77   (8.0%)     -1.2% ( -15% -   14%)
         LowTerm       1949.85   (4.5%)         1929.02   (3.4%)     -1.1% (  -8% -    7%)
      AndHighLow       2011.95   (3.5%)         1991.85   (3.3%)     -1.0% (  -7% -    5%)
      OrHighHigh        155.63   (6.7%)          154.12   (7.9%)     -1.0% ( -14% -   14%)
     AndHighHigh        341.82   (1.2%)          339.49   (1.7%)     -0.7% (  -3% -    2%)
       OrHighMed        217.55   (6.3%)          216.16   (7.1%)     -0.6% ( -13% -   13%)
          IntNRQ         53.10  (10.9%)           52.90   (8.6%)     -0.4% ( -17% -   21%)
         MedTerm        998.11   (3.8%)          994.82   (5.6%)     -0.3% (  -9% -    9%)
     MedSpanNear         60.50   (3.7%)           60.36   (4.8%)     -0.2% (  -8% -    8%)
    HighSpanNear         19.74   (4.5%)           19.72   (5.1%)     -0.1% (  -9% -    9%)
     LowSpanNear        101.93   (3.2%)          101.82   (4.4%)     -0.1% (  -7% -    7%)
      AndHighMed        366.18   (1.7%)          366.93   (1.7%)      0.2% (  -3% -    3%)
        PKLookup        237.28   (4.0%)          237.96   (4.2%)      0.3% (  -7% -    8%)
       MedPhrase        173.17   (4.7%)          174.69   (4.7%)      0.9% (  -8% -   10%)
 LowSloppyPhrase        180.91   (2.6%)          182.79   (2.7%)      1.0% (  -4% -    6%)
       LowPhrase        374.64   (5.5%)          379.11   (5.8%)      1.2% (  -9% -   13%)
        HighTerm        253.14   (7.9%)          256.97  (11.4%)      1.5% ( -16% -   22%)
      HighPhrase         19.52  (10.6%)           19.83  (11.0%)      1.6% ( -18% -   25%)
 MedSloppyPhrase        141.90   (2.6%)          144.11   (2.5%)      1.6% (  -3% -    6%)
HighSloppyPhrase         25.26   (4.8%)           25.97   (5.0%)      2.8% (  -6% -   13%)
{noformat}

Only queries that are very terms-dictionary-intensive took a performance hit 
(Fuzzy1, Fuzzy2, Respell, Wildcard); other queries, including Prefix3, behaved 
(surprisingly) well.
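
As a rough cross-check, the Pct diff figures for the affected tasks can be re-derived from the two QPS columns. The sketch below assumes Pct diff is the relative change in mean QPS from baseline to compressed. It does not attempt to reproduce the parenthesised range, and the function name is hypothetical.

```python
def pct_diff(baseline_qps, compressed_qps):
    """Relative change in QPS, in percent (negative means slower)."""
    return 100.0 * (compressed_qps - baseline_qps) / baseline_qps

# (baseline QPS, compressed QPS) copied from the table above.
rows = {
    "Fuzzy1":   (111.50, 78.78),
    "Fuzzy2":   (36.99, 28.59),
    "Respell":  (122.86, 103.89),
    "Wildcard": (100.58, 94.42),
}

for task, (base, comp) in rows.items():
    print(f"{task:>10} {pct_diff(base, comp):+.1f}%")
```

Because the QPS values in the table are already rounded, a recomputed figure can differ from the reported one in the last digit.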

Do you think this is worth exploring?
