[
https://issues.apache.org/jira/browse/LUCENE-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564286#comment-13564286
]
Adrien Grand commented on LUCENE-4702:
--------------------------------------
bq. maybe for fun try varying the blocksize params (25,48) to get bigger
blocks...
With minBlockSize=50 (instead of 25) and maxBlockSize=98 (instead of 48), the
.tim files totaled 126420 bytes (16% smaller than on Lucene trunk, 2% smaller
than with min/maxBlockSize=(25,48)), but performance was worse, especially for
fuzzy queries (baseline is Lucene trunk with the default values for
min/maxBlockSize):
{noformat}
Task              QPS baseline   StdDev  QPS compressed   StdDev  Pct diff
Fuzzy1                  100.78   (3.0%)           57.61   (2.7%)  -42.8% ( -47% - -38%)
Fuzzy2                   76.72   (3.8%)           46.24   (2.9%)  -39.7% ( -44% - -34%)
Respell                  93.17   (3.5%)           65.20   (2.8%)  -30.0% ( -35% - -24%)
Wildcard                222.57   (2.8%)          186.22   (3.6%)  -16.3% ( -22% - -10%)
AndHighLow             1731.64   (3.7%)         1605.87   (4.7%)   -7.3% ( -15% - 1%)
LowTerm                1860.60   (3.1%)         1743.77   (4.1%)   -6.3% ( -13% - 0%)
AndHighMed              816.71   (1.8%)          785.71   (2.3%)   -3.8% ( -7% - 0%)
MedTerm                 923.83   (3.3%)          896.61   (3.3%)   -2.9% ( -9% - 3%)
MedPhrase                49.85   (7.4%)           48.72   (7.7%)   -2.3% ( -16% - 13%)
HighSloppyPhrase         92.85   (4.6%)           90.92   (6.2%)   -2.1% ( -12% - 9%)
LowPhrase               183.01   (3.1%)          179.72   (4.1%)   -1.8% ( -8% - 5%)
LowSpanNear             115.03   (4.6%)          113.17   (3.9%)   -1.6% ( -9% - 7%)
HighTerm                352.00   (3.4%)          346.67   (3.6%)   -1.5% ( -8% - 5%)
MedSpanNear             193.22   (4.4%)          190.46   (4.1%)   -1.4% ( -9% - 7%)
MedSloppyPhrase         160.66   (4.2%)          158.38   (4.6%)   -1.4% ( -9% - 7%)
OrHighMed               177.29   (6.4%)          174.81   (6.7%)   -1.4% ( -13% - 12%)
HighSpanNear             42.47   (4.4%)           41.90   (4.4%)   -1.3% ( -9% - 7%)
LowSloppyPhrase         203.18   (2.5%)          200.57   (3.5%)   -1.3% ( -7% - 4%)
OrHighLow               149.20   (7.5%)          147.33   (7.7%)   -1.3% ( -15% - 15%)
AndHighHigh             216.43   (1.6%)          213.73   (1.9%)   -1.2% ( -4% - 2%)
HighPhrase               35.43   (8.6%)           35.06   (8.4%)   -1.0% ( -16% - 17%)
Prefix3                 455.95   (4.2%)          451.71   (4.0%)   -0.9% ( -8% - 7%)
OrHighHigh              100.72   (7.8%)          100.51   (7.6%)   -0.2% ( -14% - 16%)
IntNRQ                   62.50   (7.7%)           62.75   (8.2%)    0.4% ( -14% - 17%)
PKLookup                238.55   (4.9%)          241.72   (4.3%)    1.3% ( -7% - 11%)
{noformat}
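For anyone who wants to re-check the percentages, here is a small Python sketch that recomputes them from the byte counts and QPS values quoted in this thread. The helper name `pct_change` and the constant names are mine, not part of the benchmark tooling, and the sketch does not attempt to reproduce the confidence ranges shown in the last column.

```python
def pct_change(old, new):
    """Relative change from old to new, in percent (negative means a decrease)."""
    return 100.0 * (new - old) / old

# Total .tim sizes in bytes, as quoted in this thread.
TRUNK = 150432      # Lucene trunk, suffix bytes written uncompressed
LZ4_25_48 = 129516  # LZ4 HC patch, min/maxBlockSize = (25, 48)
LZ4_50_98 = 126420  # LZ4 HC patch, min/maxBlockSize = (50, 98)

print(f"(25,48) vs trunk:   {pct_change(TRUNK, LZ4_25_48):.1f}%")
print(f"(50,98) vs trunk:   {pct_change(TRUNK, LZ4_50_98):.1f}%")
print(f"(50,98) vs (25,48): {pct_change(LZ4_25_48, LZ4_50_98):.1f}%")

# The "Pct diff" column is the same formula applied to the two QPS means,
# e.g. the Fuzzy1 row of the (50, 98) run.
print(f"Fuzzy1 QPS diff:    {pct_change(100.78, 57.61):.1f}%")
```

Rounded to whole percent, the three size figures come out as the 14%, 16% and 2% reductions mentioned in the text.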
> Terms dictionary compression
> ----------------------------
>
> Key: LUCENE-4702
> URL: https://issues.apache.org/jira/browse/LUCENE-4702
> Project: Lucene - Core
> Issue Type: Wish
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Trivial
> Attachments: LUCENE-4702.patch
>
>
> I've done a quick test with the block tree terms dictionary by replacing a
> call to IndexOutput.writeBytes to write suffix bytes with a call to
> LZ4.compressHC to test the performance hit. Interestingly, search performance
> was very good (see comparison table below) and the .tim files were 14% smaller
> (from 150432 bytes overall to 129516).
> {noformat}
> Task              QPS baseline   StdDev  QPS compressed   StdDev  Pct diff
> Fuzzy1                  111.50   (2.0%)           78.78   (1.5%)  -29.4% ( -32% - -26%)
> Fuzzy2                   36.99   (2.7%)           28.59   (1.5%)  -22.7% ( -26% - -18%)
> Respell                 122.86   (2.1%)          103.89   (1.7%)  -15.4% ( -18% - -11%)
> Wildcard                100.58   (4.3%)           94.42   (3.2%)   -6.1% ( -13% - 1%)
> Prefix3                 124.90   (5.7%)          122.67   (4.7%)   -1.8% ( -11% - 9%)
> OrHighLow               169.87   (6.8%)          167.77   (8.0%)   -1.2% ( -15% - 14%)
> LowTerm                1949.85   (4.5%)         1929.02   (3.4%)   -1.1% ( -8% - 7%)
> AndHighLow             2011.95   (3.5%)         1991.85   (3.3%)   -1.0% ( -7% - 5%)
> OrHighHigh              155.63   (6.7%)          154.12   (7.9%)   -1.0% ( -14% - 14%)
> AndHighHigh             341.82   (1.2%)          339.49   (1.7%)   -0.7% ( -3% - 2%)
> OrHighMed               217.55   (6.3%)          216.16   (7.1%)   -0.6% ( -13% - 13%)
> IntNRQ                   53.10  (10.9%)           52.90   (8.6%)   -0.4% ( -17% - 21%)
> MedTerm                 998.11   (3.8%)          994.82   (5.6%)   -0.3% ( -9% - 9%)
> MedSpanNear              60.50   (3.7%)           60.36   (4.8%)   -0.2% ( -8% - 8%)
> HighSpanNear             19.74   (4.5%)           19.72   (5.1%)   -0.1% ( -9% - 9%)
> LowSpanNear             101.93   (3.2%)          101.82   (4.4%)   -0.1% ( -7% - 7%)
> AndHighMed              366.18   (1.7%)          366.93   (1.7%)    0.2% ( -3% - 3%)
> PKLookup                237.28   (4.0%)          237.96   (4.2%)    0.3% ( -7% - 8%)
> MedPhrase               173.17   (4.7%)          174.69   (4.7%)    0.9% ( -8% - 10%)
> LowSloppyPhrase         180.91   (2.6%)          182.79   (2.7%)    1.0% ( -4% - 6%)
> LowPhrase               374.64   (5.5%)          379.11   (5.8%)    1.2% ( -9% - 13%)
> HighTerm                253.14   (7.9%)          256.97  (11.4%)    1.5% ( -16% - 22%)
> HighPhrase               19.52  (10.6%)           19.83  (11.0%)    1.6% ( -18% - 25%)
> MedSloppyPhrase         141.90   (2.6%)          144.11   (2.5%)    1.6% ( -3% - 6%)
> HighSloppyPhrase         25.26   (4.8%)           25.97   (5.0%)    2.8% ( -6% - 13%)
> {noformat}
> Only queries which are very terms-dictionary-intensive got a performance hit
> (Fuzzy1, Fuzzy2, Respell, Wildcard); other queries, including Prefix3, behaved
> (surprisingly) well.
> Do you think this is worth exploring?
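To make the block-size experiment easier to picture, here is a rough Python sketch of per-block suffix compression. It is a toy model and not the patch: `zlib` stands in for LZ4 HC (which is not in the Python standard library), blocks are cut at a fixed term count rather than chosen by shared prefix within min/maxBlockSize as the block tree writer does, and the vocabulary is synthetic. The function names are hypothetical.

```python
import zlib

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def suffix_blocks(sorted_terms, block_size):
    """Yield, per block, the concatenated term suffixes after the block's shared prefix.

    Assumption: fixed-size blocks. A real writer would also record suffix
    lengths separately so that the terms can be split apart again.
    """
    for start in range(0, len(sorted_terms), block_size):
        block = sorted_terms[start:start + block_size]
        # Terms are sorted, so the prefix shared by the first and last term
        # of a block is shared by every term in between.
        prefix = common_prefix_len(block[0], block[-1])
        yield b"".join(term[prefix:] for term in block)

def compressed_size(sorted_terms, block_size):
    """Total bytes after compressing each block's suffix bytes independently."""
    return sum(len(zlib.compress(chunk))
               for chunk in suffix_blocks(sorted_terms, block_size))

# Synthetic, hypothetical vocabulary; real term dictionaries look different.
terms = sorted(f"term{i:05d}".encode() for i in range(5000))

for size in (48, 98):
    raw = sum(len(chunk) for chunk in suffix_blocks(terms, size))
    print(f"block size {size}: raw suffix bytes={raw}, "
          f"compressed={compressed_size(terms, size)}")
```

Because each block is compressed on its own, a reader has to decompress a whole block before it can scan the terms in it. That is at least consistent with the terms-dictionary-heavy queries (Fuzzy1, Fuzzy2, Respell, Wildcard) taking the largest hit and with larger blocks making it worse, though the sketch itself does not measure that.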