Adrien Grand created LUCENE-4702:
------------------------------------
Summary: Terms dictionary compression
Key: LUCENE-4702
URL: https://issues.apache.org/jira/browse/LUCENE-4702
Project: Lucene - Core
Issue Type: Wish
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
I've done a quick test with the block tree terms dictionary by replacing the
call to IndexOutput.writeBytes that writes suffix bytes with a call to
LZ4.compressHC, in order to measure the performance hit. Interestingly, search
performance held up well for most query types (see comparison table below) and
the .tim files were 14% smaller (from 150432 bytes overall to 129516 bytes).
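To make the shape of the change concrete, here is a minimal, self-contained sketch of the idea: write a block of suffix bytes verbatim versus compressing it first. It is not the actual patch. The class and method names are hypothetical, plain java.io streams stand in for IndexOutput, and java.util.zip.Deflater stands in for LZ4.compressHC (which is what the test really used) so that the example runs without Lucene on the classpath.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical sketch, not Lucene code. Deflater is a stand-in for LZ4.compressHC. */
class SuffixBlockSketch {

  /** Baseline: write the suffix bytes of one block verbatim, prefixed by their length. */
  static void writeRaw(DataOutputStream out, byte[] suffixes) throws IOException {
    out.writeInt(suffixes.length);
    out.write(suffixes, 0, suffixes.length);
  }

  /** Experiment: compress the suffix bytes, then write original length, compressed length, payload. */
  static void writeCompressed(DataOutputStream out, byte[] suffixes) throws IOException {
    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
    deflater.setInput(suffixes);
    deflater.finish();
    ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    byte[] buf = new byte[1024];
    while (!deflater.finished()) {
      int n = deflater.deflate(buf);
      compressed.write(buf, 0, n);
    }
    deflater.end();
    out.writeInt(suffixes.length);   // needed by the reader to size its buffer
    out.writeInt(compressed.size());
    compressed.writeTo(out);
  }

  /** Reader side: the whole block must be decompressed before any suffix can be read. */
  static byte[] readCompressed(DataInputStream in) throws IOException, DataFormatException {
    int originalLength = in.readInt();
    byte[] payload = new byte[in.readInt()];
    in.readFully(payload);
    Inflater inflater = new Inflater();
    inflater.setInput(payload);
    byte[] suffixes = new byte[originalLength];
    int off = 0;
    while (off < originalLength) {
      int n = inflater.inflate(suffixes, off, originalLength - off);
      if (n == 0 && (inflater.finished() || inflater.needsInput())) {
        break; // no more data available
      }
      off += n;
    }
    inflater.end();
    if (off != originalLength) {
      throw new IOException("truncated suffix block");
    }
    return suffixes;
  }
}
```

The reader-side cost is the point of interest: every block that is visited has to be decompressed in full, which is the kind of overhead that would be expected to show up mostly in queries that scan many terms.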
{noformat}
Task               QPS baseline   StdDev  QPS compressed   StdDev  Pct diff
Fuzzy1                   111.50   (2.0%)           78.78   (1.5%)    -29.4% ( -32% - -26%)
Fuzzy2                    36.99   (2.7%)           28.59   (1.5%)    -22.7% ( -26% - -18%)
Respell                  122.86   (2.1%)          103.89   (1.7%)    -15.4% ( -18% - -11%)
Wildcard                 100.58   (4.3%)           94.42   (3.2%)     -6.1% ( -13% - 1%)
Prefix3                  124.90   (5.7%)          122.67   (4.7%)     -1.8% ( -11% - 9%)
OrHighLow                169.87   (6.8%)          167.77   (8.0%)     -1.2% ( -15% - 14%)
LowTerm                 1949.85   (4.5%)         1929.02   (3.4%)     -1.1% ( -8% - 7%)
AndHighLow              2011.95   (3.5%)         1991.85   (3.3%)     -1.0% ( -7% - 5%)
OrHighHigh               155.63   (6.7%)          154.12   (7.9%)     -1.0% ( -14% - 14%)
AndHighHigh              341.82   (1.2%)          339.49   (1.7%)     -0.7% ( -3% - 2%)
OrHighMed                217.55   (6.3%)          216.16   (7.1%)     -0.6% ( -13% - 13%)
IntNRQ                    53.10  (10.9%)           52.90   (8.6%)     -0.4% ( -17% - 21%)
MedTerm                  998.11   (3.8%)          994.82   (5.6%)     -0.3% ( -9% - 9%)
MedSpanNear               60.50   (3.7%)           60.36   (4.8%)     -0.2% ( -8% - 8%)
HighSpanNear              19.74   (4.5%)           19.72   (5.1%)     -0.1% ( -9% - 9%)
LowSpanNear              101.93   (3.2%)          101.82   (4.4%)     -0.1% ( -7% - 7%)
AndHighMed               366.18   (1.7%)          366.93   (1.7%)      0.2% ( -3% - 3%)
PKLookup                 237.28   (4.0%)          237.96   (4.2%)      0.3% ( -7% - 8%)
MedPhrase                173.17   (4.7%)          174.69   (4.7%)      0.9% ( -8% - 10%)
LowSloppyPhrase          180.91   (2.6%)          182.79   (2.7%)      1.0% ( -4% - 6%)
LowPhrase                374.64   (5.5%)          379.11   (5.8%)      1.2% ( -9% - 13%)
HighTerm                 253.14   (7.9%)          256.97  (11.4%)      1.5% ( -16% - 22%)
HighPhrase                19.52  (10.6%)           19.83  (11.0%)      1.6% ( -18% - 25%)
MedSloppyPhrase          141.90   (2.6%)          144.11   (2.5%)      1.6% ( -3% - 6%)
HighSloppyPhrase          25.26   (4.8%)           25.97   (5.0%)      2.8% ( -6% - 13%)
{noformat}
Only queries that are very terms-dictionary-intensive took a performance hit
(Fuzzy1, Fuzzy2, Respell, Wildcard); other queries, including Prefix3, behaved
(surprisingly) well.
Do you think this is worth exploring?