Michael McCandless created LUCENE-5308:
------------------------------------------
Summary: explore per-dimension fixed-width ordinal encoding
Key: LUCENE-5308
URL: https://issues.apache.org/jira/browse/LUCENE-5308
Project: Lucene - Core
Issue Type: Improvement
Components: modules/facet
Reporter: Michael McCandless
I've been testing performance of Solr vs Lucene facets, and one area
where Solr's "fcs" method shines (low RAM, high faceting perf) is in
low-cardinality dimensions.
I suspect the gains come from how the field-cache entries encode the
ords: in "column-stride" form, and private to that dim (vs. the facet
module's shared ord space).
So I thought about whether we could do something like this in the
facet module ...
I.e., if we know certain documents will have a specific set of
single-valued dimensions, we can pick an encoding format for the
per-doc byte[] "globally" for all such documents, and use a private
ord space per dimension to improve compression.
The basic idea is to pre-assign up-front (before the segment is
written) which bytes belong to which dim. E.g., date takes bytes 0-1
(<= 65536 unique labels), imageCount takes byte 2 (<= 256
unique labels), username takes bytes 3-6 (<= 16.8 M unique labels),
etc. This only works for single-valued dims, and only works if all
docs (or at least an identifiable subset?) have all dims.
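To make the layout concrete, here is a minimal sketch (hypothetical names,
not the actual prototype patch) of how the up-front byte assignment and the
per-doc packing could look; each doc's packed value would then be stored in
a binary doc values field:
{noformat}
// Hypothetical sketch, not the prototype patch: give each single-valued
// dim a fixed byte slice of the per-doc value, sized from that dim's
// maximum ordinal, and pack/unpack ords big-endian into that slice.
class FixedWidthOrdEncoder {
  private final int[] offsets;   // starting byte of each dim's slice
  private final int[] widths;    // number of bytes in each dim's slice
  private final int totalBytes;  // fixed per-doc byte[] length

  FixedWidthOrdEncoder(int[] maxOrdPerDim) {
    offsets = new int[maxOrdPerDim.length];
    widths = new int[maxOrdPerDim.length];
    int offset = 0;
    for (int dim = 0; dim < maxOrdPerDim.length; dim++) {
      offsets[dim] = offset;
      // smallest whole number of bytes that can hold this dim's max ord
      int width = 1;
      long limit = 256;
      while (limit <= maxOrdPerDim[dim]) {
        width++;
        limit <<= 8;
      }
      widths[dim] = width;
      offset += width;
    }
    totalBytes = offset;
  }

  /** Pack one doc's per-dim ordinals into the fixed-width byte[]. */
  byte[] encode(int[] ordPerDim) {
    byte[] buf = new byte[totalBytes];
    for (int dim = 0; dim < ordPerDim.length; dim++) {
      int ord = ordPerDim[dim];
      for (int b = widths[dim] - 1; b >= 0; b--) {
        buf[offsets[dim] + b] = (byte) (ord & 0xFF);
        ord >>>= 8;
      }
    }
    return buf;
  }

  /** Read one dim's ordinal (in its private ord space) back out. */
  int decode(byte[] buf, int dim) {
    int ord = 0;
    for (int b = 0; b < widths[dim]; b++) {
      ord = (ord << 8) | (buf[offsets[dim] + b] & 0xFF);
    }
    return ord;
  }
}
{noformat}
At search time, counting a dim would then only need to read that dim's
fixed slice of each doc's byte[], so small dims stay cheap no matter how
large the other dims' ord spaces are.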
To test this idea, I made a hacked-up prototype patch; it has tons of
limitations so we clearly can't commit it, but I was able to test full
English Wikipedia with 7 facet dims (date, username, refCount, imageCount,
sectionCount, subSectionCount, subSubSectionCount).
Trunk (base) requires 181 MB of net doc values to hold the facet ords,
while the patch requires 183 MB.
Perf:
{noformat}
Report after iter 19:
                Task    QPS base    StdDev    QPS comp    StdDev              Pct diff
             Respell       54.30    (3.1%)       54.02    (2.7%)    -0.5% ( -6% -   5%)
     MedSloppyPhrase        3.58    (5.6%)        3.60    (6.0%)     0.6% (-10% -  12%)
        OrNotHighLow       63.58    (6.8%)       64.03    (6.9%)     0.7% (-12% -  15%)
    HighSloppyPhrase        3.80    (7.4%)        3.84    (7.1%)     1.1% (-12% -  16%)
         LowSpanNear        8.93    (3.5%)        9.09    (4.6%)     1.8% ( -6% -  10%)
           LowPhrase       12.15    (6.4%)       12.43    (7.2%)     2.3% (-10% -  17%)
          AndHighLow      402.54    (1.4%)      425.23    (2.3%)     5.6% (  1% -   9%)
     LowSloppyPhrase       39.53    (1.6%)       42.01    (1.9%)     6.3% (  2% -   9%)
         MedSpanNear       26.54    (2.8%)       28.39    (3.6%)     7.0% (  0% -  13%)
          HighPhrase        4.01    (8.1%)        4.30    (9.7%)     7.4% ( -9% -  27%)
              Fuzzy2       44.01    (2.3%)       47.43    (1.8%)     7.8% (  3% -  12%)
        OrNotHighMed       32.64    (4.7%)       35.22    (5.5%)     7.9% ( -2% -  19%)
              Fuzzy1       62.24    (2.1%)       67.35    (1.9%)     8.2% (  4% -  12%)
           MedPhrase      129.06    (4.9%)      141.14    (6.2%)     9.4% ( -1% -  21%)
          AndHighMed       27.71    (0.7%)       30.32    (1.1%)     9.4% (  7% -  11%)
        HighSpanNear        5.15    (3.5%)        5.63    (4.2%)     9.5% (  1% -  17%)
         AndHighHigh       24.98    (0.7%)       27.89    (1.1%)    11.7% (  9% -  13%)
       OrNotHighHigh       15.13    (2.0%)       17.90    (2.6%)    18.3% ( 13% -  23%)
            Wildcard        9.06    (1.4%)       10.85    (2.6%)    19.8% ( 15% -  24%)
       OrHighNotHigh        8.84    (1.8%)       10.64    (2.6%)    20.3% ( 15% -  25%)
          OrHighHigh        3.73    (1.6%)        4.51    (2.4%)    20.9% ( 16% -  25%)
           OrHighLow        5.22    (1.5%)        6.34    (2.5%)    21.4% ( 17% -  25%)
        OrHighNotLow        8.94    (1.6%)       10.95    (2.5%)    22.5% ( 18% -  26%)
             Prefix3       27.61    (1.2%)       33.90    (2.3%)    22.8% ( 19% -  26%)
           OrHighMed       11.72    (1.6%)       14.56    (2.3%)    24.3% ( 20% -  28%)
        OrHighNotMed       14.74    (1.5%)       18.34    (2.2%)    24.5% ( 20% -  28%)
             MedTerm       26.37    (1.2%)       32.85    (2.7%)    24.6% ( 20% -  28%)
              IntNRQ        2.61    (1.2%)        3.25    (3.0%)    24.7% ( 20% -  29%)
            HighTerm       19.69    (1.3%)       25.33    (3.0%)    28.7% ( 23% -  33%)
             LowTerm      131.50    (1.3%)      170.49    (3.0%)    29.7% ( 25% -  34%)
{noformat}
I think the gains are sizable, and the increase in index size is quite
minor (in another test with fewer dims I saw the index size get a bit
smaller) ... at least for this specific test.
However, finding a clean solution here will be tricky...