[ https://issues.apache.org/jira/browse/LUCENE-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13806164#comment-13806164 ]
Michael McCandless commented on LUCENE-5308:
--------------------------------------------
One nice side effect of the fixed-width encoding is that it would be simple
to build specialized decoders; this could be done once (globally) after
opening a new reader, e.g. with ASM bytecode generation.
To test the potential gain I hand-specialized the decode for my current
index, like this:
{code}
// Each dim's ord is a fixed-width, big-endian slice of the per-doc byte[];
// the constants (234, 490, ...) are where each dim's private ord space
// starts in the shared counts[] array.
// imageCount
counts[bytes[offset] & 0xFF]++;
// refCount
counts[234 + (bytes[offset+1] & 0xFF)]++;
// sectionCount
counts[490 + (bytes[offset+2] & 0xFF)]++;
// subSectionCount
counts[713 + (bytes[offset+3] & 0xFF)]++;
// subSubSectionCount
counts[950 + (bytes[offset+4] & 0xFF)]++;
// date
counts[1193 + ((bytes[offset+5] & 0xFF) << 8) + (bytes[offset+6] & 0xFF)]++;
// userName
counts[4473 + ((bytes[offset+7] & 0xFF) << 16) + ((bytes[offset+8] & 0xFF) << 8) + (bytes[offset+9] & 0xFF)]++;
{code}
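For contrast, here is a minimal sketch (not from the patch; the FixedWidthLayout
class and its field names are hypothetical) of the generic per-document loop the
snippet above hand-unrolls: each dim declares how many bytes its ordinal occupies
and where its private ord space starts in the shared counts[] array:
{code}
// Hypothetical generic decode: walk each dimension's fixed-width slice of the
// per-doc byte[] and increment that dimension's region of the shared counts[].
class FixedWidthLayout {
  final int[] byteWidths;  // bytes per dim, e.g. {1, 1, 1, 1, 1, 2, 3}
  final int[] countsBase;  // start of each dim's private ord space in counts[]

  FixedWidthLayout(int[] byteWidths, int[] countsBase) {
    this.byteWidths = byteWidths;
    this.countsBase = countsBase;
  }

  void countOneDoc(byte[] bytes, int offset, int[] counts) {
    for (int dim = 0; dim < byteWidths.length; dim++) {
      int ord = 0;
      for (int i = 0; i < byteWidths[dim]; i++) {
        ord = (ord << 8) | (bytes[offset++] & 0xFF);  // big-endian fixed-width ord
      }
      counts[countsBase[dim] + ord]++;
    }
  }
}
{code}
A specialized decoder simply bakes the widths and base offsets in as constants, as
in the snippet above, so the inner loop and the layout-array lookups disappear.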
And the hand-specialized version gave a nice further speedup:
{noformat}
                Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
             Respell       54.06      (4.1%)       52.89      (3.3%)   -2.2% (  -9% -    5%)
        OrNotHighLow       62.13      (6.9%)       62.79      (7.7%)    1.1% ( -12% -   16%)
     MedSloppyPhrase        3.58      (6.5%)        3.63      (7.1%)    1.4% ( -11% -   16%)
    HighSloppyPhrase        3.86      (8.6%)        3.93      (9.7%)    1.7% ( -15% -   21%)
         LowSpanNear        9.06      (4.3%)        9.21      (4.8%)    1.7% (  -7% -   11%)
           LowPhrase       12.30      (6.4%)       12.61      (7.0%)    2.5% ( -10% -   16%)
          AndHighLow      401.45      (1.4%)      429.51      (1.9%)    7.0% (   3% -   10%)
              Fuzzy1       62.28      (2.2%)       66.91      (2.3%)    7.4% (   2% -   12%)
     LowSloppyPhrase       39.37      (1.7%)       42.77      (2.2%)    8.6% (   4% -   12%)
         MedSpanNear       26.77      (3.1%)       29.15      (3.2%)    8.9% (   2% -   15%)
        OrNotHighMed       32.14      (4.8%)       35.52      (6.4%)   10.5% (   0% -   22%)
          HighPhrase        4.07      (8.1%)        4.54     (10.0%)   11.7% (  -5% -   32%)
          AndHighMed       27.72      (1.0%)       31.10      (0.8%)   12.2% (  10% -   14%)
              Fuzzy2       43.95      (2.4%)       50.09      (2.7%)   14.0% (   8% -   19%)
         AndHighHigh       25.06      (1.0%)       28.58      (0.9%)   14.0% (  12% -   16%)
        HighSpanNear        5.19      (3.5%)        6.03      (4.4%)   16.3% (   8% -   25%)
           MedPhrase      129.83      (4.8%)      151.45      (6.8%)   16.7% (   4% -   29%)
             Prefix3       27.68      (1.1%)       34.91      (1.2%)   26.1% (  23% -   28%)
       OrNotHighHigh       15.03      (2.1%)       19.11      (4.1%)   27.1% (  20% -   34%)
             MedTerm       26.60      (1.5%)       35.40      (3.1%)   33.1% (  27% -   38%)
            Wildcard        9.05      (1.8%)       12.15      (2.0%)   34.2% (  29% -   38%)
        OrHighNotMed       14.68      (1.6%)       19.84      (3.4%)   35.2% (  29% -   40%)
       OrHighNotHigh        8.79      (2.0%)       11.91      (4.3%)   35.5% (  28% -   42%)
            HighTerm       19.85      (1.7%)       26.98      (3.3%)   35.9% (  30% -   41%)
             LowTerm      132.41      (1.4%)      180.49      (3.9%)   36.3% (  30% -   42%)
           OrHighMed       11.67      (1.5%)       16.36      (3.4%)   40.2% (  34% -   45%)
          OrHighHigh        3.70      (1.9%)        5.34      (4.4%)   44.3% (  37% -   51%)
        OrHighNotLow        8.89      (1.7%)       12.90      (4.3%)   45.2% (  38% -   52%)
           OrHighLow        5.20      (1.7%)        7.57      (4.2%)   45.7% (  39% -   52%)
              IntNRQ        2.61      (1.1%)        4.03      (1.7%)   54.4% (  50% -   57%)
{noformat}
> explore per-dimension fixed-width ordinal encoding
> --------------------------------------------------
>
> Key: LUCENE-5308
> URL: https://issues.apache.org/jira/browse/LUCENE-5308
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Michael McCandless
> Attachments: LUCENE-5308.patch
>
>
> I've been testing performance of Solr vs Lucene facets, and one area
> where Solr's "fcs" method shines (low RAM, high faceting perf) is in
> low-cardinality dimensions.
> I suspect the gains are because the field-cache entries encode the ords
> in "column-stride" form, private to each dim (vs. the facet module's
> shared ord space).
> So I thought about whether we could do something like this in the
> facet module ...
> I.e., if we know certain documents will have a specific set of
> single-valued dimensions, we can pick an encoding format for the
> per-doc byte[] "globally" for all such documents, and use private ord
> space per-dimension to improve compression.
> The basic idea is to pre-assign up-front (before the segment is
> written) which bytes belong to which dim. E.g., date takes bytes 0-1
> (<= 65536 unique labels), imageCount takes byte 2 (<= 256 unique
> labels), username takes bytes 3-5 (<= 16.8 M unique labels), etc.
> This only works for single-valued dims, and only works if all docs
> (or at least an identifiable subset?) have all dims.
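> (A minimal sketch, not from the issue or the patch: assuming a hypothetical
> planLayout helper, the up-front assignment just gives each single-valued dim
> the number of bytes needed to cover its label cardinality and a starting byte
> offset in the per-doc byte[].)
> {code}
> // Hypothetical helper (not in the patch): given each dim's label cardinality,
> // assign a fixed byte width and a starting byte offset in the per-doc byte[].
> static int[][] planLayout(int[] cardinalities) {
>   int[] widths = new int[cardinalities.length];
>   int[] byteOffsets = new int[cardinalities.length];
>   int next = 0;
>   for (int dim = 0; dim < cardinalities.length; dim++) {
>     int width = 1;
>     while ((1L << (8 * width)) < cardinalities[dim]) {
>       width++;  // e.g. 65536 labels -> 2 bytes, 16.8 M labels -> 3 bytes
>     }
>     widths[dim] = width;
>     byteOffsets[dim] = next;  // this dim owns bytes [next, next + width)
>     next += width;
>   }
>   return new int[][] { widths, byteOffsets };
> }
> {code}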
> To test this idea, I made a hacked up prototype patch; it has tons of
> limitations so we clearly can't commit it, but I was able to test full
> wikipedia en with 7 facet dims (date, username, refCount, imageCount,
> sectionCount, subSectionCount, subSubSectionCount).
> Trunk (base) requires 181 MB of net doc values to hold the facet ords,
> while the patch requires 183 MB.
> Perf:
> {noformat}
> Report after iter 19:
>                 Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
>              Respell       54.30      (3.1%)       54.02      (2.7%)   -0.5% (  -6% -    5%)
>      MedSloppyPhrase        3.58      (5.6%)        3.60      (6.0%)    0.6% ( -10% -   12%)
>         OrNotHighLow       63.58      (6.8%)       64.03      (6.9%)    0.7% ( -12% -   15%)
>     HighSloppyPhrase        3.80      (7.4%)        3.84      (7.1%)    1.1% ( -12% -   16%)
>          LowSpanNear        8.93      (3.5%)        9.09      (4.6%)    1.8% (  -6% -   10%)
>            LowPhrase       12.15      (6.4%)       12.43      (7.2%)    2.3% ( -10% -   17%)
>           AndHighLow      402.54      (1.4%)      425.23      (2.3%)    5.6% (   1% -    9%)
>      LowSloppyPhrase       39.53      (1.6%)       42.01      (1.9%)    6.3% (   2% -    9%)
>          MedSpanNear       26.54      (2.8%)       28.39      (3.6%)    7.0% (   0% -   13%)
>           HighPhrase        4.01      (8.1%)        4.30      (9.7%)    7.4% (  -9% -   27%)
>               Fuzzy2       44.01      (2.3%)       47.43      (1.8%)    7.8% (   3% -   12%)
>         OrNotHighMed       32.64      (4.7%)       35.22      (5.5%)    7.9% (  -2% -   19%)
>               Fuzzy1       62.24      (2.1%)       67.35      (1.9%)    8.2% (   4% -   12%)
>            MedPhrase      129.06      (4.9%)      141.14      (6.2%)    9.4% (  -1% -   21%)
>           AndHighMed       27.71      (0.7%)       30.32      (1.1%)    9.4% (   7% -   11%)
>         HighSpanNear        5.15      (3.5%)        5.63      (4.2%)    9.5% (   1% -   17%)
>          AndHighHigh       24.98      (0.7%)       27.89      (1.1%)   11.7% (   9% -   13%)
>        OrNotHighHigh       15.13      (2.0%)       17.90      (2.6%)   18.3% (  13% -   23%)
>             Wildcard        9.06      (1.4%)       10.85      (2.6%)   19.8% (  15% -   24%)
>        OrHighNotHigh        8.84      (1.8%)       10.64      (2.6%)   20.3% (  15% -   25%)
>           OrHighHigh        3.73      (1.6%)        4.51      (2.4%)   20.9% (  16% -   25%)
>            OrHighLow        5.22      (1.5%)        6.34      (2.5%)   21.4% (  17% -   25%)
>         OrHighNotLow        8.94      (1.6%)       10.95      (2.5%)   22.5% (  18% -   26%)
>              Prefix3       27.61      (1.2%)       33.90      (2.3%)   22.8% (  19% -   26%)
>            OrHighMed       11.72      (1.6%)       14.56      (2.3%)   24.3% (  20% -   28%)
>         OrHighNotMed       14.74      (1.5%)       18.34      (2.2%)   24.5% (  20% -   28%)
>              MedTerm       26.37      (1.2%)       32.85      (2.7%)   24.6% (  20% -   28%)
>               IntNRQ        2.61      (1.2%)        3.25      (3.0%)   24.7% (  20% -   29%)
>             HighTerm       19.69      (1.3%)       25.33      (3.0%)   28.7% (  23% -   33%)
>              LowTerm      131.50      (1.3%)      170.49      (3.0%)   29.7% (  25% -   34%)
> {noformat}
> I think the gains are sizable, and the increase in index size is quite
> minor (in another test with fewer dims I saw the index size get a bit
> smaller) ... at least for this specific test.
> However, finding a clean solution here will be tricky...