Michael McCandless created LUCENE-5308:
------------------------------------------
Summary: explore per-dimension fixed-width ordinal encoding
Key: LUCENE-5308
URL: https://issues.apache.org/jira/browse/LUCENE-5308
Project: Lucene - Core
Issue Type: Improvement
Components: modules/facet
Reporter: Michael McCandless
I've been testing performance of Solr vs Lucene facets, and one area
where Solr's "fcs" method shines (low RAM, high faceting perf) is in
low-cardinality dimensions.
I suspect the gains come from how the field-cache entries encode the
ords: in "column-stride" form, and private to that dim (vs. the facet
module's shared ord space).
So I thought about whether we could do something like this in the
facet module ...
I.e., if we know certain documents will have a specific set of
single-valued dimensions, we can pick an encoding format for the
per-doc byte[] "globally" for all such documents, and use a private
ord space per dimension to improve compression.
The basic idea is to pre-assign up-front (before the segment is
written) which bytes belong to which dim. E.g., date takes bytes 0-1
(<= 65536 unique labels), imageCount takes byte 2 (<= 256
unique labels), username takes bytes 3-6 (<= 16.8 M unique labels),
etc. This only works for single-valued dims, and only works if all
docs (or at least an identifiable subset?) have all dims.
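To make the layout concrete, here is a minimal sketch (hypothetical names,
not the actual prototype patch) of how the up-front byte assignment and the
per-doc packing could look; each doc's packed value would then be stored in
a binary doc values field:
{noformat}
// Hypothetical sketch, not the prototype patch: give each single-valued
// dim a fixed byte slice of the per-doc value, sized from that dim's
// maximum ordinal, and pack/unpack ords big-endian into that slice.
class FixedWidthOrdEncoder {
  private final int[] offsets;   // starting byte of each dim's slice
  private final int[] widths;    // number of bytes in each dim's slice
  private final int totalBytes;  // fixed per-doc byte[] length

  FixedWidthOrdEncoder(int[] maxOrdPerDim) {
    offsets = new int[maxOrdPerDim.length];
    widths = new int[maxOrdPerDim.length];
    int offset = 0;
    for (int dim = 0; dim < maxOrdPerDim.length; dim++) {
      offsets[dim] = offset;
      // smallest whole number of bytes that can hold this dim's max ord
      int width = 1;
      long limit = 256;
      while (limit <= maxOrdPerDim[dim]) {
        width++;
        limit <<= 8;
      }
      widths[dim] = width;
      offset += width;
    }
    totalBytes = offset;
  }

  /** Pack one doc's per-dim ordinals into the fixed-width byte[]. */
  byte[] encode(int[] ordPerDim) {
    byte[] buf = new byte[totalBytes];
    for (int dim = 0; dim < ordPerDim.length; dim++) {
      int ord = ordPerDim[dim];
      for (int b = widths[dim] - 1; b >= 0; b--) {
        buf[offsets[dim] + b] = (byte) (ord & 0xFF);
        ord >>>= 8;
      }
    }
    return buf;
  }

  /** Read one dim's ordinal (in its private ord space) back out. */
  int decode(byte[] buf, int dim) {
    int ord = 0;
    for (int b = 0; b < widths[dim]; b++) {
      ord = (ord << 8) | (buf[offsets[dim] + b] & 0xFF);
    }
    return ord;
  }
}
{noformat}
At search time, counting a dim would then only need to read that dim's
fixed slice of each doc's byte[], so small dims stay cheap no matter how
large the other dims' ord spaces are.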
To test this idea, I made a hacked-up prototype patch; it has tons of
limitations so we clearly can't commit it, but I was able to test full
English Wikipedia with 7 facet dims (date, username, refCount, imageCount,
sectionCount, subSectionCount, subSubSectionCount).
Trunk (base) requires 181 MB of net doc values to hold the facet ords,
while the patch requires 183 MB.
Perf:
{noformat}
Report after iter 19:
                Task    QPS base    StdDev    QPS comp    StdDev              Pct diff
             Respell       54.30    (3.1%)       54.02    (2.7%)    -0.5% ( -6% -   5%)
     MedSloppyPhrase        3.58    (5.6%)        3.60    (6.0%)     0.6% (-10% -  12%)
        OrNotHighLow       63.58    (6.8%)       64.03    (6.9%)     0.7% (-12% -  15%)
    HighSloppyPhrase        3.80    (7.4%)        3.84    (7.1%)     1.1% (-12% -  16%)
         LowSpanNear        8.93    (3.5%)        9.09    (4.6%)     1.8% ( -6% -  10%)
           LowPhrase       12.15    (6.4%)       12.43    (7.2%)     2.3% (-10% -  17%)
          AndHighLow      402.54    (1.4%)      425.23    (2.3%)     5.6% (  1% -   9%)
     LowSloppyPhrase       39.53    (1.6%)       42.01    (1.9%)     6.3% (  2% -   9%)
         MedSpanNear       26.54    (2.8%)       28.39    (3.6%)     7.0% (  0% -  13%)
          HighPhrase        4.01    (8.1%)        4.30    (9.7%)     7.4% ( -9% -  27%)
              Fuzzy2       44.01    (2.3%)       47.43    (1.8%)     7.8% (  3% -  12%)
        OrNotHighMed       32.64    (4.7%)       35.22    (5.5%)     7.9% ( -2% -  19%)
              Fuzzy1       62.24    (2.1%)       67.35    (1.9%)     8.2% (  4% -  12%)
           MedPhrase      129.06    (4.9%)      141.14    (6.2%)     9.4% ( -1% -  21%)
          AndHighMed       27.71    (0.7%)       30.32    (1.1%)     9.4% (  7% -  11%)
        HighSpanNear        5.15    (3.5%)        5.63    (4.2%)     9.5% (  1% -  17%)
         AndHighHigh       24.98    (0.7%)       27.89    (1.1%)    11.7% (  9% -  13%)
       OrNotHighHigh       15.13    (2.0%)       17.90    (2.6%)    18.3% ( 13% -  23%)
            Wildcard        9.06    (1.4%)       10.85    (2.6%)    19.8% ( 15% -  24%)
       OrHighNotHigh        8.84    (1.8%)       10.64    (2.6%)    20.3% ( 15% -  25%)
          OrHighHigh        3.73    (1.6%)        4.51    (2.4%)    20.9% ( 16% -  25%)
           OrHighLow        5.22    (1.5%)        6.34    (2.5%)    21.4% ( 17% -  25%)
        OrHighNotLow        8.94    (1.6%)       10.95    (2.5%)    22.5% ( 18% -  26%)
             Prefix3       27.61    (1.2%)       33.90    (2.3%)    22.8% ( 19% -  26%)
           OrHighMed       11.72    (1.6%)       14.56    (2.3%)    24.3% ( 20% -  28%)
        OrHighNotMed       14.74    (1.5%)       18.34    (2.2%)    24.5% ( 20% -  28%)
             MedTerm       26.37    (1.2%)       32.85    (2.7%)    24.6% ( 20% -  28%)
              IntNRQ        2.61    (1.2%)        3.25    (3.0%)    24.7% ( 20% -  29%)
            HighTerm       19.69    (1.3%)       25.33    (3.0%)    28.7% ( 23% -  33%)
             LowTerm      131.50    (1.3%)      170.49    (3.0%)    29.7% ( 25% -  34%)
{noformat}
I think the gains are sizable, and the increase in index size is quite
minor (in another test with fewer dims I saw the index size get a bit
smaller) ... at least for this specific test.
However, finding a clean solution here will be tricky...