[ https://issues.apache.org/jira/browse/LUCENE-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13806164#comment-13806164 ]
Michael McCandless commented on LUCENE-5308:
--------------------------------------------
One nice side effect of the fixed-width encoding is that it would be simple
to build specialized decoders; this could be done once (globally) after
opening a new reader, e.g. with ASM bytecode generation.
To test the potential gain I hand-specialized the decode for my current
index, like this:
{code}
// Each dim's ord is a fixed-width, big-endian slice of the per-doc byte[];
// the constants (234, 490, ...) are where each dim's private ord space
// starts in the shared counts[] array.
// imageCount
counts[bytes[offset] & 0xFF]++;
// refCount
counts[234 + (bytes[offset+1] & 0xFF)]++;
// sectionCount
counts[490 + (bytes[offset+2] & 0xFF)]++;
// subSectionCount
counts[713 + (bytes[offset+3] & 0xFF)]++;
// subSubSectionCount
counts[950 + (bytes[offset+4] & 0xFF)]++;
// date
counts[1193 + ((bytes[offset+5] & 0xFF) << 8) + (bytes[offset+6] & 0xFF)]++;
// userName
counts[4473 + ((bytes[offset+7] & 0xFF) << 16) + ((bytes[offset+8] & 0xFF) << 8) + (bytes[offset+9] & 0xFF)]++;
{code}
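For contrast, here is a minimal sketch (not from the patch; the FixedWidthLayout
class and its field names are hypothetical) of the generic per-document loop the
snippet above hand-unrolls: each dim declares how many bytes its ordinal occupies
and where its private ord space starts in the shared counts[] array:
{code}
// Hypothetical generic decode: walk each dimension's fixed-width slice of the
// per-doc byte[] and increment that dimension's region of the shared counts[].
class FixedWidthLayout {
  final int[] byteWidths;  // bytes per dim, e.g. {1, 1, 1, 1, 1, 2, 3}
  final int[] countsBase;  // start of each dim's private ord space in counts[]

  FixedWidthLayout(int[] byteWidths, int[] countsBase) {
    this.byteWidths = byteWidths;
    this.countsBase = countsBase;
  }

  void countOneDoc(byte[] bytes, int offset, int[] counts) {
    for (int dim = 0; dim < byteWidths.length; dim++) {
      int ord = 0;
      for (int i = 0; i < byteWidths[dim]; i++) {
        ord = (ord << 8) | (bytes[offset++] & 0xFF);  // big-endian fixed-width ord
      }
      counts[countsBase[dim] + ord]++;
    }
  }
}
{code}
A specialized decoder simply bakes the widths and base offsets in as constants, as
in the snippet above, so the inner loop and the layout-array lookups disappear.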
And the hand-specialized version gave a nice further speedup:
{noformat}
                Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
             Respell       54.06      (4.1%)       52.89      (3.3%)   -2.2% (  -9% -    5%)
        OrNotHighLow       62.13      (6.9%)       62.79      (7.7%)    1.1% ( -12% -   16%)
     MedSloppyPhrase        3.58      (6.5%)        3.63      (7.1%)    1.4% ( -11% -   16%)
    HighSloppyPhrase        3.86      (8.6%)        3.93      (9.7%)    1.7% ( -15% -   21%)
         LowSpanNear        9.06      (4.3%)        9.21      (4.8%)    1.7% (  -7% -   11%)
           LowPhrase       12.30      (6.4%)       12.61      (7.0%)    2.5% ( -10% -   16%)
          AndHighLow      401.45      (1.4%)      429.51      (1.9%)    7.0% (   3% -   10%)
              Fuzzy1       62.28      (2.2%)       66.91      (2.3%)    7.4% (   2% -   12%)
     LowSloppyPhrase       39.37      (1.7%)       42.77      (2.2%)    8.6% (   4% -   12%)
         MedSpanNear       26.77      (3.1%)       29.15      (3.2%)    8.9% (   2% -   15%)
        OrNotHighMed       32.14      (4.8%)       35.52      (6.4%)   10.5% (   0% -   22%)
          HighPhrase        4.07      (8.1%)        4.54     (10.0%)   11.7% (  -5% -   32%)
          AndHighMed       27.72      (1.0%)       31.10      (0.8%)   12.2% (  10% -   14%)
              Fuzzy2       43.95      (2.4%)       50.09      (2.7%)   14.0% (   8% -   19%)
         AndHighHigh       25.06      (1.0%)       28.58      (0.9%)   14.0% (  12% -   16%)
        HighSpanNear        5.19      (3.5%)        6.03      (4.4%)   16.3% (   8% -   25%)
           MedPhrase      129.83      (4.8%)      151.45      (6.8%)   16.7% (   4% -   29%)
             Prefix3       27.68      (1.1%)       34.91      (1.2%)   26.1% (  23% -   28%)
       OrNotHighHigh       15.03      (2.1%)       19.11      (4.1%)   27.1% (  20% -   34%)
             MedTerm       26.60      (1.5%)       35.40      (3.1%)   33.1% (  27% -   38%)
            Wildcard        9.05      (1.8%)       12.15      (2.0%)   34.2% (  29% -   38%)
        OrHighNotMed       14.68      (1.6%)       19.84      (3.4%)   35.2% (  29% -   40%)
       OrHighNotHigh        8.79      (2.0%)       11.91      (4.3%)   35.5% (  28% -   42%)
            HighTerm       19.85      (1.7%)       26.98      (3.3%)   35.9% (  30% -   41%)
             LowTerm      132.41      (1.4%)      180.49      (3.9%)   36.3% (  30% -   42%)
           OrHighMed       11.67      (1.5%)       16.36      (3.4%)   40.2% (  34% -   45%)
          OrHighHigh        3.70      (1.9%)        5.34      (4.4%)   44.3% (  37% -   51%)
        OrHighNotLow        8.89      (1.7%)       12.90      (4.3%)   45.2% (  38% -   52%)
           OrHighLow        5.20      (1.7%)        7.57      (4.2%)   45.7% (  39% -   52%)
              IntNRQ        2.61      (1.1%)        4.03      (1.7%)   54.4% (  50% -   57%)
{noformat}
> explore per-dimension fixed-width ordinal encoding
> --------------------------------------------------
>
> Key: LUCENE-5308
> URL: https://issues.apache.org/jira/browse/LUCENE-5308
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Michael McCandless
> Attachments: LUCENE-5308.patch
>
>
> I've been testing performance of Solr vs Lucene facets, and one area
> where Solr's "fcs" method shines (low RAM, high faceting perf) is in
> low-cardinality dimensions.
> I suspect the gains are because the field-cache entries encode the ords
> in "column-stride" form, private to each dim (vs. the facet module's
> shared ord space).
> So I thought about whether we could do something like this in the
> facet module ...
> I.e., if we know certain documents will have a specific set of
> single-valued dimensions, we can pick an encoding format for the
> per-doc byte[] "globally" for all such documents, and use private ord
> space per-dimension to improve compression.
> The basic idea is to pre-assign up-front (before the segment is
> written) which bytes belong to which dim. E.g., date takes bytes 0-1
> (<= 65536 unique labels), imageCount takes byte 2 (<= 256 unique
> labels), username takes bytes 3-5 (<= 16.8 M unique labels), etc.
> This only works for single-valued dims, and only works if all docs
> (or at least an identifiable subset?) have all dims.
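> (A minimal sketch, not from the issue or the patch: assuming a hypothetical
> planLayout helper, the up-front assignment just gives each single-valued dim
> the number of bytes needed to cover its label cardinality and a starting byte
> offset in the per-doc byte[].)
> {code}
> // Hypothetical helper (not in the patch): given each dim's label cardinality,
> // assign a fixed byte width and a starting byte offset in the per-doc byte[].
> static int[][] planLayout(int[] cardinalities) {
>   int[] widths = new int[cardinalities.length];
>   int[] byteOffsets = new int[cardinalities.length];
>   int next = 0;
>   for (int dim = 0; dim < cardinalities.length; dim++) {
>     int width = 1;
>     while ((1L << (8 * width)) < cardinalities[dim]) {
>       width++;  // e.g. 65536 labels -> 2 bytes, 16.8 M labels -> 3 bytes
>     }
>     widths[dim] = width;
>     byteOffsets[dim] = next;  // this dim owns bytes [next, next + width)
>     next += width;
>   }
>   return new int[][] { widths, byteOffsets };
> }
> {code}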
> To test this idea, I made a hacked up prototype patch; it has tons of
> limitations so we clearly can't commit it, but I was able to test full
> wikipedia en with 7 facet dims (date, username, refCount, imageCount,
> sectionCount, subSectionCount, subSubSectionCount).
> Trunk (base) requires 181 MB of net doc values to hold the facet ords,
> while the patch requires 183 MB.
> Perf:
> {noformat}
> Report after iter 19:
>                 Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
>              Respell       54.30      (3.1%)       54.02      (2.7%)   -0.5% (  -6% -    5%)
>      MedSloppyPhrase        3.58      (5.6%)        3.60      (6.0%)    0.6% ( -10% -   12%)
>         OrNotHighLow       63.58      (6.8%)       64.03      (6.9%)    0.7% ( -12% -   15%)
>     HighSloppyPhrase        3.80      (7.4%)        3.84      (7.1%)    1.1% ( -12% -   16%)
>          LowSpanNear        8.93      (3.5%)        9.09      (4.6%)    1.8% (  -6% -   10%)
>            LowPhrase       12.15      (6.4%)       12.43      (7.2%)    2.3% ( -10% -   17%)
>           AndHighLow      402.54      (1.4%)      425.23      (2.3%)    5.6% (   1% -    9%)
>      LowSloppyPhrase       39.53      (1.6%)       42.01      (1.9%)    6.3% (   2% -    9%)
>          MedSpanNear       26.54      (2.8%)       28.39      (3.6%)    7.0% (   0% -   13%)
>           HighPhrase        4.01      (8.1%)        4.30      (9.7%)    7.4% (  -9% -   27%)
>               Fuzzy2       44.01      (2.3%)       47.43      (1.8%)    7.8% (   3% -   12%)
>         OrNotHighMed       32.64      (4.7%)       35.22      (5.5%)    7.9% (  -2% -   19%)
>               Fuzzy1       62.24      (2.1%)       67.35      (1.9%)    8.2% (   4% -   12%)
>            MedPhrase      129.06      (4.9%)      141.14      (6.2%)    9.4% (  -1% -   21%)
>           AndHighMed       27.71      (0.7%)       30.32      (1.1%)    9.4% (   7% -   11%)
>         HighSpanNear        5.15      (3.5%)        5.63      (4.2%)    9.5% (   1% -   17%)
>          AndHighHigh       24.98      (0.7%)       27.89      (1.1%)   11.7% (   9% -   13%)
>        OrNotHighHigh       15.13      (2.0%)       17.90      (2.6%)   18.3% (  13% -   23%)
>             Wildcard        9.06      (1.4%)       10.85      (2.6%)   19.8% (  15% -   24%)
>        OrHighNotHigh        8.84      (1.8%)       10.64      (2.6%)   20.3% (  15% -   25%)
>           OrHighHigh        3.73      (1.6%)        4.51      (2.4%)   20.9% (  16% -   25%)
>            OrHighLow        5.22      (1.5%)        6.34      (2.5%)   21.4% (  17% -   25%)
>         OrHighNotLow        8.94      (1.6%)       10.95      (2.5%)   22.5% (  18% -   26%)
>              Prefix3       27.61      (1.2%)       33.90      (2.3%)   22.8% (  19% -   26%)
>            OrHighMed       11.72      (1.6%)       14.56      (2.3%)   24.3% (  20% -   28%)
>         OrHighNotMed       14.74      (1.5%)       18.34      (2.2%)   24.5% (  20% -   28%)
>              MedTerm       26.37      (1.2%)       32.85      (2.7%)   24.6% (  20% -   28%)
>               IntNRQ        2.61      (1.2%)        3.25      (3.0%)   24.7% (  20% -   29%)
>             HighTerm       19.69      (1.3%)       25.33      (3.0%)   28.7% (  23% -   33%)
>              LowTerm      131.50      (1.3%)      170.49      (3.0%)   29.7% (  25% -   34%)
> {noformat}
> I think the gains are sizable, and the increase in index size is quite
> minor (in another test with fewer dims I saw the index size get a bit
> smaller) ... at least for this specific test.
> However, finding a clean solution here will be tricky...