[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-4609:
---------------------------------------

    Attachment: LUCENE-4609.patch

Patch with a "custom" (not using our PackedInts APIs) packed ints 
encoder/decoder.  It uses only as many bytes as are necessary, and packs bpv & 
"leftoverBits" into a single byte header.

I tested on the first 1M Wikipedia docs ... and performance is much worse than 
the current default in trunk.  Admittedly it's not quite a fair comparison 
(trunk has a specialized vInt/dGap decoder, while the patch leaves dGap 
separate from the packed-int decode), and admittedly this decoder will be 
slower than the optimized oal.util.PackedInts ... but performance is so far 
off that I find it hard to believe PackedInts can match vInt even after 
optimizing.
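For reference, this is roughly what the separate dGap step looks like (an illustration of the idea only, not the actual encoder classes): the sorted ordinals are delta-encoded before packing, so the decoder pays an extra prefix-sum pass after unpacking, whereas a fused vInt/dGap decoder folds that accumulation into the read loop:

{code:java}
// Illustration only: encode sorted ordinals as gaps (deltas) so the packed
// values stay small, then restore them with a prefix sum on decode.
static int[] toGaps(int[] sortedOrds) {
  int[] gaps = new int[sortedOrds.length];
  int prev = 0;
  for (int i = 0; i < sortedOrds.length; i++) {
    gaps[i] = sortedOrds[i] - prev;   // small, non-negative deltas
    prev = sortedOrds[i];
  }
  return gaps;
}

static void restoreOrds(int[] gaps) {
  // in-place prefix sum: gaps[i] becomes the original ordinal again
  for (int i = 1; i < gaps.length; i++) {
    gaps[i] += gaps[i - 1];
  }
}
{code}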

Trunk gets these results:
{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                PKLookup      203.77      (1.8%)      202.25      (1.8%)   -0.7% (  -4% -    2%)
                HighTerm       20.43      (1.8%)       20.53      (0.8%)    0.5% (  -2% -    3%)
                 MedTerm       33.12      (1.7%)       33.30      (0.9%)    0.5% (  -2% -    3%)
                 LowTerm       87.55      (3.0%)       88.59      (2.5%)    1.2% (  -4% -    6%)
{noformat}

Patch gets this:
{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                HighTerm       10.82      (3.6%)       10.69      (4.4%)   -1.2% (  -8% -    7%)
                 MedTerm       19.33      (3.2%)       19.10      (4.0%)   -1.2% (  -8% -    6%)
                 LowTerm       67.75      (2.8%)       67.11      (3.0%)   -0.9% (  -6% -    5%)
                PKLookup      196.49      (1.0%)      196.24      (1.9%)   -0.1% (  -3% -    2%)
{noformat}

(NOTE: base and comp are the same within each run, so ignore the differences 
within a run (they're noise) and compare the absolute numbers across the two 
runs, i.e. HighTerm gets ~20.43 QPS with trunk but ~10.82 with the patch.)

Also: trunk took ~63 MB for the DV files while the patch took ~84 MB.  Net/net 
I think postings compress better with PackedInts than facet ords do (at least 
for these 9 facet fields I'm using in Wikipedia)...
                
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>
>                 Key: LUCENE-4609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4609
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/facet
>            Reporter: Shai Erera
>            Priority: Minor
>         Attachments: LUCENE-4609.patch, LUCENE-4609.patch
>
>
> Today the facets API lets you write an IntEncoder/Decoder to encode/decode the 
> category ordinals. We have several such encoders, including VInt (the default) 
> and block encoders.
> It would be interesting to implement and benchmark a PackedIntsEncoder/Decoder, 
> with potentially two variants: (1) one that receives bitsPerValue up front, for 
> when you e.g. know you have a small taxonomy and the max value you can see, and 
> (2) one that decides the optimal bitsPerValue for each doc and writes it as a 
> header in the byte[] or something.
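
A rough sketch of the difference between the two variants (a hypothetical helper, not the existing facet IntEncoder/IntDecoder API): variant (1) would fix bitsPerValue up front, e.g. from the taxonomy size, while variant (2) would compute the smallest width that fits each doc's ordinals and store it in the per-doc header:

{code:java}
// Hypothetical helper for variant (2): pick the smallest width that fits
// the largest ordinal in this doc; variant (1) would skip this and use a
// single bitsPerValue chosen at construction time.
static int optimalBitsPerValue(int[] ords) {
  int max = 0;
  for (int ord : ords) {
    max = Math.max(max, ord);
  }
  return Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
}
{code}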
