[jira] [Commented] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets

Gilad Barkai (JIRA) Wed, 19 Dec 2012 11:35:14 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13536347#comment-13536347
 ]


Gilad Barkai commented on LUCENE-4609:
--------------------------------------

bq. Do you encode the gaps or the straight up ords?

Well, it's an 'end point' encoder, meaning it encodes whatever values it 
receives directly to the output.
One could create an encoder as: {{new SortingIntEncoder(new 
UniqueValuesIntEncoder(new DGapIntEncoder(new PackedEncoder())))}}, so the 
values the packed encoder receives have already been sorted, de-duplicated 
and dgap-encoded.
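
To illustrate what the innermost encoder would see in such a chain, here is a 
minimal self-contained sketch in plain Java. It does not use the facet 
module's encoder classes; the names {{DgapPipelineSketch}} and 
{{sortUniqueDgap}} are hypothetical, and the code only mimics the sort, unique 
and dgap steps on an int array.

```java
import java.util.Arrays;

/** Hypothetical sketch: mimics sort -> unique -> dgap on category ordinals. */
class DgapPipelineSketch {

    /** Returns the gaps between the sorted, de-duplicated input values. */
    static int[] sortUniqueDgap(int[] ordinals) {
        int[] sorted = ordinals.clone();
        Arrays.sort(sorted);                       // sorting step

        int[] out = new int[sorted.length];
        int count = 0;
        int prev = 0;                              // first gap is relative to 0
        for (int i = 0; i < sorted.length; i++) {
            if (i > 0 && sorted[i] == sorted[i - 1]) {
                continue;                          // unique step: skip duplicates
            }
            out[count++] = sorted[i] - prev;       // dgap step
            prev = sorted[i];
        }
        return Arrays.copyOf(out, count);
    }

    public static void main(String[] args) {
        int[] gaps = sortUniqueDgap(new int[] {17, 3, 9, 3, 10});
        System.out.println(Arrays.toString(gaps)); // prints [3, 6, 1, 7]
    }
}
```

The end-point encoder would then only have to pack the small gap values 
(3, 6, 1, 7) rather than the raw ordinals.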

{quote}
This is PForDelta compression (the outliers are encoded separately) I think? We 
can test it and see if it helps ... but we weren't so happy with it for 
encoding postings (it adds complexity, slows down decode, and didn't seem to 
help that much in reducing the size).
{quote}

PForDelta is indeed slower. But we've met scenarios in which most dgaps are 
small - hence the NOnes, and the Four/Eight Flag encoders. If most values are 
small, say, small enough to fit in 4 bits, but there are also one or two 
larger values which require 12 or 14 bits, we could benefit greatly here.
This is all relevant only where there is a large number of categories per 
document.
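
As a rough illustration of that trade-off, the sketch below compares two 
simplified cost models: packing every value at the width of the largest one, 
versus packing at a small width and storing the outliers as exceptions. The 
class {{OutlierCostSketch}}, its methods, and the 32-bit cost per exception are 
assumptions made for the example; this is not the actual PForDelta layout nor 
the encoder in the patch.

```java
/** Hypothetical cost model; not the actual PForDelta or facet encoder layout. */
class OutlierCostSketch {

    /** Bits needed to represent a non-negative value (at least 1). */
    static int bitsRequired(int value) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(value));
    }

    /** Total bits if every value is packed at the width of the largest one. */
    static long plainPackedBits(int[] values) {
        int max = 1;
        for (int v : values) {
            max = Math.max(max, bitsRequired(v));
        }
        return (long) max * values.length;
    }

    /**
     * Total bits if every value gets a smallBits-wide slot and each value that
     * does not fit additionally costs exceptionBits (assumed to cover its
     * position plus its full value).
     */
    static long patchedBits(int[] values, int smallBits, int exceptionBits) {
        long total = (long) smallBits * values.length;
        for (int v : values) {
            if (bitsRequired(v) > smallBits) {
                total += exceptionBits;
            }
        }
        return total;
    }
}
```

Under this model, 20 dgaps of which 18 fit in 4 bits and two need 12 and 14 
bits cost 20 * 14 = 280 bits when packed plainly, but 20 * 4 + 2 * 32 = 144 
bits with the outliers stored separately.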

bq. it seems like you are writing the full header per field
That is right. To be frank, I'm not 100% sure what {{PackedInts}} does, nor 
how large its header is. 
But I think perhaps some header per doc is required anyway? For bits-per-value 
smaller than the size of a byte, the decoder needs to know how many bits of 
the last byte read should be ignored. 

I started writing my own version as a first step toward the 'mixed' version, in 
which a 1 byte header is written, containing both the 'bits per value' in the 
first 5 bits and the number of extra bits in the last 3 bits. I'm still 
playing with it, hope to share it soon.
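
A minimal sketch of how such a 1 byte header could be built and read back is 
shown below. It is not the code from the patch: the class 
{{PackedHeaderSketch}} and its methods are hypothetical, and it assumes that 
the 'first 5 bits' are the high-order bits, that bits per value is in 
[1, 31] (enough for non-negative ints), and that the 'extra bits' are the 
unused padding bits in the last payload byte.

```java
/** Hypothetical 1-byte header layout: [bitsPerValue:5][unusedBitsInLastByte:3]. */
class PackedHeaderSketch {

    /** Builds the header for numValues values packed at bitsPerValue bits each. */
    static byte encodeHeader(int bitsPerValue, int numValues) {
        if (bitsPerValue < 1 || bitsPerValue > 31) {
            throw new IllegalArgumentException("bitsPerValue must be in [1, 31]");
        }
        long totalBits = (long) bitsPerValue * numValues;
        int unused = (int) ((8 - (totalBits % 8)) % 8);  // padding bits in the last byte
        return (byte) ((bitsPerValue << 3) | unused);
    }

    /** Reads the bits-per-value field from the high 5 bits. */
    static int bitsPerValue(byte header) {
        return (header & 0xFF) >>> 3;
    }

    /** Reads the number of unused bits from the low 3 bits. */
    static int unusedBits(byte header) {
        return header & 0x07;
    }

    /** Recovers the value count from the header and the payload length in bytes. */
    static int numValues(byte header, int payloadBytes) {
        long usedBits = 8L * payloadBytes - unusedBits(header);
        return (int) (usedBits / bitsPerValue(header));
    }
}
```

For example, five values at 4 bits each occupy 20 bits, i.e. 3 payload bytes 
with 4 unused bits; without the 3-bit field a decoder could not tell those 5 
values apart from 6 values filling the same 3 bytes.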
                
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>
>                 Key: LUCENE-4609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4609
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/facet
>            Reporter: Shai Erera
>            Priority: Minor
>         Attachments: LUCENE-4609.patch
>
>
> Today the facets API lets you write IntEncoder/Decoder to encode/decode the 
> category ordinals. We have several such encoders, including VInt (default), 
> and block encoders.
> It would be interesting to implement and benchmark a 
> PackedIntsEncoder/Decoder, with potentially two variants: (1) receives 
> bitsPerValue up front, when you e.g. know that you have a small taxonomy and 
> the max value you can see and (2) one that decides for each doc on the 
> optimal bitsPerValue, writes it as a header in the byte[] or something.

