[jira] Issue Comment Edited: (LUCENE-1990) Add unsigned packed int impls in oal.util

Fuad Efendi (JIRA) Tue, 10 Nov 2009 08:12:01 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775420#action_12775420
 ]


Fuad Efendi edited comment on LUCENE-1990 at 11/10/09 4:10 PM:
---------------------------------------------------------------

Specifically for FieldCache, let's see... suppose a field can take 8 different 
values and the number of documents is high.

{code}
Value0  0  1  0  0  0  0  0   0  1  0  0  0  0  0 ...  
Value1  1  0  1  0  0  0  0   0  0  0  0  0  0  0 ...  
Value2  0  0  0  1  1  0  0   0  0  0  0  0  0  0 ...  
Value3  0  0  0  0  0  0  0   0  0  0  0  1  0  0 ...  
Value4  0  0  0  0  0  0  1   0  0  0  0  0  0  0 ...  
Value5  0  0  0  0  0  1  0   0  0  0  1  0  1  0 ...  
Value6  0  0  0  0  0  0  0   1  0  1  0  0  0  0 ...  
Value7  0  0  0  0  0  0  0   0  0  0  0  0  0  1 ...
{code}

- represented as a matrix (or as a vector); for instance, the first row means 
that Document1 and Document8 have Value0 (counting documents from 0).

Now, if we go "horizontally", we end up with 8 int[] arrays. What if we go 
"vertically"? The field could be encoded in 3 bits (8 different values).

CONSTRAINT: specifically for FieldCache, each column must contain exactly one "1".

So we can end up with an array of 3-bit values, each storing the position of 
the "1" in its column! The size of the array is IndexReader.maxDoc().
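
A minimal sketch of that "vertical" layout, assuming one 3-bit value per 
document packed into a long[]. The class name ThreeBitOrds and its methods are 
hypothetical and not part of Lucene; the constructor argument stands in for 
IndexReader.maxDoc().

```java
/** Hypothetical sketch: one 3-bit value (0..7) per document, packed into a long[]. */
final class ThreeBitOrds {
    private static final int BITS = 3;
    private static final long MASK = (1L << BITS) - 1; // binary 111

    private final long[] blocks;
    private final int size;

    ThreeBitOrds(int maxDoc) {
        this.size = maxDoc;
        // Total number of bits, rounded up to whole 64-bit blocks.
        this.blocks = new long[(int) (((long) maxDoc * BITS + 63) >>> 6)];
    }

    int size() {
        return size;
    }

    void set(int docId, int value) {
        if (value < 0 || value > MASK) {
            throw new IllegalArgumentException("value out of range: " + value);
        }
        long bitPos = (long) docId * BITS;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        // Clear the old bits in this block, then write the low part of the value.
        blocks[block] = (blocks[block] & ~(MASK << shift)) | ((long) value << shift);
        int spill = shift + BITS - 64; // bits that do not fit in this block
        if (spill > 0) {
            long highMask = (1L << spill) - 1;
            // Write the remaining high bits into the start of the next block.
            blocks[block + 1] = (blocks[block + 1] & ~highMask) | ((long) value >>> (BITS - spill));
        }
    }

    int get(int docId) {
        long bitPos = (long) docId * BITS;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = blocks[block] >>> shift;
        int spill = shift + BITS - 64;
        if (spill > 0) {
            // The value straddles two blocks: fetch its high bits from the next one.
            v |= blocks[block + 1] << (BITS - spill);
        }
        return (int) (v & MASK);
    }
}
```

Because 64 is not a multiple of 3, some values straddle two longs; the spill 
handling above covers that case. An implementation that only allowed bit 
widths dividing 64 would avoid it at the cost of some wasted bits.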


Hope I am reinventing the bicycle :)

P.S.
Of course each solution has pros and cons; I am trying to focus on 
FieldCache-specific use cases:

1. For a given document ID, find the value of a field
2. For a given set of query results, sort them by a field's values
3. For a given set of query results, count the "facet" for each field value
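
The three use cases can be sketched against any per-document ordinal lookup. 
This is only an illustration: OrdUseCases and its methods are hypothetical 
names, the lookup is modelled as a plain IntUnaryOperator (docId to value 
position), and sorting assumes the values array is itself sorted so that 
ordinal order matches value order.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.function.IntUnaryOperator;

/** Hypothetical sketch of the three FieldCache use cases over per-document ordinals. */
final class OrdUseCases {

    // 1. For a given document ID, find the value of the field.
    static String valueOf(int docId, IntUnaryOperator ords, String[] values) {
        return values[ords.applyAsInt(docId)];
    }

    // 2. Sort query results by field value (stable; assumes ordinal order == value order).
    static int[] sortByField(int[] hits, IntUnaryOperator ords) {
        return Arrays.stream(hits)
                .boxed()
                .sorted(Comparator.comparingInt((Integer d) -> ords.applyAsInt(d)))
                .mapToInt(Integer::intValue)
                .toArray();
    }

    // 3. Count how many query results carry each field value ("facet" counts).
    static int[] facetCounts(int[] hits, IntUnaryOperator ords, int numValues) {
        int[] counts = new int[numValues];
        for (int doc : hits) {
            counts[ords.applyAsInt(doc)]++;
        }
        return counts;
    }
}
```

None of the three needs a materialized int[]; a single get-by-document call is 
enough, which is what a packed representation can offer.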

I don't think such naive compression is slower than plain int[] arrays... 
and we would need to change the public API of the field cache too: if a method 
returns int[], we are not saving any RAM.
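
A rough back-of-the-envelope sketch of the RAM argument, ignoring object 
headers. PackedSizing and its methods are hypothetical helpers, not Lucene 
API; bitsRequired mirrors what a factory taking a maxValue would have to 
compute.

```java
/** Hypothetical helpers comparing int[] storage with packed storage. */
final class PackedSizing {

    // Smallest bit width able to hold every value in 0..maxValue (at least 1 bit).
    static int bitsRequired(long maxValue) {
        if (maxValue < 0) {
            throw new IllegalArgumentException("maxValue must be >= 0");
        }
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    // Payload bytes of an int[] with one entry per document.
    static long intArrayBytes(int maxDoc) {
        return 4L * maxDoc;
    }

    // Payload bytes of a long[] holding maxDoc values of the given bit width.
    static long packedBytes(int maxDoc, int bitsPerValue) {
        long totalBits = (long) maxDoc * bitsPerValue;
        return ((totalBits + 63) >>> 6) * 8L;
    }
}
```

For example, with 10,000,000 documents and 8 distinct values, 
intArrayBytes gives 40,000,000 bytes while packedBytes with 3 bits gives 
3,750,000 bytes. That saving only survives if callers read values through an 
accessor instead of receiving an int[].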

It would be better to compare with SOLR use cases and to make the API closer 
to real requirements; SOLR operates on bitsets rather than arrays...

  
> Add unsigned packed int impls in oal.util
> -----------------------------------------
>
>                 Key: LUCENE-1990
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1990
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Priority: Minor
>
> There are various places in Lucene that could take advantage of an
> efficient packed unsigned int/long impl.  EG the terms dict index in
> the standard codec in LUCENE-1458 could substantially reduce its RAM
> usage.  FieldCache.StringIndex could as well.  And I think "load into
> RAM" codecs like the one in TestExternalCodecs could use this too.
> I'm picturing something very basic like:
> {code}
> interface PackedUnsignedLongs  {
>   long get(long index);
>   void set(long index, long value);
> }
> {code}
> Plus maybe an iterator for getting and maybe also for setting.  If it
> helps, most of the usages of this inside Lucene will be "write once"
> so eg the set could make that an assumption/requirement.
> And a factory somewhere:
> {code}
>   PackedUnsignedLongs create(int count, long maxValue);
> {code}
> I think we should simply autogen the code (we can start from the
> autogen code in LUCENE-1410), or, if there is a good existing impl
> that has a compatible license that'd be great.
> I don't have time near-term to do this... so if anyone has the itch,
> please jump!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


