[
https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775420#action_12775420
]
Fuad Efendi edited comment on LUCENE-1990 at 11/10/09 4:10 PM:
---------------------------------------------------------------
Specifically for FieldCache, let's see... suppose Field may have 8 different
values, and number of documents is high.
{code}
Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ...
Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ...
Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ...
Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ...
Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ...
Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ...
Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ...
Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ...
{code}
- represented as Matrix (or as a Vector); for instance, first row means that
Document1 and Document8 have Value0.
And now, if we go "horizontally" we will end up with 8 arrays of int[]. What if
we go "vertically"? Field could be encoded as 3-bit (8 different values).
CONSTRAINT: specifically for FieldCache, each Column must have the only "1".
And we can end with array of 3-bit values storing position in a column! Size of
array is IndexReader.maxDoc().
hope I am reinventing bycycle :)
P.S.
Of course each solution has pros and cons, I am trying to focus on FieldCache
specific use cases.
1. For a given document ID, find a value for a field
2. For a given query results, sort it by a field values
3. For a given query results, count "facet" for each field value
I don't think such naive compression is slower than abstract int[] arrays...
and we need to change public API of field cache too: if method returns int[] we
are not saving any RAM.
Better is to compare with SOLR use cases and to make API closer to real
requirements; SOLR operates with some bitsets instead of arrays...
was (Author: funtick):
Specifically for FieldCache, let's see... suppose Field may have 8
different values, and number of documents is high.
{code}
Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ...
Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ...
Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ...
Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ...
Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ...
Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ...
Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ...
Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ...
{code}
- represented as Matrix (or as a Vector); for instance, first row means that
Document1 and Document8 have Value0.
And now, if we go "horizontally" we will end up with 8 arrays of int[]. What if
we go "vertically"? Field could be encoded as 3-bit (8 different values).
CONSTRAINT: specifically for FieldCache, each Column must have the only "1".
And we can end with array of 3-bit values storing position in a column! Size of
array is IndexReader.maxDoc().
hope I am reinventing bycycle :)
> Add unsigned packed int impls in oal.util
> -----------------------------------------
>
> Key: LUCENE-1990
> URL: https://issues.apache.org/jira/browse/LUCENE-1990
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Priority: Minor
>
> There are various places in Lucene that could take advantage of an
> efficient packed unsigned int/long impl. EG the terms dict index in
> the standard codec in LUCENE-1458 could subsantially reduce it's RAM
> usage. FieldCache.StringIndex could as well. And I think "load into
> RAM" codecs like the one in TestExternalCodecs could use this too.
> I'm picturing something very basic like:
> {code}
> interface PackedUnsignedLongs {
> long get(long index);
> void set(long index, long value);
> }
> {code}
> Plus maybe an iterator for getting and maybe also for setting. If it
> helps, most of the usages of this inside Lucene will be "write once"
> so eg the set could make that an assumption/requirement.
> And a factory somewhere:
> {code}
> PackedUnsignedLongs create(int count, long maxValue);
> {code}
> I think we should simply autogen the code (we can start from the
> autogen code in LUCENE-1410), or, if there is an good existing impl
> that has a compatible license that'd be great.
> I don't have time near-term to do this... so if anyone has the itch,
> please jump!
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]