[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656150#action_12656150 ]
Marvin Humphrey commented on LUCENE-831:
----------------------------------------

>> Building the docID -> ord array is straightforward for a single-segment
>> SegLexicon. The multi-segment case requires that several SegLexicons be
>> collated using a priority queue. In KS, there's a MultiLexicon class which
>> handles this; I don't believe that Lucene has an analogous class.
>
> Lucene achieves the same functionality by using a MultiReader to read
> the terms in order (which uses MultiSegmentReader.MultiTermEnum, which
> uses a pqueue under the hood) and building up StringIndex from that.
> It's very costly.

Ah, you're right, that class is analogous. The difference is that MultiTermEnum doesn't implement seek(), let alone seekByNum(). I was pretty sure you wouldn't have bothered, since by loading the actual term values into an array you eliminate the need for seeking the iterator.

> OK so your "keyword" field type would expose random-access to field
> values by docID,

Yes. There would be three files for each keyword field in a segment:

* docID -> ord map. A stack of i32_t, one per doc.
* Character data. Each unique field value would be stored as uncompressed UTF-8, sorted lexically (by default).
* Term offsets. A stack of i64_t, one per term plus one, demarcating the term text boundaries in the character data file.

Assuming that we've mmap'd those files -- or slurped them -- here's the function to find the keyword value associated with a doc num:

{code}
void
KWField_Look_Up(KeyWordField *self, i32_t doc_num, ViewCharBuf *target)
{
    if (doc_num > self->max_doc) {
        CONFESS("Doc num out of range: %i32 %i32", doc_num, self->max_doc);
    }
    else {
        /* Map the doc to its term ordinal, then use the term offsets to
         * demarcate the value's bounds in the character data. */
        i32_t ord         = self->ords[doc_num];
        i64_t offset      = self->offsets[ord];
        i64_t next_offset = self->offsets[ord + 1];
        i64_t len         = next_offset - offset;
        ViewCB_Assign_Str(target, self->chardata + offset, len);
    }
}
{code}

I'm not sure whether IndexReader.fetchDoc() should retrieve the values for keyword fields by default, but I lean towards yes.
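To make the three-file layout concrete, here's a minimal stand-alone C sketch of a lookup over that layout. All names are hypothetical, and plain arrays stand in for the mmap'd (or slurped) per-segment files:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy single-segment data for one keyword field.  Three unique values,
 * stored sorted: "ape" < "bat" < "cat".                                 */
static const char    chardata[] = "apebatcat";      /* character data file */
static const int64_t offsets[]  = { 0, 3, 6, 9 };   /* one per term plus 1 */
static const int32_t ords[]     = { 2, 0, 0, 1 };   /* docID -> ord map    */

/* Find the keyword value for doc_num: consult the ord map first, then use
 * adjacent term offsets to demarcate the value in the character data.     */
static const char*
keyword_look_up(int32_t doc_num, int64_t *len)
{
    int32_t ord    = ords[doc_num];
    int64_t offset = offsets[ord];
    *len = offsets[ord + 1] - offset;
    return chardata + offset;
}
```

Looking up doc 0 walks ords[0] = 2 to offsets[2..3] and yields "cat"; note that because the offsets file has one entry more than there are terms, the length is always available without a separate lengths file.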
The locality isn't ideal, but I don't think it'll be bad enough to contemplate storing keyword values redundantly alongside the other stored field values.

> to be used to merge the N segments' pqueues into a
> single final pqueue?

Yes, although I think you only need two priority queues total: one dedicated to iterating intra-segment, which gets emptied out after each seg into the other, final queue.

> The alternative is to use iterator but pull the values into your
> pqueues when they are inserted. The benefit is iterator-only
> exposure, but the downside is likely higher net cost of insertion.
> And if the "assumption" is these fields can generally be ram resident
> (explicitly or via mmap), then the net benefit of iterator-only API is
> not high.

If I understand where you're going, you'd like to apply the design of the deletions iterator to this problem? For that to work, we'd need to store values for each document, rather than only unique values... right? And they couldn't be stored in sorted order, because we aren't pre-sorting the docs in the segment according to the value of a keyword field -- which means string diffs don't help. You'd have a single file, with each doc's values encoded as a vbyte byte-count followed by UTF-8 character data.

> Complete overhaul of FieldCache API/Implementation
> --------------------------------------------------
>
>                 Key: LUCENE-831
>                 URL: https://issues.apache.org/jira/browse/LUCENE-831
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Hoss Man
>             Fix For: 3.0
>
>         Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff,
> fieldcache-overhaul.diff, fieldcache-overhaul.diff,
> LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff,
> LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch,
> LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch
>
>
> Motivation:
> 1) Completely overhaul the API/implementation of "FieldCache" type things...
> a) eliminate global static map keyed on IndexReader (thus
> eliminating synch block between completely independent IndexReaders)
> b) allow more customization of cache management (ie: use
> expiration/replacement strategies, disk backed caches, etc)
> c) allow people to define custom cache data logic (ie: custom
> parsers, complex datatypes, etc... anything tied to a reader)
> d) allow people to inspect what's in a cache (list of CacheKeys) for
> an IndexReader so a new IndexReader can be likewise warmed.
> e) Lend support for smarter cache management if/when
> IndexReader.reopen is added (merging of cached data from subReaders).
> 2) Provide backwards compatibility to support existing FieldCache API with
> the new implementation, so there is no redundant caching as client code
> migrates to new API.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
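Returning to the collation question above: the two-priority-queue scheme -- one intra-segment queue drained into a final queue after each segment -- can be sketched in C. This is a toy with invented names; a simple bounded, sorted array stands in for a real heap, and plain ints stand in for per-segment sort keys:

```c
#include <assert.h>

#define CAPACITY 3

/* A tiny bounded "priority queue": keeps the CAPACITY smallest ints in
 * ascending order via insertion.  A stand-in for a real binary heap.    */
typedef struct { int vals[CAPACITY]; int size; } BoundedPQ;

static void PQ_insert(BoundedPQ *pq, int v) {
    if (pq->size == CAPACITY && v >= pq->vals[CAPACITY - 1]) return;
    int i = (pq->size < CAPACITY) ? pq->size++ : CAPACITY - 1;
    while (i > 0 && pq->vals[i - 1] > v) { pq->vals[i] = pq->vals[i - 1]; i--; }
    pq->vals[i] = v;
}

/* Empty the intra-segment queue into the final queue, then reset it. */
static void PQ_drain_into(BoundedPQ *src, BoundedPQ *dest) {
    for (int i = 0; i < src->size; i++) PQ_insert(dest, src->vals[i]);
    src->size = 0;
}

/* Collate N segments' values using only two queues total: the intra-segment
 * queue is reused for each segment and drained into the final queue.       */
static BoundedPQ collate(const int *segs[], const int seg_lens[], int n_segs) {
    BoundedPQ intra = { {0}, 0 }, final_q = { {0}, 0 };
    for (int s = 0; s < n_segs; s++) {
        for (int i = 0; i < seg_lens[s]; i++) PQ_insert(&intra, segs[s][i]);
        PQ_drain_into(&intra, &final_q);    /* emptied out after each seg */
    }
    return final_q;
}
```

Because the intra-segment queue is bounded, only its competitive survivors pay the cost of insertion into the final queue, which is the point of keeping the two queues separate rather than feeding every value straight into one global queue.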