[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

Michael McCandless (JIRA) Fri, 12 Dec 2008 04:14:12 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655988#action_12655988 ]


Michael McCandless commented on LUCENE-831:
-------------------------------------------


{quote}
> At present, KS only caches the docID -> ord map as an array. It builds that
> array by iterating over the terms in the sort field's Lexicon and mapping the
> docIDs from each term's posting list.
{quote}

OK, that corresponds to the "order" array in Lucene's
FieldCache.StringIndex class.
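
For readers following along, here is a minimal sketch of the docID -> ord
construction described in the quote. It is not KS or Lucene code: the
OrdMapBuilder class, the in-memory SortedMap standing in for a Lexicon plus
posting lists, and the use of -1 for docs without a value are all
assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.SortedMap;

// Hypothetical helper, not a Lucene or KinoSearch class.
class OrdMapBuilder {

    // postingsByTerm stands in for a sort field's term dictionary: each term
    // maps to the docIDs in its posting list. Iterating a SortedMap visits
    // the terms in sorted order, so the running counter is the term's ordinal.
    static int[] buildDocToOrd(SortedMap<String, int[]> postingsByTerm, int maxDoc) {
        int[] docToOrd = new int[maxDoc];
        // Assumption of this sketch: docs with no value for the field get -1.
        Arrays.fill(docToOrd, -1);
        int ord = 0;
        for (Map.Entry<String, int[]> entry : postingsByTerm.entrySet()) {
            for (int docID : entry.getValue()) {
                docToOrd[docID] = ord;
            }
            ord++;
        }
        return docToOrd;
    }
}
```

Sorting hits within one index then only needs integer comparisons on
docToOrd[docID]; the term text itself is never touched.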

{quote}
> Building the docID -> ord array is straightforward for a single-segment
> SegLexicon. The multi-segment case requires that several SegLexicons be
> collated using a priority queue. In KS, there's a MultiLexicon class which
> handles this; I don't believe that Lucene has an analogous class.
{quote}

Lucene achieves the same functionality by using a MultiReader to read
the terms in order (which uses MultiSegmentReader.MultiTermEnum, which
uses a pqueue under the hood) and building up StringIndex from that.
It's very costly.
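
The priority-queue collation can be sketched roughly as follows. This is
not the MultiTermEnum or MultiLexicon implementation; MergedTerms, Cursor
and the plain List<String> inputs are hypothetical stand-ins for
per-segment term enumerations, each assumed to be sorted and free of
duplicates.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not a Lucene or KinoSearch class.
class MergedTerms {

    // One queue entry: the current term of one segment plus the rest of its terms.
    private static final class Cursor {
        String term;
        final Iterator<String> rest;

        Cursor(Iterator<String> terms) {
            this.rest = terms;
            this.term = terms.next();
        }
    }

    // Merges the sorted term lists of several segments into one sorted,
    // de-duplicated list. A term's position in the result is its index-wide ord.
    static List<String> merge(List<List<String>> segmentTerms) {
        PriorityQueue<Cursor> pq =
            new PriorityQueue<>((a, b) -> a.term.compareTo(b.term));
        for (List<String> terms : segmentTerms) {
            if (!terms.isEmpty()) {
                pq.add(new Cursor(terms.iterator()));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!pq.isEmpty()) {
            Cursor top = pq.poll();
            // The same term may occur in several segments; emit it only once.
            if (merged.isEmpty() || !merged.get(merged.size() - 1).equals(top.term)) {
                merged.add(top.term);
            }
            // Advance this segment's cursor and put it back in the queue.
            if (top.rest.hasNext()) {
                top.term = top.rest.next();
                pq.add(top);
            }
        }
        return merged;
    }
}
```

Every term of every segment passes through the queue once, which gives a
feel for why rebuilding the index-wide structure is expensive.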

{quote}
> Relying on the docID -> ord array alone works quite well until you get to the
> MultiSearcher case. As you know, at that point you need to be able to
> retrieve the actual field values from the ordinal numbers, so that you can
> compare across multiple searchers (since the ordinal values are meaningless).
{quote}

Right, and we are trying to move towards pushing the searcher down to
the segment.  Then we can use the per-segment ords for within-segment
collection, and then the real values for merging the separate pqueues
at the end (though initial results from LUCENE-1483 show that
collecting N queues and then merging them at the end adds a ~20%
slowdown for N = 100 segments).
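
A rough sketch of that two-phase idea, under simplifying assumptions: the
Segment class with its order/lookup arrays is a hypothetical stand-in
modelled loosely on StringIndex, the methods return field values rather
than hits, and the final merge is a plain sort instead of a pqueue. It is
not the LUCENE-1483 code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not a Lucene class.
class SegmentSortSketch {

    // Per-segment cache: order[doc] is the doc's ord, lookup[ord] is the value.
    static final class Segment {
        final int[] order;
        final String[] lookup;

        Segment(int[] order, String[] lookup) {
            this.order = order;
            this.lookup = lookup;
        }
    }

    // Phase 1: keep the k smallest ords among the hits of one segment.
    // Only ints are compared here; values are looked up for the survivors.
    static List<String> topValues(Segment seg, int[] hits, int k) {
        // Max-heap on ord, so the worst entry is the one evicted.
        PriorityQueue<Integer> pq = new PriorityQueue<>(Comparator.reverseOrder());
        for (int doc : hits) {
            pq.add(seg.order[doc]);
            if (pq.size() > k) {
                pq.poll();
            }
        }
        List<String> values = new ArrayList<>();
        for (int ord : pq) {
            values.add(seg.lookup[ord]);
        }
        return values;
    }

    // Phase 2: ords from different segments are not comparable, so the
    // per-segment results are merged by real value.
    static List<String> mergeTop(List<List<String>> perSegment, int k) {
        List<String> all = new ArrayList<>();
        for (List<String> values : perSegment) {
            all.addAll(values);
        }
        all.sort(Comparator.naturalOrder());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }
}
```

The string comparisons are confined to phase 2, which touches at most k
entries per segment.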

{quote}
> Lex_Seek_By_Num(lexicon, term_num);
> field_val = Lex_Get_Term(lexicon);
> 
> The problem is that seeking by ordinal value on a MultiLexicon iterator
> requires a gnarly implementation and is very expensive. I got it working, but
> I consider it a dead-end design and a failed experiment.
{quote}

OK.

{quote}
> The planned replacement for these iterator-based quasi-FieldCaches involves
> several topics of recent discussion:
> 
> 1) A "keyword" field type, implemented using a format similar to what Nate
> and I came up with for the lexicon index.
> 2) Write per-segment docID -> ord maps at index time for sort fields.
> 3) Memory mapping.
> 4) Segment-centric searching.
> 
> We'd mmap the pre-composed docID -> ord map and use it for intra-segment
> sorting. The keyword field type would be implemented in such a way that we'd
> be able to mmap a few files and get a per-segment field cache, which we'd then
> use to sort hits from multiple segments.
{quote}

OK so your "keyword" field type would expose random-access to field
values by docID, to be used to merge the N segments' pqueues into a
single final pqueue?

The alternative is to use an iterator but pull the values into your
pqueues when they are inserted.  The benefit is iterator-only
exposure, but the downside is a likely higher net cost of insertion.
And if the "assumption" is that these fields can generally be RAM
resident (explicitly or via mmap), then the net benefit of an
iterator-only API is not high.


> Complete overhaul of FieldCache API/Implementation
> --------------------------------------------------
>
>                 Key: LUCENE-831
>                 URL: https://issues.apache.org/jira/browse/LUCENE-831
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Hoss Man
>             Fix For: 3.0
>
>         Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, 
> fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
> LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, 
> LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, 
> LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch
>
>
> Motivation:
> 1) Completely overhaul the API/implementation of "FieldCache" type things...
>     a) eliminate global static map keyed on IndexReader (thus
>         eliminating synch block between completely independent IndexReaders)
>     b) allow more customization of cache management (ie: use 
>         expiration/replacement strategies, disk backed caches, etc)
>     c) allow people to define custom cache data logic (ie: custom
>         parsers, complex datatypes, etc... anything tied to a reader)
>     d) allow people to inspect what's in a cache (list of CacheKeys) for
>         an IndexReader so a new IndexReader can be likewise warmed. 
>     e) Lend support for smarter cache management if/when
>         IndexReader.reopen is added (merging of cached data from subReaders).
> 2) Provide backwards compatibility to support existing FieldCache API with
>     the new implementation, so there is no redundant caching as client code
>     migrates to new API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


