Re: [jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

Erick Erickson Thu, 10 Dec 2009 09:40:35 -0800

Mike:

Which of these do you think this patch *should* address before committing?
Just the last two?
As many as Christian has energy for <G>?


On Thu, Dec 10, 2009 at 12:24 PM, Michael McCandless (JIRA) <[email protected]
> wrote:

>
>    [
> https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788798#action_12788798]
>
> Michael McCandless commented on LUCENE-2133:
> --------------------------------------------
>
> This patch is a good step forward -- it associates the cache directly
> with IndexReader, where it belongs; it cleanly decouples cache from
> reader (vs the hack we have today with IndexReader.getFieldCacheKey),
> so that eg cloned readers can share the same cache; it also preserves
> back compat, which is quite a stunning accomplishment :)
>
> But... there are many more things I don't like about FieldCache, that
> I'm not sure (?) the patch addresses:
>
>  * Uninversion to derive eg an int[] is horribly slow, compared to
>    say loading the pre-encoded binary ints from disk, created during
>    indexing.  Ie, I think, if we are going to overhaul FieldCache
>    API, we should somehow make LUCENE-1231 feasible.
>
>  * There's no pluggability to customize where the int[] comes from
>    for a given field -- most you can do is provide your own IntParser
>    that the uninverter uses.  EG the fact that the patch had to
>    "move" FieldCacheRange/TermsFilter down, is strange -- these
>    filters (and in general any future "cache consumers") should live
>    in oal.search, but simply pull from a different int[] source,
>    somehow.
>
>  * Error checking of single-value-per-field (for StringIndex) is
>    dangerous, today -- it's intermittent, and, it's an unchecked
>    exception.  We should probably just remove the exception, or,
>    maybe make it checked.  Actually I'll go open a new issue for
>    this.  We should simply fix this.
>
>  * Single-value-per-field limitation (though, that's a nice to have,
>    future improvement)
>
>  * Even accepting the single-value-per-field limitaiton, we should
>    allow multiple values per field during uninversion, w/
>    customizable logic about which value is kept as the single one
>    (there is an issue open for this I think).  This really should be
>    some sort of added extensibility to whatever class drives
>    uninversion...
>
>  * The terror of accidentally asking for the array at the top-level
>    of Multi/DirReader.  I think this shouldn't even be allowed, at
>    least not easily, ie Dir/MultiReader.getIndexCache should throw
>    UOE.  If we really wanted to, we could provide sugar methods in
>    maybe ReaderUtil to "glom N int[]'s into a new int[]".  But it
>    should be named something scary :) Then we wouldn't need any
>    insanity checking.
>
>  * No control over caching policy (cannot evict things)
>
>  * If we make field cache flexible enough, we could maybe fold norms
>    & deleted docs into it (would be a separate future issue to
>    actually do so...).
>
> Some other questions about the patch:
>
>  * Consumers of the cache API (the sort comparator,
>    FieldCacheTerms/RangeFilter, and any other future users of "the
>    field cache") shouldn't have to move down into fields sub-package?
>
>  * It's a little strange that the term vectors & fields reader also
>    got pulled into the cache?
>
>
> > [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
> > -------------------------------------------------------------------------
> >
> >                 Key: LUCENE-2133
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-2133
> >             Project: Lucene - Java
> >          Issue Type: Improvement
> >          Components: Search
> >    Affects Versions: 2.9.1, 3.0
> >            Reporter: Christian Kohlschütter
> >         Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch,
> LUCENE-2133.patch, LUCENE-2133.patch
> >
> >
> > Hi all,
> > up to the current version Lucene contains a conceptual flaw, that is the
> FieldCache. The FieldCache is a singleton which is supposed to cache certain
> information for every IndexReader that is currently open
> > The FieldCache is flawed because it is incorrect to assume that:
> > 1. one IndexReader instance equals one index. In fact, there can be many
> clones (of SegmentReader) or decorators (FilterIndexReader) which all access
> the very same data.
> > 2. the cache information remains valid for the lifetime of an
> IndexReader. In fact, some IndexReaders may be reopen()'ed and thus they may
> contain completely different information.
> > 3. all IndexReaders need the same type of cache. In fact, because of the
> limitations imposed by the singleton construct there was no implementation
> other than FieldCacheImpl.
> > Furthermore, FieldCacheImpl and FieldComparator are bloated by several
> static inner-classes that could be moved to package level.
> > There have been a few attempts to improve FieldCache, namely LUCENE-831,
> LUCENE-1579 and LUCENE-1749, but the overall situation remains the same:
> There is a central registry for assigning Caches to IndexReader instances.
> > I now propose the following:
> > 1. Obsolete FieldCache and FieldCacheKey and provide index-specific,
> extensible cache instances ("IndexCache"). IndexCaches provide common
> caching functionality for all IndexReaders and may be extended (for example,
> SegmentReader would have a SegmentReaderIndexCache and store different data
> than a regular IndexCache)
> > 2. Add the index-specific field cache (IndexFieldCache) to the
> IndexCache. IndexFieldCache is an interface just like FieldCache and may
> support different implementations.
> > 3. The IndexCache instances may be flushed/closed by the associated
> IndexReaders whenever necessary.
> > 4. Obsolete FieldCacheSanityChecker because no more "insanities" are
> expected (or at least, they do not impact the overall performance)
> > 5. Refactor FieldCacheImpl and the related classes (FieldComparator,
> SortField)
> > I have provided an patch which takes care of all these issues. It passes
> all JUnit tests.
> > The patch is quite large, admittedly, but the change required several
> modifications and some more to preserve backwards-compatibility.
> Backwards-compatibility is preserved by moving some of the updated
> functionality in the package org.apache.lucene.search.fields (field
> comparators and parsers, SortField) while adding wrapper instances and
> keeping old code in org.apache.lucene.search.
> > In detail and besides the above mentioned improvements, the following is
> provided:
> > 1. An IndexCache specific for SegmentReaders. The two ThreadLocals are
> moved from SegmentReader to SegmentReaderIndexCache.
> > 2. A housekeeping improvement to CloseableThreadLocal. Now delegates the
> close() method to all registered instances by calling an onClose() method
> with the threads' instances.
> > 3. Analyzer.close now may throw an IOException (this already is covered
> by java.io.Closeable).
> > 4. A change to Collector: allow IndexCache instead of IndexReader being
> passed to setNextReader()
> > 5. SortField's numeric types have been replaced by direct assignments of
> FieldComparatorSource. This removes the "switch" statements and the
> possibility to throw IllegalArgumentExceptions because of unsupported type
> values.
> > The following classes have been deprecated and replaced by new classes in
> org.apache.lucene.search.fields:
> > - FieldCacheRangeFilter (=> IndexFieldCacheRangeFilter)
> > - FieldCacheTermsFilter (=> IndexFieldCacheTermsFilter)
> > - FieldCache (=> IndexFieldCache)
> > - FieldCacheImpl (=> IndexFieldCacheImpl)
> > - all classes in FieldCacheImpl (=> several package-level classes)
> > - all subclasses of FieldComparator (=> several package-level classes)
> > Final notes:
> > - The patch would be simpler if no backwards compatibility was necessary.
> The Lucene community has to decide which classes/methods can immediately be
> removed, which ones later, which not at all. Whenever new classes depend on
> the old ones, an appropriate notice exists in the javadocs.
> > - The patch introduces a new, deprecated class
> IndexFieldCacheSanityChecker.java which is just there for testing purposes,
> to show that no sanity checks are necessary any longer. This class may be
> removed at any time.
> > - I expect that the patch does not impact performance. On the contrary,
> as the patch removes a few unnecessary checks we might even see a slight
> speedup. No benchmarking has been done so far, though.
> > - I have tried to preserve the existing functionality wherever possible
> and to focus on the class/method structure only. We certainly may improve
> the caches' behavior, but this out of scope for this patch.
> > - The refactoring finally makes the high duplication of code visible: For
> all supported atomic types (byte, double, float, int, long, short) three
> classes each are required: *Cache, *Comparator and *Parser. I think that
> further simplification might be possible (maybe using Java generics?), but I
> guess the current patch is large enough for now.
> > Cheers,
> > Christian
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Re: [jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

Reply via email to