[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

Michael McCandless (JIRA) Fri, 11 Dec 2009 03:29:44 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789249#action_12789249
 ]


Michael McCandless commented on LUCENE-2133:
--------------------------------------------

bq.  I was just a little scared about the plethora of classes in that package. 

Let's separately worry about that?  I think consumers of the cache API
shouldn't have to move.

bq. Caching fields is only one functionality provided by IndexCache. I took the 
opportunity to demonstrate caching something other than fields. You now save a 
few bytes per SegmentReader instance because of that change.

OK but let's also do this separately (Earwin I think is working on a
patch for componentizing SR, which'd be great).  FieldCache, "just"
one of IndexReaders components, is hard enough to fix; let's just
focus on it :)

Also, as hackish as it is, the getFieldCacheKey() works well.  The
need for insanity checking is annoying, so let's just disallow getting
cached @ Dir/MultiReader level (and offer sugar APIs if really
needed).

And as annoying as the static FieldCache is, LUCENE-2135 plugs yet
another hold in the dike, ie, now entries are purged on close, and you
can also explicitly purge a given IndexReader.

I don't think we should wholesale move the cache APIs just for the
sake of moving them.  That's alot of API noise; we need some real
improvements to justify making users switch.

bq. Since there are indeed several people working on things that are closely 
related, esp. LUCENE-2135) a branch would make sense

bq. Can we summarize all the open points for a "worthy" cache overhaul, close 
LUCENE-831 in favor of LUCENE-2133 and continue working here?

I don't think we need a branch just yet.  (LUCENE-2135 is a very
standalone change from this issue).  And, before closing other issues,
committing this one as a start, etc, we really first need to reach
some consensus on what even to do, here.

I really want to see us first fix FieldCache's deep problems, then
concern ourselves with where the resulting API will go / what it looks
like.  Ie, we're putting the horse before the cart, here, so far...

EG, what [minimal] changes could we make to allow someone to plugin
their own values source?  If we make this:

{code}
class FieldValues {
  public FIELD_TYPE getType();
  public int[] getInts() {};
 // and all the other types...
}

class FieldValuesSource implements Closeable {
  public FieldValues getValues(FieldInfo field);
  public void close();
}
{code}

We could then make an UninverterFieldValuesSource, subclassing
PerFieldValueSource, that takes IndexReader to its ctor, lets your
register parsers by field.  And it'd allow customization beyond simply
changing the parser, eg to control behavior of multi-valued fields,
stopping early (NRQ does this), etc.

We'd have CachedFieldValueSource that'd wrap any FieldValuesSource and
cache, allowing you to subclass it if you want to do eviction, etc.
It'd by default implement the same simplistic policy we have today --
retain everything, until close at which point you discard everything.

All consumers of field cache (sorting by field,
FieldCacheRange/TermsFilter, function queries, etc.) should all allow
me to pass in my own FieldValuesSource.  IndexReader would let me
customize, but would default to the cached uninversion source.

I could then in theory [external to Lucene] make a FieldValueSource
that slurped stuff from disk, and I could create LUCENE-1231 (= CSF)
outside of Lucene.  [And, when we build CSF inside of Lucene it'd also
just be another source].

Something along these lines maybe....?


> [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-2133
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2133
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.9.1, 3.0
>            Reporter: Christian Kohlschütter
>         Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, 
> LUCENE-2133.patch, LUCENE-2133.patch
>
>
> Hi all,
> up to the current version Lucene contains a conceptual flaw, that is the 
> FieldCache. The FieldCache is a singleton which is supposed to cache certain 
> information for every IndexReader that is currently open
> The FieldCache is flawed because it is incorrect to assume that:
> 1. one IndexReader instance equals one index. In fact, there can be many 
> clones (of SegmentReader) or decorators (FilterIndexReader) which all access 
> the very same data.
> 2. the cache information remains valid for the lifetime of an IndexReader. In 
> fact, some IndexReaders may be reopen()'ed and thus they may contain 
> completely different information.
> 3. all IndexReaders need the same type of cache. In fact, because of the 
> limitations imposed by the singleton construct there was no implementation 
> other than FieldCacheImpl.
> Furthermore, FieldCacheImpl and FieldComparator are bloated by several static 
> inner-classes that could be moved to package level.
> There have been a few attempts to improve FieldCache, namely LUCENE-831, 
> LUCENE-1579 and LUCENE-1749, but the overall situation remains the same: 
> There is a central registry for assigning Caches to IndexReader instances.
> I now propose the following:
> 1. Obsolete FieldCache and FieldCacheKey and provide index-specific, 
> extensible cache instances ("IndexCache"). IndexCaches provide common caching 
> functionality for all IndexReaders and may be extended (for example, 
> SegmentReader would have a SegmentReaderIndexCache and store different data 
> than a regular IndexCache)
> 2. Add the index-specific field cache (IndexFieldCache) to the IndexCache. 
> IndexFieldCache is an interface just like FieldCache and may support 
> different implementations.
> 3. The IndexCache instances may be flushed/closed by the associated 
> IndexReaders whenever necessary.
> 4. Obsolete FieldCacheSanityChecker because no more "insanities" are expected 
> (or at least, they do not impact the overall performance)
> 5. Refactor FieldCacheImpl and the related classes (FieldComparator, 
> SortField) 
> I have provided an patch which takes care of all these issues. It passes all 
> JUnit tests.
> The patch is quite large, admittedly, but the change required several 
> modifications and some more to preserve backwards-compatibility. 
> Backwards-compatibility is preserved by moving some of the updated 
> functionality in the package org.apache.lucene.search.fields (field 
> comparators and parsers, SortField) while adding wrapper instances and 
> keeping old code in org.apache.lucene.search.
> In detail and besides the above mentioned improvements, the following is 
> provided:
> 1. An IndexCache specific for SegmentReaders. The two ThreadLocals are moved 
> from SegmentReader to SegmentReaderIndexCache.
> 2. A housekeeping improvement to CloseableThreadLocal. Now delegates the 
> close() method to all registered instances by calling an onClose() method 
> with the threads' instances.
> 3. Analyzer.close now may throw an IOException (this already is covered by 
> java.io.Closeable).
> 4. A change to Collector: allow IndexCache instead of IndexReader being 
> passed to setNextReader()
> 5. SortField's numeric types have been replaced by direct assignments of 
> FieldComparatorSource. This removes the "switch" statements and the 
> possibility to throw IllegalArgumentExceptions because of unsupported type 
> values.
> The following classes have been deprecated and replaced by new classes in 
> org.apache.lucene.search.fields:
> - FieldCacheRangeFilter (=> IndexFieldCacheRangeFilter)
> - FieldCacheTermsFilter (=> IndexFieldCacheTermsFilter)
> - FieldCache (=> IndexFieldCache)
> - FieldCacheImpl (=> IndexFieldCacheImpl)
> - all classes in FieldCacheImpl (=> several package-level classes)
> - all subclasses of FieldComparator (=> several package-level classes)
> Final notes:
> - The patch would be simpler if no backwards compatibility was necessary. The 
> Lucene community has to decide which classes/methods can immediately be 
> removed, which ones later, which not at all. Whenever new classes depend on 
> the old ones, an appropriate notice exists in the javadocs.
> - The patch introduces a new, deprecated class 
> IndexFieldCacheSanityChecker.java which is just there for testing purposes, 
> to show that no sanity checks are necessary any longer. This class may be 
> removed at any time.
> - I expect that the patch does not impact performance. On the contrary, as 
> the patch removes a few unnecessary checks we might even see a slight 
> speedup. No benchmarking has been done so far, though.
> - I have tried to preserve the existing functionality wherever possible and 
> to focus on the class/method structure only. We certainly may improve the 
> caches' behavior, but this out of scope for this patch.
> - The refactoring finally makes the high duplication of code visible: For all 
> supported atomic types (byte, double, float, int, long, short) three classes 
> each are required: *Cache, *Comparator and *Parser. I think that further 
> simplification might be possible (maybe using Java generics?), but I guess 
> the current patch is large enough for now.
> Cheers,
> Christian

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

Reply via email to