Mike: Which of these do you think this patch *should* address before committing? Just the last two? As many as Christian has energy for <G>?
On Thu, Dec 10, 2009 at 12:24 PM, Michael McCandless (JIRA) <j...@apache.org > wrote: > > [ > https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788798#action_12788798] > > Michael McCandless commented on LUCENE-2133: > -------------------------------------------- > > This patch is a good step forward -- it associates the cache directly > with IndexReader, where it belongs; it cleanly decouples cache from > reader (vs the hack we have today with IndexReader.getFieldCacheKey), > so that eg cloned readers can share the same cache; it also preserves > back compat, which is quite a stunning accomplishment :) > > But... there are many more things I don't like about FieldCache, that > I'm not sure (?) the patch addresses: > > * Uninversion to derive eg an int[] is horribly slow, compared to > say loading the pre-encoded binary ints from disk, created during > indexing. Ie, I think, if we are going to overhaul FieldCache > API, we should somehow make LUCENE-1231 feasible. > > * There's no pluggability to customize where the int[] comes from > for a given field -- most you can do is provide your own IntParser > that the uninverter uses. EG the fact that the patch had to > "move" FieldCacheRange/TermsFilter down, is strange -- these > filters (and in general any future "cache consumers") should live > in oal.search, but simply pull from a different int[] source, > somehow. > > * Error checking of single-value-per-field (for StringIndex) is > dangerous, today -- it's intermittent, and, it's an unchecked > exception. We should probably just remove the exception, or, > maybe make it checked. Actually I'll go open a new issue for > this. We should simply fix this. > > * Single-value-per-field limitation (though, that's a nice to have, > future improvement) > > * Even accepting the single-value-per-field limitaiton, we should > allow multiple values per field during uninversion, w/ > customizable logic about which value is kept as the single one > (there is an issue open for this I think). This really should be > some sort of added extensibility to whatever class drives > uninversion... > > * The terror of accidentally asking for the array at the top-level > of Multi/DirReader. I think this shouldn't even be allowed, at > least not easily, ie Dir/MultiReader.getIndexCache should throw > UOE. If we really wanted to, we could provide sugar methods in > maybe ReaderUtil to "glom N int[]'s into a new int[]". But it > should be named something scary :) Then we wouldn't need any > insanity checking. > > * No control over caching policy (cannot evict things) > > * If we make field cache flexible enough, we could maybe fold norms > & deleted docs into it (would be a separate future issue to > actually do so...). > > Some other questions about the patch: > > * Consumers of the cache API (the sort comparator, > FieldCacheTerms/RangeFilter, and any other future users of "the > field cache") shouldn't have to move down into fields sub-package? > > * It's a little strange that the term vectors & fields reader also > got pulled into the cache? > > > > [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField > > ------------------------------------------------------------------------- > > > > Key: LUCENE-2133 > > URL: https://issues.apache.org/jira/browse/LUCENE-2133 > > Project: Lucene - Java > > Issue Type: Improvement > > Components: Search > > Affects Versions: 2.9.1, 3.0 > > Reporter: Christian Kohlschütter > > Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, > LUCENE-2133.patch, LUCENE-2133.patch > > > > > > Hi all, > > up to the current version Lucene contains a conceptual flaw, that is the > FieldCache. The FieldCache is a singleton which is supposed to cache certain > information for every IndexReader that is currently open > > The FieldCache is flawed because it is incorrect to assume that: > > 1. one IndexReader instance equals one index. In fact, there can be many > clones (of SegmentReader) or decorators (FilterIndexReader) which all access > the very same data. > > 2. the cache information remains valid for the lifetime of an > IndexReader. In fact, some IndexReaders may be reopen()'ed and thus they may > contain completely different information. > > 3. all IndexReaders need the same type of cache. In fact, because of the > limitations imposed by the singleton construct there was no implementation > other than FieldCacheImpl. > > Furthermore, FieldCacheImpl and FieldComparator are bloated by several > static inner-classes that could be moved to package level. > > There have been a few attempts to improve FieldCache, namely LUCENE-831, > LUCENE-1579 and LUCENE-1749, but the overall situation remains the same: > There is a central registry for assigning Caches to IndexReader instances. > > I now propose the following: > > 1. Obsolete FieldCache and FieldCacheKey and provide index-specific, > extensible cache instances ("IndexCache"). IndexCaches provide common > caching functionality for all IndexReaders and may be extended (for example, > SegmentReader would have a SegmentReaderIndexCache and store different data > than a regular IndexCache) > > 2. Add the index-specific field cache (IndexFieldCache) to the > IndexCache. IndexFieldCache is an interface just like FieldCache and may > support different implementations. > > 3. The IndexCache instances may be flushed/closed by the associated > IndexReaders whenever necessary. > > 4. Obsolete FieldCacheSanityChecker because no more "insanities" are > expected (or at least, they do not impact the overall performance) > > 5. Refactor FieldCacheImpl and the related classes (FieldComparator, > SortField) > > I have provided an patch which takes care of all these issues. It passes > all JUnit tests. > > The patch is quite large, admittedly, but the change required several > modifications and some more to preserve backwards-compatibility. > Backwards-compatibility is preserved by moving some of the updated > functionality in the package org.apache.lucene.search.fields (field > comparators and parsers, SortField) while adding wrapper instances and > keeping old code in org.apache.lucene.search. > > In detail and besides the above mentioned improvements, the following is > provided: > > 1. An IndexCache specific for SegmentReaders. The two ThreadLocals are > moved from SegmentReader to SegmentReaderIndexCache. > > 2. A housekeeping improvement to CloseableThreadLocal. Now delegates the > close() method to all registered instances by calling an onClose() method > with the threads' instances. > > 3. Analyzer.close now may throw an IOException (this already is covered > by java.io.Closeable). > > 4. A change to Collector: allow IndexCache instead of IndexReader being > passed to setNextReader() > > 5. SortField's numeric types have been replaced by direct assignments of > FieldComparatorSource. This removes the "switch" statements and the > possibility to throw IllegalArgumentExceptions because of unsupported type > values. > > The following classes have been deprecated and replaced by new classes in > org.apache.lucene.search.fields: > > - FieldCacheRangeFilter (=> IndexFieldCacheRangeFilter) > > - FieldCacheTermsFilter (=> IndexFieldCacheTermsFilter) > > - FieldCache (=> IndexFieldCache) > > - FieldCacheImpl (=> IndexFieldCacheImpl) > > - all classes in FieldCacheImpl (=> several package-level classes) > > - all subclasses of FieldComparator (=> several package-level classes) > > Final notes: > > - The patch would be simpler if no backwards compatibility was necessary. > The Lucene community has to decide which classes/methods can immediately be > removed, which ones later, which not at all. Whenever new classes depend on > the old ones, an appropriate notice exists in the javadocs. > > - The patch introduces a new, deprecated class > IndexFieldCacheSanityChecker.java which is just there for testing purposes, > to show that no sanity checks are necessary any longer. This class may be > removed at any time. > > - I expect that the patch does not impact performance. On the contrary, > as the patch removes a few unnecessary checks we might even see a slight > speedup. No benchmarking has been done so far, though. > > - I have tried to preserve the existing functionality wherever possible > and to focus on the class/method structure only. We certainly may improve > the caches' behavior, but this out of scope for this patch. > > - The refactoring finally makes the high duplication of code visible: For > all supported atomic types (byte, double, float, int, long, short) three > classes each are required: *Cache, *Comparator and *Parser. I think that > further simplification might be possible (maybe using Java generics?), but I > guess the current patch is large enough for now. > > Cheers, > > Christian > > -- > This message is automatically generated by JIRA. > - > You can reply to this email to add a comment to the issue online. > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > >