[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121779#comment-13121779 ]
Uwe Schindler commented on LUCENE-1536: --------------------------------------- Hi Chris, hi Male, I was going to bed after my last post. I had a crisis with two facts in the new API, that do no play nicely together. I thought the whole night about it again and I also started to recode some details last evening, but all was not so fine (but I found lots of problems, so it's a good thing that I started to code - especially on several filters that are not so basic like those which only use FixedFitSet/OpenBitSet): # the hidden implementation of Bits is a nice idea, but has one big problem: Java is a strongly-typed language. If a DocIdSet implements Bits, but you want to wrap it using FilteredDocIdSet, this interface implementation might suddenly go away, because the wrapper class does not implement Bits. If we make FilteredDocIdSet implement Bits, its also wrong, as it might wrap another DocIdSet that is not random access. So I tend to keep DocIdSet abstrcat and let it only expose functions that return a Bits interface. The same is that DocIdSet does not directly implement DocIdSetIterator, it can just return one. So I would strongly recommend to add a method like iterator() that returns a impl and not rely on "marker interfaces". I would favor "Bits DocIdSet.bits()" - would be in line with the iterator method. If the implementing class like FixedBitSet implements it itsself and returns "this" is an implementation detail. If DocIdSet does not allow random access it should expose with an exception thrown by bits or if it returns null. Does not really matter to me. - In general a wrapper like FilteredDocIdSet can do this in one class, wrapping bits() would check if bits() returns non-null, and then wrap another wrapper around bits() that uses match() to filter. The impl of this class is fast and supports both (iterator and bits, if available). # the other thing, I dont like, is the setContainsOnlyLiveDocs setter on DocIdSet. It allows anybody to change the DocIdSet (which should have an API that exposes only read-access). Only classes like FixedBitSet that implement this read-only interface might be able to change it from their own API (means the setter might be in the various DocIdSet implementations in oal.util). A consumer of the filter should not be able to change the DocIdSet behaviour from outside using a public API. I started to rewrite this yesterday and only left the getter in DocIdSet, but added the setter to FixedBitSet, OpenBitSet, DocIdBitSet,... The setter in the abstract base class also violates unmodifiable of EMPTY_DOCIDSET. This impl should be "containsOnlyLiveDocs=true") and this must be unchangeable fixed. # Also DocIdSet is a class not really related solely to Filters, e.g. Scorer extends DocIdSetIterator or DocsEnum extends DocIdSetIterator, Solr Facetting uses DocIdSet. DocIdSet is just a holder class for a bunch of documents exposing a iterator (and a Bits API - this is why I want two getter methods and no interface magic)). The existence of live docs is outside it's scope. I therefore would like a similar API like for scorers, so IndexSearcher can ask the Filter for a DocIdSet based on the given liveDocs (like the scorer method in Weights). The returned DocIdSet would not know if it only has live Docs or not (as the Scorer itsself also does not expose this information). CachingWrapperFilter is little bit special, but this one would always ask the wrapped Filter for a DocidSet without deletions and cache that one, but always return a FilteredDocIdSet bringing the liveDocs passed from IndexSearcher in. The cache would then always be without LiveDocs and easier to maintain. Reopening segments would never need to reload cache. CachingWrapperFilter would just decide on the fact if IndexSearcher passes a liveDocs BitSet or not, if it needs to use it or not (in its own getDocIdSet method). If we have a query and only filter some documents, IndexSearcher already knows about liveDocs from the main scorer and would pass null to the filter. This would remove lots of additional checks to liveDocs. Only the main scorer would know about them, the filter will ignore them (so there is no overhead in CachingWrapperFilter, as it can return the cached filter directly to IndexSearcher, without wrapping). QueryWrapperFilter could pass the liveDocs through the wrapped filter, too. I may have time today to implement some parts of this, should not be to difficult. > if a filter can support random access API, we should use it > ----------------------------------------------------------- > > Key: LUCENE-1536 > URL: https://issues.apache.org/jira/browse/LUCENE-1536 > Project: Lucene - Java > Issue Type: Improvement > Components: core/search > Affects Versions: 2.4 > Reporter: Michael McCandless > Assignee: Michael McCandless > Priority: Minor > Labels: gsoc2011, lucene-gsoc-11, mentor > Fix For: 4.0 > > Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, > LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, > LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, > LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, > LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, > LUCENE-1536.patch, LUCENE-1536.patch > > > I ran some performance tests, comparing applying a filter via > random-access API instead of current trunk's iterator API. > This was inspired by LUCENE-1476, where we realized deletions should > really be implemented just like a filter, but then in testing found > that switching deletions to iterator was a very sizable performance > hit. > Some notes on the test: > * Index is first 2M docs of Wikipedia. Test machine is Mac OS X > 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. > * I test across multiple queries. 1-X means an OR query, eg 1-4 > means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 > AND 3 AND 4. "u s" means "united states" (phrase search). > * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, > 95, 98, 99, 99.99999 (filter is non-null but all bits are set), > 100 (filter=null, control)). > * Method high means I use random-access filter API in > IndexSearcher's main loop. Method low means I use random-access > filter API down in SegmentTermDocs (just like deleted docs > today). > * Baseline (QPS) is current trunk, where filter is applied as iterator up > "high" (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org