[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it

Uwe Schindler (Commented) (JIRA) Thu, 06 Oct 2011 00:52:56 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121779#comment-13121779
 ]


Uwe Schindler commented on LUCENE-1536:
---------------------------------------

Hi Chris, hi Male,

I was going to bed after my last post. I had a crisis with two facts in the new 
API, that do no play nicely together. I thought the whole night about it again 
and I also started to recode some details last evening, but all was not so fine 
(but I found lots of problems, so it's a good thing that I started to code - 
especially on several filters that are not so basic like those which only use 
FixedFitSet/OpenBitSet):

# the hidden implementation of Bits is a nice idea, but has one big problem: 
Java is a strongly-typed language. If a DocIdSet implements Bits, but you want 
to wrap it using FilteredDocIdSet, this interface implementation  might 
suddenly go away, because the wrapper class does not implement Bits. If we make 
FilteredDocIdSet implement Bits, its also wrong, as it might wrap another 
DocIdSet that is not random access. So I tend to keep DocIdSet abstrcat and let 
it only expose functions that return a Bits interface. The same is that 
DocIdSet does not directly implement DocIdSetIterator, it can just return one. 
So I would strongly recommend to add a method like iterator() that returns a 
impl and not rely on "marker interfaces". I would favor "Bits DocIdSet.bits()" 
- would be in line with the iterator method. If the implementing class like 
FixedBitSet implements it itsself and returns "this" is an implementation 
detail. If DocIdSet does not allow random access it should expose with an 
exception thrown by bits or if it returns null. Does not really matter to me. - 
In general a wrapper like FilteredDocIdSet can do this in one class, wrapping 
bits() would check if bits() returns non-null, and then wrap another wrapper 
around bits() that uses match() to filter. The impl of this class is fast and 
supports both (iterator and bits, if available).
# the other thing, I dont like, is the setContainsOnlyLiveDocs setter on 
DocIdSet. It allows anybody to change the DocIdSet (which should have an API 
that exposes only read-access). Only classes like FixedBitSet that implement 
this read-only interface might be able to change it from their own API (means 
the setter might be in the various DocIdSet implementations in oal.util). A 
consumer of the filter should not be able to change the DocIdSet behaviour from 
outside using a public API. I started to rewrite this yesterday and only left 
the getter in DocIdSet, but added the setter to FixedBitSet, OpenBitSet, 
DocIdBitSet,... The setter in the abstract base class also violates 
unmodifiable of EMPTY_DOCIDSET. This impl should be 
"containsOnlyLiveDocs=true") and this must be unchangeable fixed.
# Also DocIdSet is a class not really related solely to Filters, e.g. Scorer 
extends DocIdSetIterator or DocsEnum extends DocIdSetIterator, Solr Facetting 
uses DocIdSet. DocIdSet is just a holder class for a bunch of documents 
exposing a iterator (and a Bits API - this is why I want two getter methods and 
no interface magic)). The existence of live docs is outside it's scope. I 
therefore would like a similar API like for scorers, so IndexSearcher can ask 
the Filter for a DocIdSet based on the given liveDocs (like the scorer method 
in Weights). The returned DocIdSet would not know if it only has live Docs or 
not (as the Scorer itsself also does not expose this information). 
CachingWrapperFilter is little bit special, but this one would always ask the 
wrapped Filter for a DocidSet without deletions and cache that one, but always 
return a FilteredDocIdSet bringing the liveDocs passed from IndexSearcher in. 
The cache would then always be without LiveDocs and easier to maintain. 
Reopening segments would never need to reload cache. CachingWrapperFilter would 
just decide on the fact if IndexSearcher passes a liveDocs BitSet or not, if it 
needs to use it or not (in its own getDocIdSet method). If we have a query and 
only filter some documents, IndexSearcher already knows about liveDocs from the 
main scorer and would pass null to the filter. This would remove lots of 
additional checks to liveDocs. Only the main scorer would know about them, the 
filter will ignore them (so there is no overhead in CachingWrapperFilter, as it 
can return the cached filter directly to IndexSearcher, without wrapping). 
QueryWrapperFilter could pass the liveDocs through the wrapped filter, too.

I may have time today to implement some parts of this, should not be to 
difficult.
                
> if a filter can support random access API, we should use it
> -----------------------------------------------------------
>
>                 Key: LUCENE-1536
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1536
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>             Fix For: 4.0
>
>         Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
> LUCENE-1536.patch, LUCENE-1536.patch
>
>
> I ran some performance tests, comparing applying a filter via
> random-access API instead of current trunk's iterator API.
> This was inspired by LUCENE-1476, where we realized deletions should
> really be implemented just like a filter, but then in testing found
> that switching deletions to iterator was a very sizable performance
> hit.
> Some notes on the test:
>   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
>     10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
>   * I test across multiple queries.  1-X means an OR query, eg 1-4
>     means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
>     AND 3 AND 4.  "u s" means "united states" (phrase search).
>   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
>     95, 98, 99, 99.99999 (filter is non-null but all bits are set),
>     100 (filter=null, control)).
>   * Method high means I use random-access filter API in
>     IndexSearcher's main loop.  Method low means I use random-access
>     filter API down in SegmentTermDocs (just like deleted docs
>     today).
>   * Baseline (QPS) is current trunk, where filter is applied as iterator up
>     "high" (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it

Reply via email to