[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe Schindler updated LUCENE-1536:
----------------------------------

    Attachment: changes-yonik-uwe.patch
                LUCENE-1536.patch
                LUCENE-1536-rewrite.patch

Attached you will find a new patch, LUCENE-1536.patch, incorporating Yonik's changes plus some minor improvements:
- Changed the Javadocs of DocIdSet.bits() to explain what you should and should not do.
- Added another early-exit condition in FilteredQuery#Weight.scorer(): as we already get the first matching doc of the filter iterator before looking at bits() or creating the query scorer, we should exit early if that first matching doc is DocIdSetIterator.NO_MORE_DOCS. This saves us from creating the query scorer at all.
- Removed Robert's safety TODO in SolrIndexSearcher; it no longer disables random access completely. After Yonik's changes, all places in Solr that are not random-access safe are disabled, e.g. SolrIndexSearcher.FilterImpl (not sure what this class does; maybe it should also implement bits()?) - we should do that in a Solr-specific optimization issue.

Another cool thing with filters is ANDing filters together without ChainedFilter (this approach is very effective with random access, as it does not allocate additional BitSets). If you want to AND together several filters and apply them to a Query, do the following:

{code:java}
IS.search(new FilteredQuery(query, filter2), filter1, ...);
{code}

You can chain in even more filters by adding more FilteredQueries. What this does: IS will automatically create another FilteredQuery to apply the filter and get the Weight of the top-level FilteredQuery. The scorer of this one will be top-level: it gets the filter and, if it is random access, executes the filter with acceptDocs == liveDocs. The resulting bits of this filter are passed as acceptDocs to Weight.scorer() of the second FilteredQuery. That one passes the acceptDocs (which are already filtered) on to its own Filter and, if that is again random access, passes the result as acceptDocs to the inner Query's scorer.
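The acceptDocs chaining described above can be sketched with a small self-contained model. Note this is illustrative code, not Lucene itself: the simplified Bits interface is modeled on org.apache.lucene.util.Bits, and applyFilter/fromArray are hypothetical helpers standing in for what a random-access Filter and Weight.scorer() do with acceptDocs.

```java
// Minimal model of acceptDocs chaining: each "filter" intersects its own
// random-access bits with the acceptDocs handed down from the level above,
// so the innermost scorer sees a single pre-ANDed Bits view and no
// wrapper scorers are needed.
public class AcceptDocsChaining {

    /** Simplified random-access view, modeled on org.apache.lucene.util.Bits. */
    interface Bits {
        boolean get(int doc);
        int length();
    }

    /** Models a random-access filter: ANDs its bits with the incoming acceptDocs. */
    static Bits applyFilter(final Bits filterBits, final Bits acceptDocs) {
        if (acceptDocs == null) {
            return filterBits; // top level: nothing has been filtered out yet
        }
        return new Bits() {
            public boolean get(int doc) { return filterBits.get(doc) && acceptDocs.get(doc); }
            public int length() { return filterBits.length(); }
        };
    }

    /** Convenience factory for fixed bit patterns. */
    static Bits fromArray(final boolean[] bits) {
        return new Bits() {
            public boolean get(int doc) { return bits[doc]; }
            public int length() { return bits.length; }
        };
    }

    public static void main(String[] args) {
        Bits liveDocs = fromArray(new boolean[] { true, true, true, true, true });
        Bits filter1  = fromArray(new boolean[] { true, false, true, true, false });
        Bits filter2  = fromArray(new boolean[] { true, true, false, true, false });

        // Outer FilteredQuery: filter1 runs with acceptDocs == liveDocs...
        Bits afterFilter1 = applyFilter(filter1, liveDocs);
        // ...and its result becomes the acceptDocs of the inner FilteredQuery.
        Bits acceptDocs = applyFilter(filter2, afterFilter1);

        // The inner query's scorer now only ever sees docs 0 and 3.
        for (int doc = 0; doc < acceptDocs.length(); doc++) {
            System.out.println(doc + " -> " + acceptDocs.get(doc));
        }
    }
}
```

Because every level only wraps a Bits view rather than a scorer, the composition is free of per-document iterator overhead - exactly why no additional BitSets need to be allocated.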
Finally, the top-level IS executes scorer.score(Collector), which is in fact the inner Query's scorer (no wrappers!) with all filtering applied through acceptDocs. This is incredibly cool :-)

One thing about large patches in an issue: if you are working on an issue, have local changes in your checkout, have posted a patch, and somebody else then posts an updated patch, it is often nice to see the diff between the two patches. I wanted to see what Yonik changed, but a 140 KB patch is not easy to handle. The trick is "interdiff" from the patchutils package: you can call "interdiff LUCENE-1536-original.patch LUCENE-1536-yonik.patch" and you get a patch containing only the changes applied by Yonik. This patch can even be applied to your local, already-patched checkout. The changes-yonik-uwe.patch was generated that way and shows what I changed in my last patch relative to Yonik's original.

> if a filter can support random access API, we should use it
> -----------------------------------------------------------
>
>                 Key: LUCENE-1536
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1536
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>             Fix For: 4.0
>
>         Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch,
> LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch,
> LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch,
> LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> changes-yonik-uwe.patch, luceneutil.patch
>
>
> I ran some performance tests, comparing applying a filter via the
> random-access API instead of current trunk's iterator API.
> This was inspired by LUCENE-1476, where we realized deletions should
> really be implemented just like a filter, but then in testing found
> that switching deletions to an iterator was a very sizable performance
> hit.
> Some notes on the test:
> * Index is the first 2M docs of Wikipedia. Test machine is Mac OS X
>   10.5.6, quad-core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
> * I test across multiple queries. 1-X means an OR query, e.g. 1-4
>   means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, i.e. 1 AND 2
>   AND 3 AND 4. "u s" means "united states" (phrase search).
> * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
>   95, 98, 99, 99.99999 (filter is non-null but all bits are set),
>   100 (filter=null, control)).
> * Method "high" means I use the random-access filter API in
>   IndexSearcher's main loop. Method "low" means I use the random-access
>   filter API down in SegmentTermDocs (just like deleted docs today).
> * Baseline (QPS) is current trunk, where the filter is applied as an
>   iterator up "high" (i.e. in IndexSearcher's search loop).