[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it

Michael McCandless (JIRA) Sat, 06 Nov 2010 03:31:10 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928948#action_12928948
 ]


Michael McCandless commented on LUCENE-1536:
--------------------------------------------

bq. Wondering what your thoughts on fixing filters correctly are?

I think the approach you outlined is the right one!

We already have the APIs in flex (Bits interface for random access, postings 
APIs take a Bits skipDocs); in backporting to 3.x I think we'd just port Bits 
back.

There are some challenges though:

  * We should add a method to Filter to ask it if it's already folded in deleted 
docs or not.  So eg if a Filter is random access but doesn't factor in del docs 
we'd have to wrap it so that every random access check also checks del docs 
("AND NOT deleted.get(docID)").

  * We need a coarse heuristic in IndexSearcher to decide when a filter 
"merits" down low application.  Ie, even if a filter is random access, if it's 
rather sparse (< 1% or 2% or something) it's better to apply it the way we do 
today ("up high").  In the current patch it's too coarse (it's either globally 
on or off); it should be based on the filter instead, or maybe the filter 
provides a method and that method defaults to the 1% or 2% threshold check.

  * I suspect we should invert the "Bits skipDocs" now passed to the flex APIs, 
to be "Bits acceptDocs" instead, so that we don't have to invert every filter.  
This'd also mean changing IndexReader.getDeletedDocs to 
IndexReader.getNotDeleteDocs.
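
As a rough sketch of the first and third points above (not code from the 
patch): the Bits interface below is a local stand-in with the same shape as 
the flex one, and AcceptDocsSketch, fromBitSet, andNotDeleted and invert are 
hypothetical names used only for illustration. It assumes deleted docs are 
available as random-access bits.

```java
import java.util.BitSet;

/** Local stand-in with the same shape as the flex Bits interface (assumption). */
interface Bits {
    boolean get(int index);
    int length();
}

/** Hypothetical helpers; none of these names exist in Lucene. */
final class AcceptDocsSketch {

    /** Exposes a java.util.BitSet through the random-access Bits view. */
    static Bits fromBitSet(final BitSet bits, final int maxDoc) {
        return new Bits() {
            @Override public boolean get(int index) { return bits.get(index); }
            @Override public int length() { return maxDoc; }
        };
    }

    /**
     * Wraps a filter that has NOT folded in deletions, so every random-access
     * check also checks deleted docs: filter AND NOT deleted.get(docID).
     */
    static Bits andNotDeleted(final Bits filter, final Bits deleted) {
        if (deleted == null) {
            return filter; // no deletions in this segment, nothing to wrap
        }
        return new Bits() {
            @Override public boolean get(int docID) {
                return filter.get(docID) && !deleted.get(docID);
            }
            @Override public int length() { return filter.length(); }
        };
    }

    /** Turns "skipDocs" semantics into "acceptDocs" semantics. */
    static Bits invert(final Bits skipDocs) {
        return new Bits() {
            @Override public boolean get(int docID) { return !skipDocs.get(docID); }
            @Override public int length() { return skipDocs.length(); }
        };
    }

    public static void main(String[] args) {
        BitSet filterBits = new BitSet();
        filterBits.set(0, 3);            // filter accepts docs 0, 1, 2
        BitSet deletedBits = new BitSet();
        deletedBits.set(1);              // doc 1 is deleted

        Bits acceptDocs = andNotDeleted(fromBitSet(filterBits, 4),
                                        fromBitSet(deletedBits, 4));
        for (int docID = 0; docID < acceptDocs.length(); docID++) {
            System.out.println(docID + ": " + acceptDocs.get(docID));
        }
    }
}
```

If the postings APIs took acceptDocs directly, a filter that already folds in 
deletions could be passed through as-is, and only the deleted docs themselves 
would need inverting once.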

Then I think we simply pass the Bits filter into the Weight.scorer API.

{quote}
I think that any type of solution should support the great feature of Lucene 
queries, for example, FilteredQuery should use that, allowing to build complex 
query expressions without having the mentioned optimization only applied on the 
top level search.
{quote}
Good point -- FilteredQuery should use this same low level API if its filter is 
random access and "dense enough".
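
A minimal sketch of what such a "dense enough" check might look like, assuming 
the filter's bits are cheap to count. FilterDensitySketch, useRandomAccess and 
DEFAULT_MIN_DENSITY are hypothetical names, and the 1% default is only the 
ballpark figure mentioned above, not a measured cutoff.

```java
import java.util.BitSet;

/** Hypothetical heuristic; not part of the patch or of Lucene. */
final class FilterDensitySketch {

    /** Assumed default: below roughly 1% density, keep applying the filter "up high". */
    static final double DEFAULT_MIN_DENSITY = 0.01;

    /**
     * Returns true if the filter accepts a large enough fraction of the
     * segment's docs to merit random-access ("down low") application.
     */
    static boolean useRandomAccess(BitSet filterBits, int maxDoc, double minDensity) {
        if (maxDoc <= 0) {
            return false; // empty segment, nothing to gain
        }
        double density = (double) filterBits.cardinality() / maxDoc;
        return density >= minDensity;
    }

    public static void main(String[] args) {
        int maxDoc = 1000;

        BitSet sparse = new BitSet(maxDoc);
        sparse.set(0, 5);     // 0.5% of docs accepted
        BitSet dense = new BitSet(maxDoc);
        dense.set(0, 500);    // 50% of docs accepted

        System.out.println("sparse: " + useRandomAccess(sparse, maxDoc, DEFAULT_MIN_DENSITY));
        System.out.println("dense:  " + useRandomAccess(dense, maxDoc, DEFAULT_MIN_DENSITY));
    }
}
```

A Filter could expose a method like this itself, so that implementations which 
know their own cost can override the default threshold.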

{quote}
As most filter results do support random access, either because they use 
OpenBitSet, or because they are built on top of FieldCache functionality, I 
think this feature will give great speed improvements to the query execution 
time.
{quote}

Right, the speed gains are often awesome!

> if a filter can support random access API, we should use it
> -----------------------------------------------------------
>
>                 Key: LUCENE-1536
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1536
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch
>
>
> I ran some performance tests, comparing applying a filter via
> random-access API instead of current trunk's iterator API.
> This was inspired by LUCENE-1476, where we realized deletions should
> really be implemented just like a filter, but then in testing found
> that switching deletions to iterator was a very sizable performance
> hit.
> Some notes on the test:
>   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
>     10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
>   * I test across multiple queries.  1-X means an OR query, eg 1-4
>     means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
>     AND 3 AND 4.  "u s" means "united states" (phrase search).
>   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
>     95, 98, 99, 99.99999 (filter is non-null but all bits are set),
>     100 (filter=null, control)).
>   * Method high means I use random-access filter API in
>     IndexSearcher's main loop.  Method low means I use random-access
>     filter API down in SegmentTermDocs (just like deleted docs
>     today).
>   * Baseline (QPS) is current trunk, where filter is applied as iterator up
>     "high" (ie in IndexSearcher's search loop).

