[jira] Updated: (LUCENE-2690) Do MultiTermQuery boolean rewrites per segment

Michael McCandless (JIRA) Sun, 10 Oct 2010 04:22:58 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael McCandless updated LUCENE-2690:
---------------------------------------

    Attachment: LUCENE-2690-hack.patch


I attached a hacked patch... nowhere near committable, various tests
fail, etc... yet I think once we clean it up, the approach is viable.

I started from the patch from about two iterations ago, and then
changed how the MTQ BQ rewrite works: instead of two passes (first to
gather matching terms, second to create weight/scorers & run the BQ),
it now makes a single pass.

In that single pass it records which terms matched which segments, and
creates a TermScorer for each.

After the single pass, once we've summed up the top-level docFreq for
all terms, I go back and reset the weights for all the TermScorers,
sum their squared weights (sumSQ), normalize, etc., and then create a
FakeQuery object whose only purpose is to remember the per-segment
scorers and provide them once .scorer(...) is called on each segment.
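
The bookkeeping in that single pass can be sketched roughly as below.
This is only an illustration in plain Java, not code from the patch:
SinglePassSketch, TermState and collect are hypothetical names, and
each segment is modeled as a simple map from term to its docFreq in
that segment rather than a real Lucene reader.  It only shows recording
which segments each term matched and summing the top-level docFreq;
scorers, weights and normalization are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in types; none of these are Lucene classes.
class SinglePassSketch {

    /** Per-term bookkeeping gathered during the single pass. */
    static final class TermState {
        // Ordinals of the segments in which the term actually exists.
        final List<Integer> segments = new ArrayList<>();
        // Top-level docFreq, summed over all segments that contain the term.
        int totalDocFreq = 0;
    }

    /**
     * segmentTerms.get(i) maps each matching term in segment i to its
     * docFreq in that segment (an assumption made for this sketch).
     */
    static Map<String, TermState> collect(List<Map<String, Integer>> segmentTerms) {
        Map<String, TermState> states = new LinkedHashMap<>();
        for (int seg = 0; seg < segmentTerms.size(); seg++) {
            for (Map.Entry<String, Integer> e : segmentTerms.get(seg).entrySet()) {
                TermState st = states.computeIfAbsent(e.getKey(), k -> new TermState());
                st.segments.add(seg);             // remember where the term matched
                st.totalDocFreq += e.getValue();  // accumulate top-level docFreq
            }
        }
        return states;
    }

    public static void main(String[] args) {
        Map<String, Integer> seg0 = new HashMap<>();
        seg0.put("unit", 3);
        seg0.put("united", 2);
        Map<String, Integer> seg1 = new HashMap<>();
        seg1.put("unit", 4);

        List<Map<String, Integer>> segments = new ArrayList<>();
        segments.add(seg0);
        segments.add(seg1);

        for (Map.Entry<String, TermState> e : collect(segments).entrySet()) {
            System.out.println(e.getKey() + " segments=" + e.getValue().segments
                + " docFreq=" + e.getValue().totalDocFreq);
        }
    }
}
```

Because each term carries the list of segments it matched, a later
per-segment step only needs to visit terms known to exist there.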

The big gain with this approach is you don't waste effort trying to
seek to non-existent terms in the sub readers.  Normally the terms
cache would save you here, but we never cache a miss, so when we
try to look that term up again it's always a real (costly) seek.

With this approach we can disable using the terms cache entirely from
MTQ.rewrite, which is great.
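
To illustrate why an uncached miss hurts, here is a toy model of a
cache that stores hits but not misses.  It is a sketch under stated
assumptions, not Lucene's terms cache: TermsCacheSketch, lookup and
the seeks counter are hypothetical, and a "seek" is just a map lookup
that we count.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; this is not Lucene's terms cache implementation.
class TermsCacheSketch {
    // term -> docFreq for one hypothetical segment
    private final Map<String, Integer> index;
    // caches successful lookups only
    private final Map<String, Integer> cache = new HashMap<>();
    // number of simulated costly seeks performed so far
    int seeks = 0;

    TermsCacheSketch(Map<String, Integer> index) {
        this.index = index;
    }

    /** Returns the docFreq of the term, or -1 if the segment lacks it. */
    int lookup(String term) {
        Integer cached = cache.get(term);
        if (cached != null) {
            return cached;          // hit served from the cache, no seek
        }
        seeks++;                    // simulated seek into the terms dictionary
        Integer docFreq = index.get(term);
        if (docFreq == null) {
            return -1;              // miss is not cached, so the next lookup seeks again
        }
        cache.put(term, docFreq);
        return docFreq;
    }

    public static void main(String[] args) {
        Map<String, Integer> index = new HashMap<>();
        index.put("unit", 3);
        TermsCacheSketch c = new TermsCacheSketch(index);
        c.lookup("unit");
        c.lookup("unit");           // served from the cache
        c.lookup("united");
        c.lookup("united");         // seeks again because the miss was not cached
        System.out.println("seeks=" + c.seeks);
    }
}
```

In this model every repeated lookup of an absent term pays for a seek,
which is the cost avoided by only visiting terms already known to
exist in a segment.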

I believe the patch works correctly, at least for this test, because
on my 10M Wikipedia index it returns the same top N results as clean
trunk.  Here are the perf gains:

||Query||QPS clean||QPS mtqseg||Pct diff||
|state|37.49|37.40|{color:red}-0.2%{color}|
|unit*|11.86|20.23|{color:green}70.5%{color}|
|un*d|13.58|30.85|{color:green}127.2%{color}|
|uni*ed|173.22|535.27|{color:green}209.0%{color}|
|u*d|2.61|9.05|{color:green}247.3%{color}|
|un*ed|33.59|120.32|{color:green}258.1%{color}|
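
For reference, the Pct diff column appears to be the relative change
in QPS against clean trunk.  A minimal sketch of that arithmetic,
assuming that definition (PctDiff and pctDiff are hypothetical names);
recomputing from the rounded QPS values shown above can differ from
the table in the last digit:

```java
// Assumes Pct diff = (QPS mtqseg - QPS clean) / QPS clean * 100.
class PctDiff {
    static double pctDiff(double qpsClean, double qpsPatched) {
        return (qpsPatched - qpsClean) / qpsClean * 100.0;
    }

    public static void main(String[] args) {
        // un*d row: 13.58 QPS on clean trunk, 30.85 QPS with the patch
        System.out.printf("%.1f%%%n", pctDiff(13.58, 30.85));
    }
}
```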

Note that these gains already include the sizable gains from the
original patch, but the single-pass approach adds substantial further
gains, especially, e.g., on the prefix query.

I don't think we should couple this new patch w/ this issue... this
issue already has awesome gains with a fairly minor change...
I'll open a new issue.


> Do MultiTermQuery boolean rewrites per segment
> ----------------------------------------------
>
>                 Key: LUCENE-2690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2690
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.0
>
>         Attachments: LUCENE-2690-hack.patch, LUCENE-2690.patch, 
> LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch, LUCENE-2690.patch
>
>
> MultiTermQuery currently rewrites FuzzyQuery (using 
> TopTermsBooleanQueryRewrite), the auto constant rewrite method and the 
> ScoringBQ rewrite methods using a MultiFields wrapper on the top-level 
> reader. This is inefficient.
> This patch changes the rewrite modes to do the rewrites per segment and uses 
> some additional data structures (hashed sets/maps) to exclude duplicate terms. 
> All tests currently pass, but FuzzyQuery's tests should not, because its 
> minimum score handling depends on the terms being collected in order.
> Robert will fix FuzzyQuery in this issue, too. This patch is just a start.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
