[jira] [Updated] (LUCENE-6198) two phase intersection

Adrien Grand (JIRA) Fri, 13 Feb 2015 01:37:39 -0800

     [ 
https://issues.apache.org/jira/browse/LUCENE-6198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Adrien Grand updated LUCENE-6198:
---------------------------------
    Attachment: LUCENE-6198.patch

I did some more benchmarking, and one thing that helped was flattening clauses 
in ConjunctionDISI. This typically means that {{+"A B" +C}} is now 
approximated as {{+A +B +C}} instead of {{+(+A +B) +C}} (see attached patch).
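
As a rough illustration of the flattening described above, here is a minimal sketch. The {{Clause}}, {{Term}}, {{Phrase}} and {{And}} types are hypothetical stand-ins invented for this example. They are not Lucene classes, and this is not the code in the attached patch. It only shows how a nested conjunction can be collapsed into one flat list of term-level approximations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlattenSketch {
    // Hypothetical mini query model (assumption: not Lucene's API).
    interface Clause {}

    static final class Term implements Clause {
        final String text;
        Term(String text) { this.text = text; }
    }

    static final class Phrase implements Clause {
        final List<String> terms;
        Phrase(String... terms) { this.terms = Arrays.asList(terms); }
    }

    static final class And implements Clause {
        final List<Clause> clauses;
        And(Clause... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    /** Returns the terms of the approximation as one flat conjunction. */
    static List<String> flatten(Clause clause) {
        List<String> out = new ArrayList<>();
        collect(clause, out);
        return out;
    }

    private static void collect(Clause clause, List<String> out) {
        if (clause instanceof Term) {
            out.add(((Term) clause).text);
        } else if (clause instanceof Phrase) {
            // A phrase is approximated by the conjunction of its terms.
            out.addAll(((Phrase) clause).terms);
        } else if (clause instanceof And) {
            // Recurse so nested conjunctions end up in the same flat list.
            for (Clause c : ((And) clause).clauses) {
                collect(c, out);
            }
        }
    }

    public static void main(String[] args) {
        // +"A B" +C
        Clause query = new And(new Phrase("A", "B"), new Term("C"));
        System.out.println(flatten(query)); // prints [A, B, C]
    }
}
```

With a flat list, a single conjunction can choose which of A, B and C to advance first, rather than resolving the inner {{+A +B}} pair before looking at C.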

Here are results on wikibig:

{noformat}
                    Task   QPS baseline   StdDev   QPS patch    StdDev                Pct diff
    AndMedPhraseHighTerm          21.19   (6.1%)       19.98    (2.6%)   -5.7% ( -13% -    3%)
                PKLookup         334.11   (2.1%)      334.82    (2.2%)    0.2% (  -4% -    4%)
   AndHighPhraseHighTerm          11.64   (4.1%)       11.83    (2.4%)    1.6% (  -4% -    8%)
    AndHighPhraseMedTerm          19.19   (2.5%)       21.99    (2.1%)   14.6% (   9% -   19%)
     AndMedPhraseMedTerm          58.27   (6.3%)       67.53    (6.6%)   15.9% (   2% -   30%)
    AndHighPhraseLowTerm          35.07   (5.6%)       42.46    (6.1%)   21.1% (   8% -   34%)
     AndMedPhraseLowTerm          93.39   (8.0%)      128.24   (13.3%)   37.3% (  14% -   63%)
{noformat}

I was curious about the slowdown on AndMedPhraseHighTerm. It seems to be tied 
to the fact that the terms are not random, i.e. they tend to co-occur. For 
instance, one query of this task is {{+"los angeles" +title}}, which matches 
30669 documents. The approximation {{+los +angeles +title}} matches 30711 
documents, only 42 more (about 0.1%), so in this case the approximation 
filters out almost nothing and only adds overhead.

> two phase intersection
> ----------------------
>
>                 Key: LUCENE-6198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6198
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-6198.patch, LUCENE-6198.patch, LUCENE-6198.patch, 
> LUCENE-6198.patch, LUCENE-6198.patch, phrase_intersections.tasks
>
>
> Currently some scorers have to do a lot of per-document work to determine if 
> a document is a match. The simplest example is a phrase scorer, but there are 
> others (spans, sloppy phrase, geospatial, etc.).
> Imagine a conjunction with two MUST clauses, one that is a term that matches 
> all odd documents, another that is a phrase matching all even documents. 
> Today this conjunction will be very expensive, because the zig-zag 
> intersection is reading a ton of useless positions.
> The same problem happens with FilteredQuery and anything else that acts like 
> a conjunction.
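
To make the idea in the description above concrete, here is a minimal sketch of a two-phase intersection over sorted doc ID arrays. Everything in it is an assumption made for illustration: the {{intersect}} helper, the array-based postings and the {{costlyMatch}} predicate are hypothetical and are not Lucene's API or the attached patch. The cheap phase intersects the term's postings with the phrase's approximation. The costly positions check runs only on documents that survive that intersection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class TwoPhaseSketch {
    /**
     * Intersects two sorted doc ID arrays (phase 1), then confirms each
     * candidate with the costly per-document check (phase 2).
     */
    static List<Integer> intersect(int[] termDocs, int[] approximation, IntPredicate costlyMatch) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < termDocs.length && j < approximation.length) {
            if (termDocs[i] < approximation[j]) {
                i++;
            } else if (termDocs[i] > approximation[j]) {
                j++;
            } else {
                // Both cheap iterators agree: only now pay for the costly check.
                if (costlyMatch.test(termDocs[i])) {
                    hits.add(termDocs[i]);
                }
                i++;
                j++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] termDocs = {1, 3, 5, 7, 9};                       // term matches odd docs
        int[] approximation = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};   // docs containing all phrase terms
        int[] costlyChecks = {0};
        // Stand-in for reading positions: the phrase really matches even docs only.
        IntPredicate phraseMatches = doc -> {
            costlyChecks[0]++;
            return doc % 2 == 0;
        };
        List<Integer> hits = intersect(termDocs, approximation, phraseMatches);
        System.out.println("hits=" + hits + " costlyChecks=" + costlyChecks[0]);
        // prints hits=[] costlyChecks=5
    }
}
```

In this toy run the costly check is invoked five times, once per document on which both cheap iterators agree, and never on a document the term clause has already ruled out.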



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
