[jira] [Commented] (LUCENE-1252) Avoid using positions when not all required terms are present

Robert Muir (JIRA) Tue, 21 Oct 2014 06:11:33 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178374#comment-14178374
 ]


Robert Muir commented on LUCENE-1252:
-------------------------------------

Both of those are covered. Yes, it's proposed as a first-class citizen (on DISI), 
more first-class than your "second-class" rewrite() solution.

It might be good to already look at what BQ and co do with query execution 
today. The fact is, rewrite() is inappropriate.

The main reason rewrite() is second-class here is that scoring is 
*per-segment* in Lucene. In trunk today this means:
* filters might get executed with a completely different "plan" for different 
segments (conjunction versus Bits->acceptDocs).
* scorers might get returned as null for some segments (e.g. the term doesn't 
exist). BQ doesn't have special null handling in every subscorer that it uses; 
that would be horrible. Instead it eliminates such scorers immediately in 
BooleanWeight's constructor. 
* this, like the above, can impact the structure of what BQ does (e.g. A or B: if 
B is null, it just returns A instead of a DisjunctionScorer).
* scorers have a cost() API, which tells us additional critical information for 
execution. I've proposed expanding this to much more (additional flags/methods) 
so that BQ can be a quarterback and not the entire football team.
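
As a rough illustration of the null-elimination point above (a sketch only, not Lucene's code: SketchScorer, LeafScorer, DisjunctionSketch and PerSegmentPlanSketch are hypothetical names), a per-segment disjunction can drop null subscorers up front and collapse to the single survivor:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; these are NOT Lucene's classes.
interface SketchScorer {
    long cost();       // estimated number of documents this scorer may visit
    String describe(); // readable form of the per-segment "plan"
}

final class LeafScorer implements SketchScorer {
    private final String name;
    private final long cost;
    LeafScorer(String name, long cost) { this.name = name; this.cost = cost; }
    public long cost() { return cost; }
    public String describe() { return name; }
}

final class DisjunctionSketch implements SketchScorer {
    private final List<SketchScorer> subs;
    DisjunctionSketch(List<SketchScorer> subs) { this.subs = subs; }
    public long cost() {
        long sum = 0; // a disjunction may visit every doc of every clause
        for (SketchScorer s : subs) sum += s.cost();
        return sum;
    }
    public String describe() {
        StringBuilder sb = new StringBuilder("OR(");
        for (int i = 0; i < subs.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(subs.get(i).describe());
        }
        return sb.append(')').toString();
    }
}

final class PerSegmentPlanSketch {
    /** Builds the scorer for ONE segment; a null entry means "no matches in this segment". */
    static SketchScorer disjunction(List<SketchScorer> perSegmentScorers) {
        List<SketchScorer> live = new ArrayList<>();
        for (SketchScorer s : perSegmentScorers) {
            if (s != null) live.add(s); // eliminate nulls once, up front
        }
        if (live.isEmpty()) return null;          // nothing can match in this segment
        if (live.size() == 1) return live.get(0); // "A or B" with B null is just A
        return new DisjunctionSketch(live);
    }
}
```

The same query can therefore end up with a different scorer tree in each segment, which a query-level rewrite() cannot see.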

Another reason rewrite() is wrong here is that such two-phase execution is 
only useful for conjunctions (or conjunction-like queries such as 
MinShouldMatch). Otherwise, it's useless. Rewriting to two queries when the 
query is not a conjunction will not help performance.

At the low level, I can't see a rewrite() solution being efficient. Today 
ExactPhraseScorer is already coded as:

{code}
while (iterate approximation...) {
  confirm(); // phraseFreq > 0
}
{code}

So in that case, it's already ready to be exposed for two-phase execution.
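
To make the loop above concrete, here is a minimal, self-contained sketch of two-phase execution inside a conjunction. It is an illustration under stated assumptions, not the proposed DISI API: TwoPhaseSketch, Approximation and the confirm predicate are hypothetical names, and the doc IDs are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical illustration; names and shapes are assumptions, not Lucene API.
final class TwoPhaseSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Cheap iterator over sorted candidate doc IDs (a superset of the true matches). */
    static final class Approximation {
        private final int[] docs;
        private int idx = -1;
        Approximation(int... sortedDocs) { this.docs = sortedDocs; }
        int docID() {
            if (idx < 0) return -1;
            return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
        }
        /** Positions on the first candidate >= target (linear scan keeps the sketch short). */
        int advance(int target) {
            while (idx < docs.length && (idx < 0 || docs[idx] < target)) idx++;
            return docID();
        }
    }

    /**
     * Intersects the two approximations first and runs the expensive confirmation
     * (e.g. a position check, phraseFreq > 0) only on docs that both sides agree on.
     * confirmCalls[0] counts how often the expensive step ran.
     */
    static List<Integer> conjunction(Approximation phraseApprox, IntPredicate confirmPhrase,
                                     Approximation other, int[] confirmCalls) {
        List<Integer> hits = new ArrayList<>();
        int doc = phraseApprox.advance(0);
        while (doc != NO_MORE_DOCS) {
            int otherDoc = other.advance(doc);
            if (otherDoc == doc) {
                confirmCalls[0]++;                   // phase two: only for agreed candidates
                if (confirmPhrase.test(doc)) hits.add(doc);
                doc = phraseApprox.advance(doc + 1);
            } else {
                doc = phraseApprox.advance(otherDoc); // skip ahead without touching positions
            }
        }
        return hits;
    }
}
```

With phrase candidates 1, 4, 7, 9 (true phrase matches 4 and 9) and another required clause matching 2, 4, 7, the confirmation runs only for docs 4 and 7 instead of for all four candidates.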

On the other hand, with rewrite() it would either be two separate 
DocsAndPositionsEnums, or a mess? 

Finally, by having this generic on DISI, this "approximation" becomes much more 
flexible. For example, a postings format could implement it for its DocsEnums, 
if it is able to implement a cheap approximation... perhaps via skiplist-like 
data, or perhaps with an explicit "approximate" cache mechanism specifically 
for this purpose. 

Another example: a filter such as a geographic distance filter could return a 
bounding box here automatically, rather than forcing the user to deal with this 
themselves with complex query/filter hacks. At the same time, such a filter 
could signal that it has a linear-time nextDoc, so BooleanQuery can really pick 
the best execution. A rewrite() solution makes this hard, because the 
two things would then be completely separate queries. But if you look at all 
the cases of executing this logic, it's very useful for BQ to know that A is a 
superset of B.
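
A small sketch of the "A is a superset of B" idea for the distance-filter case. Everything here is an assumption made for illustration: DistanceFilterSketch is a hypothetical class, and planar coordinates with Euclidean distance stand in for real geographic math.

```java
// Hypothetical sketch; planar coordinates and Euclidean distance are simplifying assumptions.
final class DistanceFilterSketch {
    private final double cx, cy, radius;

    DistanceFilterSketch(double cx, double cy, double radius) {
        this.cx = cx;
        this.cy = cy;
        this.radius = radius;
    }

    /** Cheap approximation: the square enclosing the circle. It never rejects a true match. */
    boolean approximate(double x, double y) {
        return Math.abs(x - cx) <= radius && Math.abs(y - cy) <= radius;
    }

    /** Exact, more expensive check, meant to run only on points that passed approximate(). */
    boolean confirm(double x, double y) {
        double dx = x - cx;
        double dy = y - cy;
        return dx * dx + dy * dy <= radius * radius;
    }
}
```

Because both checks live on the same object, a caller can rely on every point accepted by confirm() also being accepted by approximate(), which is the relationship two independently rewritten queries would not expose.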


> Avoid using positions when not all required terms are present
> -------------------------------------------------------------
>
>                 Key: LUCENE-1252
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1252
>             Project: Lucene - Core
>          Issue Type: Wish
>          Components: core/search
>            Reporter: Paul Elschot
>            Priority: Minor
>              Labels: gsoc2014
>
> In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, 
> currently next() and skipTo() will use position information even when other 
> parts of the query cannot match because some required terms are not present.
> This could be avoided by adding some methods to Scorer that relax the 
> postcondition of next() and skipTo() to something like "all required terms 
> are present, but no position info was checked yet", and implementing these 
> methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and 
> SpanScorer/NearSpans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
