[jira] [Commented] (LUCENE-4198) Allow codecs to index term impacts

Adrien Grand (JIRA) Wed, 24 Jan 2018 07:10:32 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337720#comment-16337720
 ]


Adrien Grand commented on LUCENE-4198:
--------------------------------------

Thanks Jim for having a look!

bq. the SlowImpactEnum returns NO_MORE_DOCS when advanceShallow is used; is 
that allowed? (The contract is that this API should always return a docId 
greater than the current doc.)

I had a look at the patch and the contract says {{gte}}, so that should be 
fine? When working on the patch I went back and forth between requiring that 
upTo be gt or gte the current doc, and settled on gte, which made the API 
easier to use. Otherwise I would have needed to make it illegal to call it 
once positioned on NO_MORE_DOCS, which would make it harder to use in scorers.
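
To make the gte contract concrete, here is a minimal sketch. It is not the 
code from the patch: the class name, the constructor, and the 
lastDocOfBlock array are hypothetical, and only the sentinel value mirrors 
DocIdSetIterator.NO_MORE_DOCS.

```java
final class AdvanceShallowSketch {
    // Same sentinel value as Lucene's DocIdSetIterator.NO_MORE_DOCS.
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Hypothetical: the last doc ID of each block, in increasing order.
    private final int[] lastDocOfBlock;

    AdvanceShallowSketch(int... lastDocOfBlock) {
        this.lastDocOfBlock = lastDocOfBlock.clone();
    }

    /** Returns upTo >= target: the end of the block containing target, or NO_MORE_DOCS. */
    int advanceShallow(int target) {
        for (int lastDoc : lastDocOfBlock) {
            if (lastDoc >= target) {
                return lastDoc; // may equal target, which gte allows
            }
        }
        return NO_MORE_DOCS; // also the answer when target is already NO_MORE_DOCS
    }
}
```

With gte, a scorer can call advanceShallow(NO_MORE_DOCS) and get 
NO_MORE_DOCS back without a special case; with gt, that call would have no 
valid return value.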

bq. Why do you need to compute the impact lazily? Is it for queries that don't 
use score for sorting?

Queries that don't use the score for sorting should be ok: they won't be using 
an ImpactsEnum anyway. Laziness doesn't help term queries; this is more for 
conjunctions (and disjunctions that progressively turn into conjunctions like 
WANDScorer). Imagine a conjunction between two term queries: one that matches 
lots of docs and the other one that matches about 100x fewer documents. The 
latter will be used to lead the iteration and the former will be used to 
confirm matches. It is quite likely that almost every time it is advanced, the 
clause with a higher cost will need to decode an entire block, which also 
involves computing scores for all competitive (freq,norm) pairs. If there is a 
match, this CPU time is not necessarily wasted, but it is also quite likely 
that there is no match, in which case the computed impacts are useless.
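
A rough sketch of the effect, under simplifying assumptions: the scoring 
function, the (freq,norm) pairs, and the simulate helper below are all 
hypothetical and only count how often a score gets computed. They are not 
taken from the patch or from any Similarity.

```java
import java.util.function.Supplier;

final class LazyImpactsSketch {
    static int scoreComputations = 0;

    // Hypothetical scoring function for one (freq, norm) pair.
    static float score(int freq, int norm) {
        scoreComputations++;
        return freq / (freq + (float) norm);
    }

    /** Memoizing supplier: the max score of a block is computed on the first get() only. */
    static Supplier<Float> lazyMaxScore(int[][] freqNormPairs) {
        return new Supplier<Float>() {
            private Float cached;

            @Override
            public Float get() {
                if (cached == null) {
                    float max = 0f;
                    for (int[] pair : freqNormPairs) {
                        max = Math.max(max, score(pair[0], pair[1]));
                    }
                    cached = max;
                }
                return cached;
            }
        };
    }

    /** Advances the high-cost clause; only every matchEvery-th advance asks for the score. */
    static int simulate(int advances, int matchEvery) {
        scoreComputations = 0;
        int[][] pairs = {{1, 4}, {3, 10}, {7, 30}}; // made-up competitive pairs
        for (int i = 1; i <= advances; i++) {
            Supplier<Float> maxScore = lazyMaxScore(pairs); // block decoded, nothing scored yet
            if (i % matchEvery == 0) {
                maxScore.get();
            }
        }
        return scoreComputations;
    }
}
```

In this toy setup, simulate(100, 10) scores 30 pairs, whereas computing 
eagerly on every decoded block would score 300.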

Conjunctions do use impacts, but they use the block boundaries (the return 
value of advanceShallow) of the clause that has the higher score contribution 
(see LUCENE-8135). So in the case of the conjunction described above, we will 
likely need only the impacts computed on the second level for the clause with 
the higher cost, which are stored every 8 blocks (instead of every block for 
the first level). Laziness helps skip computing scores on the lower levels if 
they are unused.
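
As an illustration only, the two levels can be modelled as below. This is a 
simplification I am assuming for the sketch: the stored impacts are reduced 
to a single max freq per entry, the score formula is made up, and the class 
and method names are hypothetical. Only the factor of 8 comes from the 
description above.

```java
import java.util.Arrays;

final class SkipLevelsSketch {
    static final int BLOCKS_PER_LEVEL1_ENTRY = 8; // level 1 is stored every 8 blocks

    private final int[][] maxFreq = new int[2][]; // stored impacts per level (simplified)
    private final float[][] cache = new float[2][]; // lazily filled scores, NaN = not computed
    final int[] scoreComputations = new int[2];     // how many scores were computed per level

    SkipLevelsSketch(int[] blockMaxFreq) {
        maxFreq[0] = blockMaxFreq.clone();
        int entries = (blockMaxFreq.length + BLOCKS_PER_LEVEL1_ENTRY - 1) / BLOCKS_PER_LEVEL1_ENTRY;
        maxFreq[1] = new int[entries];
        for (int block = 0; block < blockMaxFreq.length; block++) {
            int entry = block / BLOCKS_PER_LEVEL1_ENTRY;
            maxFreq[1][entry] = Math.max(maxFreq[1][entry], blockMaxFreq[block]);
        }
        for (int level = 0; level < 2; level++) {
            cache[level] = new float[maxFreq[level].length];
            Arrays.fill(cache[level], Float.NaN);
        }
    }

    /** Max score for the entry covering the given block on the given level, computed on demand. */
    float maxScore(int level, int block) {
        int entry = level == 0 ? block : block / BLOCKS_PER_LEVEL1_ENTRY;
        if (Float.isNaN(cache[level][entry])) {
            scoreComputations[level]++;
            int freq = maxFreq[level][entry];
            cache[level][entry] = freq / (freq + 1f); // hypothetical score
        }
        return cache[level][entry];
    }
}
```

If a conjunction only ever asks for level 1, the level 0 counter stays at 
zero: over 16 blocks, two level 1 scores are computed and no level 0 score.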

> Allow codecs to index term impacts
> ----------------------------------
>
>                 Key: LUCENE-4198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4198
>             Project: Lucene - Core
>          Issue Type: Sub-task
>          Components: core/index
>            Reporter: Robert Muir
>            Priority: Major
>         Attachments: LUCENE-4198-BMW.patch, LUCENE-4198.patch, 
> LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, 
> LUCENE-4198_flush.patch
>
>
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (though, his 
> implementation currently stores a max for the entire term, the problem is the 
> same).
> We can imagine other similar algorithms too: I think the codec API should be 
> able to support these.
> Currently it really doesn't: Stefan worked around the problem by providing a 
> tool to 'rewrite' your index, he passes the IndexReader and Similarity to it. 
> But it would be better if we fixed the codec API.
> One problem is that the Postings writer needs to have access to the 
> Similarity. Another problem is that it needs access to the term and 
> collection statistics up front, rather than after the fact.
> This might have some cost (hopefully minimal), so I'm thinking to experiment 
> in a branch with these changes and see if we can make it work well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
