[
https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337720#comment-16337720
]
Adrien Grand commented on LUCENE-4198:
--------------------------------------
Thanks Jim for having a look!
bq. the SlowImpactEnum returns NO_MORE_DOCS when advanceShallow is used, is it
allowed (the contract is that this API should always return a docId greater
than the current doc?)
I had a look at the patch and it says {{gte}}, so that should be fine? When
working on the patch I went back and forth between requiring that upTo be
either gt or gte the current doc, and settled on gte, which made the API
easier to use. Otherwise I would have needed to make it illegal to call
advanceShallow once the iterator is on NO_MORE_DOCS, which makes it harder to
use in scorers.
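To make the gte contract concrete, here is a minimal sketch. It is not the code from the patch: the class {{ShallowCursorSketch}} and its array of block boundaries are hypothetical, and only the {{NO_MORE_DOCS}} sentinel value matches Lucene. It shows that with gte, calling advanceShallow on a block boundary or on NO_MORE_DOCS itself stays legal.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a postings cursor; not the Lucene API. */
class ShallowCursorSketch {
    // Same sentinel value that Lucene uses for exhausted iterators.
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Last doc ID of each block, strictly ascending (assumed layout).
    private final int[] blockLastDocs;

    ShallowCursorSketch(int[] blockLastDocs) {
        this.blockLastDocs = blockLastDocs.clone();
    }

    /**
     * Returns upTo >= target: the last doc ID of the first block that ends
     * at or after target, or NO_MORE_DOCS when target is past every block.
     */
    int advanceShallow(int target) {
        int idx = Arrays.binarySearch(blockLastDocs, target);
        if (idx < 0) {
            idx = -idx - 1; // insertion point: first block whose last doc > target
        }
        return idx < blockLastDocs.length ? blockLastDocs[idx] : NO_MORE_DOCS;
    }
}
```

Under a strict gt contract, the call with target equal to NO_MORE_DOCS would have no valid return value, which is why a scorer would have had to guard every call.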
bq. Why do you need to compute the impact lazily? Is it for queries that don't
use score for sorting?
Queries that don't use the score for sorting should be ok: they won't be using
an ImpactsEnum anyway. Laziness doesn't help term queries; this is more for
conjunctions (and disjunctions that progressively turn into conjunctions like
WANDScorer). Imagine a conjunction between two term queries: one that matches
lots of docs and the other one that matches about 100x fewer documents. The
latter will be used to lead the iteration and the former will be used to
confirm matches. It is quite likely that almost every time it is advanced, the
clause with a higher cost will need to decode an entire block, which also
involves computing scores for all competitive (freq,norm) pairs. If there is a
match, this CPU time is not necessarily wasted, but it is also quite likely
that there is no match, in which case the computed impacts are useless.
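A rough sketch of the laziness described above, assuming a hypothetical {{LazyBlockImpactsSketch}} class and {{SimScorer}} interface (neither is taken from the patch): decoding a block only stores the (freq,norm) pairs, and scores are computed the first time a maximum score is actually requested.

```java
/** Hypothetical per-block impacts holder; not the Lucene API. */
class LazyBlockImpactsSketch {
    /** Hypothetical scoring function over a (freq, norm) pair. */
    interface SimScorer {
        float score(int freq, long norm);
    }

    private final int[] freqs;      // competitive frequencies of the block
    private final long[] norms;     // matching norms, same length as freqs
    private final SimScorer scorer;
    private float maxScore;         // valid only once computed is true
    private boolean computed = false;
    int scoreCalls = 0;             // counts scorer invocations, for illustration

    LazyBlockImpactsSketch(int[] freqs, long[] norms, SimScorer scorer) {
        // Decoding the block stops here: no score is computed yet.
        this.freqs = freqs.clone();
        this.norms = norms.clone();
        this.scorer = scorer;
    }

    /** Computes the block's maximum score on first use, then caches it. */
    float getMaxScore() {
        if (!computed) {
            float max = 0f;
            for (int i = 0; i < freqs.length; i++) {
                max = Math.max(max, scorer.score(freqs[i], norms[i]));
                scoreCalls++;
            }
            maxScore = max;
            computed = true;
        }
        return maxScore;
    }
}
```

If the lead clause finds no match in the block, {{getMaxScore}} is never called and {{scoreCalls}} stays at zero, which is the CPU time the laziness is meant to save.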
Conjunctions do use impacts, but they use the block boundaries (the return
value of advanceShallow) of the clause that has the higher score contribution
(see LUCENE-8135). So in the case of the conjunction described above, we will
likely need only the impacts that are computed on the second level for the
clause with a higher cost, which are stored every 8 blocks (instead of every
block for the first level). Laziness helps skip computing scores on the lower
levels if they are unused.
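The level selection can be sketched as follows. This is an illustration under assumptions, not the patch's code: {{TwoLevelImpactsSketch}} is hypothetical, and the boundaries used in the usage note assume 128-doc blocks with a second level every 8 blocks. A request picks the lowest level whose boundary covers the requested upTo and computes only that level's score.

```java
import java.util.function.DoubleSupplier;

/** Hypothetical multi-level impacts with per-level lazy scores; not the Lucene API. */
class TwoLevelImpactsSketch {
    private final int[] levelUpTo;             // last doc ID covered by each level, ascending
    private final DoubleSupplier[] levelScore; // deferred max-score computation per level
    private final double[] cache;              // cached max score per level
    private final boolean[] computed;          // whether cache[level] is valid

    TwoLevelImpactsSketch(int[] levelUpTo, DoubleSupplier[] levelScore) {
        this.levelUpTo = levelUpTo.clone();
        this.levelScore = levelScore.clone();
        this.cache = new double[levelUpTo.length];
        this.computed = new boolean[levelUpTo.length];
    }

    /** Lowest level whose boundary is >= upTo, or -1 if no level covers it. */
    int levelFor(int upTo) {
        for (int level = 0; level < levelUpTo.length; level++) {
            if (levelUpTo[level] >= upTo) {
                return level;
            }
        }
        return -1;
    }

    /** Max score for docs through upTo; only the selected level is computed. */
    double maxScoreUpTo(int upTo) {
        int level = levelFor(upTo);
        if (level < 0) {
            return Double.POSITIVE_INFINITY; // no stored bound covers this range
        }
        if (!computed[level]) {
            cache[level] = levelScore[level].getAsDouble();
            computed[level] = true;
        }
        return cache[level];
    }

    boolean isComputed(int level) {
        return computed[level];
    }
}
```

With boundaries of 127 for the first level and 1023 for the second, a call such as {{maxScoreUpTo(900)}} resolves to the second level and leaves the first level's scores uncomputed.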
> Allow codecs to index term impacts
> ----------------------------------
>
> Key: LUCENE-4198
> URL: https://issues.apache.org/jira/browse/LUCENE-4198
> Project: Lucene - Core
> Issue Type: Sub-task
> Components: core/index
> Reporter: Robert Muir
> Priority: Major
> Attachments: LUCENE-4198-BMW.patch, LUCENE-4198.patch,
> LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch,
> LUCENE-4198_flush.patch
>
>
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (though, his
> implementation currently stores a max for the entire term, the problem is the
> same).
> We can imagine other similar algorithms too: I think the codec API should be
> able to support these.
> Currently it really doesn't: Stefan worked around the problem by providing a
> tool to 'rewrite' your index, he passes the IndexReader and Similarity to it.
> But it would be better if we fixed the codec API.
> One problem is that the Postings writer needs to have access to the
> Similarity. Another problem is that it needs access to the term and
> collection statistics up front, rather than after the fact.
> This might have some cost (hopefully minimal), so I'm thinking to experiment
> in a branch with these changes and see if we can make it work well.