[jira] [Commented] (LUCENE-4198) Allow codecs to index term impacts

Robert Muir (JIRA) Wed, 03 Jan 2018 22:22:22 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16310818#comment-16310818
 ]


Robert Muir commented on LUCENE-4198:
-------------------------------------

{quote}
The similarity API doesn't make it easy to integrate, it currently gives a 
score(docID, freq) API while we'd rather need a score(freq,norm) API, 
especially because this optimization only works if freq and norm are the only 
per-document parameters that can influence the score.
{quote}

Well, I think it is fair game to simplify the API so it's not strange. I mean, 
we need to fix it so you can make changes like this :) A lot of the stuff in 
Similarity was geared at just hiding away the classic tf/idf stuff so that 
other things can work. But it should be the term weighting API and limited to 
that, and there are only 3 components of that: term specificity, term 
frequency, doc length.

Simple example: boosting doesn't need to be in this API. It's only there 
because it was needed for the crazy queryNorm before. But it never belonged, 
and it just adds complexity that isn't needed (and bugs if you forget to 
multiply it in).

But along the path of this change, I think it's best to change the API to 
score(freq,norm). But I don't think we should use a Long/boxing; we could just 
call score(freq,1) for the omitNorms case and that's it (similar to how we pass 
freq=1 when frequencies are omitted). Seems like it would simplify things 
there. This is already what SimilarityBase is doing internally, and it doesn't 
much matter what you substitute in there.
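
To make the score(freq,norm) idea concrete, here is a rough sketch of the 
shape I have in mind. Everything in it is hypothetical: TermScorer, 
ToyScorer, ScoringDemo and the toy formula are made up for illustration and 
are not existing Lucene classes or a real similarity. The only point is that 
the caller substitutes a primitive 1 for omitted norms (and for omitted 
frequencies), so no Long/boxing is needed.

```java
// Hypothetical sketch only: none of these types exist in Lucene.
interface TermScorer {
    // freq and norm are the only per-document inputs to the score.
    float score(float freq, long norm);
}

// Toy weighting, NOT a real similarity: the weight is computed once per
// term (term specificity, with any boost already folded in), and the
// score is dampened by the norm standing in for document length.
final class ToyScorer implements TermScorer {
    private final float termWeight;

    ToyScorer(float termWeight) {
        this.termWeight = termWeight;
    }

    @Override
    public float score(float freq, long norm) {
        return termWeight * freq / (freq + norm);
    }
}

final class ScoringDemo {
    // Value substituted when norms are omitted; a primitive, so no boxing.
    static final long OMITTED_NORM = 1L;
    // Value substituted when frequencies are omitted.
    static final float OMITTED_FREQ = 1f;

    static float scoreDoc(TermScorer scorer,
                          boolean hasFreqs, float freq,
                          boolean hasNorms, long norm) {
        float f = hasFreqs ? freq : OMITTED_FREQ;
        long n = hasNorms ? norm : OMITTED_NORM;
        return scorer.score(f, n);
    }
}
```

With this shape the scorer never needs to know whether norms or frequencies 
were omitted; that decision stays with the caller.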

> Allow codecs to index term impacts
> ----------------------------------
>
>                 Key: LUCENE-4198
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4198
>             Project: Lucene - Core
>          Issue Type: Sub-task
>          Components: core/index
>            Reporter: Robert Muir
>         Attachments: LUCENE-4198.patch, LUCENE-4198_flush.patch
>
>
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (though his 
> implementation currently stores a max for the entire term, the problem is the 
> same).
> We can imagine other similar algorithms too: I think the codec API should be 
> able to support these.
> Currently it really doesn't: Stefan worked around the problem by providing a 
> tool to 'rewrite' your index; he passes the IndexReader and Similarity to it. 
> But it would be better if we fixed the codec API.
> One problem is that the Postings writer needs to have access to the 
> Similarity. Another problem is that it needs access to the term and 
> collection statistics up front, rather than after the fact.
> This might have some cost (hopefully minimal), so I'm thinking to experiment 
> in a branch with these changes and see if we can make it work well.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
