[jira] [Commented] (LUCENE-7730) Better encode length normalization in similarities

Robert Muir (JIRA) Tue, 16 May 2017 16:34:18 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013263#comment-16013263
 ]


Robert Muir commented on LUCENE-7730:
-------------------------------------

+1

This solves a hairy problem in a non-intrusive way and is a much better 
tradeoff for users. I ran some basic relevance tests and it all checks out, 
including 6.x back compat. I see the typical 1% difference in this corpus that I 
would see vs. using e.g. a 32-bit integer. But for e.g. very small docs, users 
will be much happier and less likely to complain about the quantization to a 
single byte.
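
To make the single-byte quantization concrete, here is a minimal sketch of one 
way a document length could be squeezed into a byte-sized code. It is not the 
encoding from the attached patch: the class name LengthQuantizer, the 3-bit 
mantissa, and the exact/lossy cutoff are all assumptions chosen for 
illustration. Lengths up to 15 round-trip exactly, and longer lengths keep only 
their top four significant bits, so they decode to a slightly smaller value.

```java
public class LengthQuantizer {
    // Hypothetical scheme for illustration only (not Lucene's actual code):
    // lengths up to 15 round-trip exactly; larger lengths keep only their
    // top 4 significant bits (1 implicit bit + 3 mantissa bits).
    static int encode(int length) {
        if (length < 0) {
            throw new IllegalArgumentException("length must be >= 0");
        }
        if (length < 8) {
            return length; // small values are stored as-is
        }
        int highestBit = 31 - Integer.numberOfLeadingZeros(length);
        int shift = highestBit - 3;               // low bits that get dropped
        int mantissa = (length >>> shift) & 0x07; // 3 bits below the leading 1
        return ((shift + 1) << 3) | mantissa;     // result is in 0..231
    }

    // Inverse of encode, up to the bits that were dropped.
    static int decode(int code) {
        if (code < 8) {
            return code;
        }
        int shift = (code >>> 3) - 1;
        int mantissa = code & 0x07;
        return (0x08 | mantissa) << shift;        // restore the implicit leading 1
    }

    public static void main(String[] args) {
        // Print original length and the value recovered after one round trip.
        for (int length : new int[] {3, 15, 17, 1000}) {
            System.out.println(length + " -> " + decode(encode(length)));
        }
    }
}
```

In this sketch a short document never loses any length information, while a 
long one is off by less than one eighth of its true length, which is the kind 
of tradeoff the comment above is describing.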

I think it is fine to move TFIDFSimilarity/ClassicSimilarity to misc/. Another 
option is to fold them into one class and clean up the abstractions, fix them 
to use this encoding too. TFIDFSimilarity was really just a migration thing 
(it's basically the pre-4.x Similarity API). It is kinda like a rotting 
abstraction/tech debt since it has fallen behind. But I think these days for a 
custom TF/IDF-like scoring, you'd just use Similarity or SimilarityBase so that 
you have all the index statistics and so on? Worth a thought.

When can the old tables and backwards compatibility logic be removed from e.g. 
BM25Similarity? I think that part is important.

> Better encode length normalization in similarities
> --------------------------------------------------
>
>                 Key: LUCENE-7730
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7730
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>         Attachments: LUCENE-7730.patch, LUCENE-7730.patch, LUCENE-7730.patch
>
>
> Now that index-time boosts are gone (LUCENE-6819) and that indices record the 
> version that was used to create them (for backward compatibility, 
> LUCENE-7703), we can look into storing the length normalization factor more 
> efficiently.



