[ 
https://issues.apache.org/jira/browse/LUCENE-6819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870559#comment-15870559
 ] 

Adrien Grand commented on LUCENE-6819:
--------------------------------------

I agree index-time and search-time boosting have different trade-offs that may 
both be interesting. The problem I have is that supporting index-time boosts 
means that length norm is less accurate for _everyone_. Right now if you do not 
use index-time boosts, which I think is the case for a majority of users, you 
end up with a length norm that is between 0 and 1 ({{1/sqrt(fieldLen)}}). The 
length norm may only be greater than 1 if you use a boost that is greater than 
1. Out of the 256 values that {{SmallFloat.byte315ToFloat}} supports, only 125 
of them are less than or equal to 1, the other 131 values are all greater than 
1. In other words, more than half of the norm values we support are wasted if 
you do not use index-time boosts.
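
To make the count above easy to check, here is a minimal sketch that decodes all 256 byte values and counts how many are at most 1. The class and method names are hypothetical, and the decoding is my own re-implementation of the shift-based layout (3 mantissa bits, zero exponent 15), not a copy of {{SmallFloat}}. The 4-mantissa-bit variant is an assumption for comparison only.

```java
public class NormRangeSketch {

    // Decode one byte into a float by shifting it into the upper bits of an
    // IEEE-754 float. Assumed to mirror the 3/15 layout discussed above.
    public static float byteToFloat(int b, int numMantissaBits, int zeroExp) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - numMantissaBits);
        bits += (63 - zeroExp) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Count how many of the 256 encodable values are <= 1.
    public static int countAtMostOne(int numMantissaBits, int zeroExp) {
        int count = 0;
        for (int b = 0; b < 256; b++) {
            if (byteToFloat(b, numMantissaBits, zeroExp) <= 1.0f) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("3 mantissa bits: " + countAtMostOne(3, 15) + " of 256 values are <= 1");
        System.out.println("4 mantissa bits: " + countAtMostOne(4, 15) + " of 256 values are <= 1");
    }
}
```

With 3 mantissa bits this should report 125, which is the figure quoted above. With 4 mantissa bits nearly the whole byte range falls at or below 1.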

If instead we could assume that norms were always between 0 and 1, we could 
take one bit from the exponent and spend it on the mantissa instead to improve 
accuracy. For instance, I rebuilt the table that had been built for LUCENE-5005 
and expanded it with a couple more length values, as well as what the rounded 
norm would be if we spent 1 more bit on the mantissa (while still being able to 
encode the norm on a single byte, see the float415 column):

||numTerms||1/sqrt(numTerms)||1/sqrt(numTerms) to float315||1/sqrt(numTerms) to float415||
| 1 | 1.0 | 1.0 | 1.0 |
| 2 | 0.70710677 | 0.625 | 0.6875 |
| 3 | 0.57735026 | 0.5 | 0.5625 |
| 4 | 0.5 | 0.5 | 0.5 |
| 5 | 0.4472136 | 0.4375 | 0.4375 |
| 6 | 0.4082483 | 0.375 | 0.40625 |
| 7 | 0.37796447 | 0.375 | 0.375 |
| 8 | 0.35355338 | 0.3125 | 0.34375 |
| 9 | 0.33333334 | 0.3125 | 0.3125 |
| 10 | 0.31622776 | 0.3125 | 0.3125 |
| 11 | 0.30151135 | 0.25 | 0.28125 |
| 12 | 0.28867513 | 0.25 | 0.28125 |
| 13 | 0.2773501 | 0.25 | 0.25 |
| 14 | 0.26726124 | 0.25 | 0.25 |
| 15 | 0.2581989 | 0.25 | 0.25 |
| 16 | 0.25 | 0.25 | 0.25 |
| 17 | 0.24253562 | 0.21875 | 0.234375 |
| 18 | 0.23570226 | 0.21875 | 0.234375 |
| 19 | 0.22941573 | 0.21875 | 0.21875 |
| 20 | 0.2236068 | 0.21875 | 0.21875 |

Something I really like about it is that for all length values from 1 to 9 
inclusive, you get different values for the rounded norms. I have seen several 
users asking why "A B C D" would score as well as "A B C" when the query is 
e.g. "A", in spite of being longer. If we could get this addressed for short 
fields (think e.g. product names), I think that would be a great win.
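
For anyone who wants to reproduce the rounded norms, here is a rough sketch that truncates {{1/sqrt(numTerms)}} to a single byte and decodes it again, then looks for the first length whose rounded norm equals that of the previous length. The class and method names are hypothetical. The encoding is a re-implementation of the shift-and-truncate idea under the assumption of a zero exponent of 15, and the 4-mantissa-bit case stands in for the float415 column.

```java
public class ShortFieldNormSketch {

    // Truncate a float to one byte (returned as an int in 0..255).
    public static int floatToByte(float f, int numMantissaBits, int zeroExp) {
        int fzero = (63 - zeroExp) << numMantissaBits;
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - numMantissaBits);
        if (small <= fzero) {
            return bits <= 0 ? 0 : 1;      // zero/negative, or underflow to smallest value
        }
        if (small >= fzero + 0x100) {
            return 0xff;                   // overflow: clamp to largest value
        }
        return small - fzero;
    }

    // Inverse of floatToByte.
    public static float byteToFloat(int b, int numMantissaBits, int zeroExp) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - numMantissaBits);
        bits += (63 - zeroExp) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Length norm after a round trip through the one-byte encoding.
    public static float roundedNorm(int numTerms, int numMantissaBits) {
        float norm = (float) (1.0 / Math.sqrt(numTerms));
        return byteToFloat(floatToByte(norm, numMantissaBits, 15), numMantissaBits, 15);
    }

    // Smallest length whose rounded norm equals that of the previous length,
    // or -1 if none is found up to 1000 terms.
    public static int firstCollision(int numMantissaBits) {
        for (int n = 2; n <= 1000; n++) {
            if (roundedNorm(n, numMantissaBits) == roundedNorm(n - 1, numMantissaBits)) {
                return n;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 20; n++) {
            System.out.println(n + " | " + roundedNorm(n, 3) + " | " + roundedNorm(n, 4));
        }
        System.out.println("first collision, 3 mantissa bits: " + firstCollision(3));
        System.out.println("first collision, 4 mantissa bits: " + firstCollision(4));
    }
}
```

If the sketch matches the table, the 3-bit layout first collides at 4 terms (the "A B C" versus "A B C D" case), while the 4-bit layout keeps lengths 1 through 9 distinct and first collides at 10.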

> Deprecate index-time boosts?
> ----------------------------
>
>                 Key: LUCENE-6819
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6819
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Follow-up of this comment: 
> https://issues.apache.org/jira/browse/LUCENE-6818?focusedCommentId=14934801&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14934801
> Index-time boosts are a very expert feature whose behaviour is tied to the 
> Similarity impl. Additionally, users have often been confused by the poor 
> precision due to the fact that we encode values on a single byte. But now we 
> have doc values that allow you to encode any values the way you want with as 
> much precision as you need, so maybe we should deprecate index-time boosts and 
> recommend encoding index-time scoring factors into doc values fields instead.


