Hi Adrien,
Thank you for the great explanation!
Koji
On 2017/08/22 19:36, Adrien Grand wrote:
Yes, LUCENE-7730 is the issue.
On Tue, Aug 22, 2017 at 12:00, Koji Sekiguchi <[email protected]
<mailto:[email protected]>> wrote:
I thought LUCENE-6819 removed the single-byte float as well because, to describe the background of
the ticket, you mentioned its poor precision. So I thought the ticket solved it (from the context).
So the field length is still stored in a single byte, and the precision of the float is still not
good? And the point of LUCENE-6819 is that we can set a more precise boost value if we want,
because it no longer depends on the poor-precision single byte used for the field length?
We still use a single byte to store the norm. The difference is that we used to store
${index-boost} * ${length-norm}. Because index boosts could take any positive value, we could
not make any assumptions about this quantity that would have helped make storage more efficient.
More concretely, length-norm was always between 0 and 1, so if you did not use index boosts, like
most Lucene users, then the final normalization factor would be between 0 and 1 as well. Yet only
125 of the 256 byte values in the SmallFloat encoding that we used represent values between 0 and 1.
So this feature was trading away accuracy of the length normalization factor in favor of a feature
that was only used by a minority and could easily be replaced by a doc-value field.
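
To make the "125 out of 256" figure concrete, here is a minimal sketch that decodes every byte value and counts how many land in [0, 1]. It assumes the encoding in question is the SmallFloat variant with 3 mantissa bits, 5 exponent bits, and a zero exponent of 15; decodeNorm is a standalone re-implementation written for illustration, and the class and method names are hypothetical, not Lucene API.

```java
public class NormRangeDemo {

    // Illustrative decoder (assumed layout: 3 mantissa bits, 5 exponent bits,
    // zero exponent 15). Written for this example, not copied from Lucene.
    static float decodeNorm(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        // Place the 8 encoded bits just below the float sign/high-exponent bits,
        // then add the exponent bias offset.
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Counts how many of the 256 byte values decode to a float in [0, 1].
    static int countInUnitRange() {
        int count = 0;
        for (int i = 0; i < 256; i++) {
            float f = decodeNorm((byte) i);
            if (f >= 0.0f && f <= 1.0f) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countInUnitRange() + " of 256 byte values decode into [0, 1]");
    }
}
```

Under that assumption, byte 124 decodes to exactly 1.0, so bytes 0 through 124 cover [0, 1] and the remaining 131 values are only reachable when an index boost pushes the product above 1.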
We actually went a bit further and started storing the document length rather than the precomputed
length-normalization factor in the norms field. It is easier to reason about since we know all
values are positive integers and that we want better accuracy for lower values. This
allowed us to encode lengths accurately up to 40, whereas the previous encoding considered 3
and 4 to be the same length, for instance. Beyond that, accuracy degrades progressively, as you can
see on the LUCENE-7730 ticket.
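
The collision between lengths 3 and 4 under the previous encoding can be reproduced with a short sketch. It rests on two assumptions that are not stated above: that the length norm was computed as 1 / sqrt(length), and that the byte encoding was the SmallFloat variant with 3 mantissa bits and a zero exponent of 15. encodeNorm and encodeLength are hypothetical helpers re-implemented for illustration, not Lucene API, and the sketch covers only the old scheme, not the new length encoding.

```java
public class OldNormCollisionDemo {

    // Illustrative encoder (assumed layout: 3 mantissa bits, 5 exponent bits,
    // zero exponent 15). Written for this example, not copied from Lucene.
    static byte encodeNorm(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ((63 - 15) << 3)) {
            // Underflow: map non-positive values to 0, tiny positives to 1.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ((63 - 15) << 3) + 0x100) {
            // Overflow: saturate at the largest encodable value.
            return -1;
        }
        return (byte) (small - ((63 - 15) << 3));
    }

    // Assumed old-style norm: 1 / sqrt(length), with no index boost applied.
    static byte encodeLength(int length) {
        return encodeNorm((float) (1.0 / Math.sqrt(length)));
    }

    public static void main(String[] args) {
        for (int length = 1; length <= 8; length++) {
            System.out.println("length " + length + " -> byte " + (encodeLength(length) & 0xff));
        }
    }
}
```

With these assumptions, lengths 3 and 4 map to the same byte while lengths 1 and 2 each get their own, which is the loss of accuracy at small lengths described above.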