[
https://issues.apache.org/jira/browse/LUCENE-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256760#comment-13256760
]
Robert Muir commented on LUCENE-3957:
-------------------------------------
I don't understand why it's considered long-winded; it's documented in tons of
places in Lucene. In fact, it's actually over-specified in the file formats
documentation, because even in 3.5 the encoding of the normalization byte is an
implementation detail of the Similarity: the only constraint is that you can
use just a single byte.
In trunk it's definitely over-specified since, besides the above, the
Similarity can use more than a byte if it wants to.
1. Main website (scoring):
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/scoring.html
{noformat}
Indexing time boosts are preprocessed for storage efficiency and written to the
directory (when writing the document) in a single byte (!) as follows.
...
This composition of 1-byte representation of norms...
...
Encoding and decoding of the resulted float norm in a single byte are done by
the static methods of the class Similarity: encodeNorm() and decodeNorm(). Due
to loss of precision, it is not guaranteed that decode(encode(x)) = x, e.g.
decode(encode(0.89)) = 0.75. At scoring (search) time, this norm is brought
into the score of document as norm(t, d), as shown by the formula in
Similarity.
{noformat}
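The precision loss described in this excerpt can be reproduced with a small
sketch. The class below is hypothetical (NormByteSketch, encode and decode are
illustrative names, not Lucene API). Its bit arithmetic is modeled on what
SmallFloat.floatToByte315 and byte315ToFloat appear to do in Lucene 3.x, so
treat it as an assumption rather than the library's exact code. It uses the
boost values 8.0 and 9.0 from the issue description quoted further down.

```java
// Hypothetical sketch; not part of Lucene. The bit layout is an assumption
// modeled on SmallFloat.floatToByte315 / byte315ToFloat in Lucene 3.x.
final class NormByteSketch {
    // Offset between the IEEE-754 exponent range and the byte's range.
    private static final int ZERO_OFFSET = (63 - 15) << 3;

    /** Lossy float-to-byte encoding: keeps only the top bits of the float. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3); // discard the low 21 bits
        if (small <= ZERO_OFFSET) {
            // Zero and negatives map to 0; tiny positives to the smallest code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO_OFFSET + 0x100) {
            return (byte) -1; // overflow maps to the largest code
        }
        return (byte) (small - ZERO_OFFSET);
    }

    /** Expands a code back to a float; the discarded bits come back as zeros. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        for (float boost : new float[] {8.0f, 9.0f, 10.0f}) {
            System.out.println(boost + " -> " + decode(encode(boost)));
        }
    }
}
```

Under these assumptions, 8.0 and 9.0 encode to the same byte and both decode to
8.0, while 10.0 survives the round trip unchanged.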
2. Main website (file formats):
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html#Normalization%20Factors
{noformat}
Each byte encodes a floating point value. Bits 0-2 contain the 3-bit mantissa,
and bits 3-8 contain the 5-bit exponent.
These are converted to an IEEE single float value as follows:
...
{noformat}
3. Javadocs (Similarity):
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/core/org/apache/lucene/search/Similarity.html
{noformat}
However the resulted norm value is encoded as a single byte before being
stored. At search time, the norm byte value is read from the index directory
and decoded back to a float norm value. This encoding/decoding, while reducing
index size, comes with the price of precision loss...
Compression of norm values to a single byte saves memory at search time,
because once a field is referenced at search time, its norms - for all
documents - are maintained in memory.
The rationale supporting such lossy compression of norm values is that given
the difficulty (and inaccuracy) of users to express their true information need
by a query, only big differences matter.
{noformat}
> Document precision requirements of setBoost calls
> -------------------------------------------------
>
> Key: LUCENE-3957
> URL: https://issues.apache.org/jira/browse/LUCENE-3957
> Project: Lucene - Java
> Issue Type: Improvement
> Components: general/javadocs
> Affects Versions: 3.5
> Reporter: Jordi Salvat i Alabart
>
> The behaviour of index-time boosts seems pretty erratic (e.g. a boost of 8.0
> produces the exact same score as a boost of 9.0) until you become aware that
> these factors end up encoded in a single byte, with a three-bit mantissa.
> This consumed a whole day of research for us, and I still believe we were
> lucky to spot it, given how deeply dug into the code & documentation this
> information is.
> I suggest adding a small note to the JavaDoc of setBoost methods in Document,
> Fieldable, FieldInvertState, and possibly AbstractField, Field, and
> NumericField.
> Suggested text:
> "Note that all index-time boost values end up encoded using
> Similarity.encodeNormValue, with a 3-bit mantissa -- so differences in the
> boost value of less than 25% may easily be rounded away."
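To get a feel for the "less than 25%" figure in the suggested text, one can ask
how far apart neighboring encodable values are. The sketch below is
hypothetical: BoostStepSketch and nextRepresentable are illustrative names, and
the encode/decode arithmetic is again an assumption modeled on Lucene 3.x's
SmallFloat.floatToByte315 and byte315ToFloat. It also treats the boost in
isolation, whereas the stored norm additionally folds in length normalization.

```java
// Hypothetical sketch; not part of Lucene. Encoding assumed to follow
// SmallFloat.floatToByte315 / byte315ToFloat from Lucene 3.x.
final class BoostStepSketch {
    private static final int ZERO_OFFSET = (63 - 15) << 3;

    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ZERO_OFFSET) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO_OFFSET + 0x100) {
            return (byte) -1;
        }
        return (byte) (small - ZERO_OFFSET);
    }

    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = ((b & 0xff) << (24 - 3)) + ((63 - 15) << 24);
        return Float.intBitsToFloat(bits);
    }

    /** Next encodable value above the one that f is rounded down to. */
    static float nextRepresentable(float f) {
        int code = encode(f) & 0xff;
        if (code == 0xff) {
            return decode((byte) -1); // already at the largest code
        }
        return decode((byte) (code + 1));
    }

    public static void main(String[] args) {
        for (float boost : new float[] {8.0f, 9.0f, 14.0f}) {
            System.out.println(boost + " is stored as " + decode(encode(boost))
                + "; next distinguishable value is " + nextRepresentable(boost));
        }
    }
}
```

With this encoding, a boost of 8.0 or 9.0 is stored as 8.0 and the next
distinguishable value is 10.0 (a 25% step), while 14.0 is followed by 16.0
(roughly a 14% step).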