[ 
https://issues.apache.org/jira/browse/LUCENE-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16617541#comment-16617541
 ] 

Adrien Grand commented on LUCENE-8501:
--------------------------------------

bq. In my case I'd be a bit worried about the loss of precision with the 16 bit 
encoding

Do you know how many values per field you expect at most? For instance, using 
24 bits by shifting the bits of the float representation right by 7 instead of 
15 would retain more accuracy while still allowing for about 128 values per 
field per document before the summed length overflows an int (2^31 / 2^24 = 
128). In general scoring doesn't focus on accuracy: we are happy with recording 
lengths on a single byte, using Math.log(1+x) rather than Math.log1p(x), or 
tweaking scoring formulas to add ones if it can help avoid dividing by zero. 
Better accuracy doesn't improve ranking significantly.
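
To make the arithmetic concrete, here is a rough sketch of the shift-based 
encoding. It is only an illustration, not Lucene code: the class and method 
names (FloatFreqSketch, encode, decode, maxValuesPerField) are made up for the 
example, and it assumes frequencies are finite, non-negative floats so that the 
sign bit is always zero.

```java
final class FloatFreqSketch {

    // Hypothetical helper: keep the top (31 - shift) bits of a non-negative float.
    static int encode(float value, int shift) {
        int bits = Float.floatToIntBits(value);
        // A negative bit pattern means the sign bit is set (this also catches -0.0f).
        if (bits < 0 || !Float.isFinite(value)) {
            throw new IllegalArgumentException("expected a finite, non-negative float");
        }
        return bits >>> shift;
    }

    // Hypothetical helper: restore an approximation; the dropped low bits come back as zeros.
    static float decode(int encoded, int shift) {
        return Float.intBitsToFloat(encoded << shift);
    }

    // Conservative bound on how many encoded values can be summed into an int:
    // assume every value has all (31 - shift) bits set.
    static int maxValuesPerField(int shift) {
        int maxEncoded = (1 << (31 - shift)) - 1;
        return Integer.MAX_VALUE / maxEncoded;
    }

    public static void main(String[] args) {
        float freq = 0.1f;
        System.out.println("16-bit round trip: " + decode(encode(freq, 15), 15));
        System.out.println("24-bit round trip: " + decode(encode(freq, 7), 7));
        System.out.println("values per field at 16 bits: " + maxValuesPerField(15));
        System.out.println("values per field at 24 bits: " + maxValuesPerField(7));

        // Summing one value too many trips the same check the indexing chain relies on.
        int maxEncoded = (1 << 24) - 1;
        int length = 0;
        try {
            for (int i = 0; i < maxValuesPerField(7) + 1; i++) {
                length = Math.addExact(length, maxEncoded);
            }
        } catch (ArithmeticException e) {
            System.out.println("overflow after " + maxValuesPerField(7) + " values");
        }
    }
}
```

The 24-bit variant keeps 8 more mantissa bits than the 16-bit one, at the cost 
of a much lower bound on the number of values that can be summed per field.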

bq. I'm still wondering would it make sense to allow the users to overwrite the 
sum function for different use-cases.

It might... but such extension points have a significant impact on the API and 
testing. In general we'd rather not add them unless there is a strong case to 
introduce them. There are also ramifications: if we change the way that the 
length is computed, then we also need to change the way that frequencies are 
combined when a field has the same value twice, and we need to worry about how 
to reflect it in index statistics like totalTermFreq and sumTotalTermFreq, etc.



> An ability to define the sum method for custom term frequencies
> ---------------------------------------------------------------
>
>                 Key: LUCENE-8501
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8501
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Olli Kuonanoja
>            Priority: Major
>
> Custom term frequencies allow expert users to index and score in custom 
> ways; however, _DefaultIndexingChain_ adds a limitation to this, as the sum 
> of frequencies can't overflow:
> {code:java}
> try {
>     invertState.length = Math.addExact(
>         invertState.length,
>         invertState.termFreqAttribute.getTermFrequency());
> } catch (ArithmeticException ae) {
>     throw new IllegalArgumentException(
>         "too many tokens for field \"" + field.name() + "\"");
> }
> {code}
> This might become an issue if for example the frequency data is encoded in a 
> different way, say the specific scorer works with float frequencies.
> The sum method can be added to _TermFrequencyAttribute_ to get something like
> {code:java}
> invertState.length =
>     invertState.termFreqAttribute.addFrequency(invertState.length);
> {code}
> so users may define the summing method and avoid the overflow exceptions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

