[ 
https://issues.apache.org/jira/browse/LUCENE-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16617478#comment-16617478
 ] 

Olli Kuonanoja commented on LUCENE-8501:
----------------------------------------

Thanks for the pointer [~jpountz]. In my case I'd be a bit worried about the 
loss of precision with the 16-bit encoding, though I can't say how much it 
would affect the results without proper testing. Storage efficiency, however, 
has not been an issue for me in practice. One more issue I forgot to point out 
in the original description is that the value of _invertState.length_ becomes 
useless for similarities, as it is always the sum of the integer 
representations. Using a fixed-point encoding would be a workaround for that, 
but I'm still wondering whether it would make sense to allow users to override 
the sum function for different use cases.
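
To make the fixed-point workaround concrete, here is a minimal sketch. The 
class name, the choice of 10 fractional bits, and the saturating sum are all 
assumptions for illustration; none of this is existing Lucene API. The idea is 
that a float frequency is scaled to an int before indexing, so the plain 
integer sum that ends up in _invertState.length_ is still a meaningful 
(scaled) total that a similarity can decode.

```java
// Hypothetical helper, not part of Lucene: fixed-point encoding of float
// term frequencies so that their integer sum remains meaningful.
final class FixedPointFrequency {
    // Assumed resolution: 10 fractional bits, i.e. steps of 1/1024.
    static final int FRACTION_BITS = 10;
    static final int SCALE = 1 << FRACTION_BITS;

    private FixedPointFrequency() {}

    // Scale a positive float frequency to an int in [1, Integer.MAX_VALUE].
    static int encode(float frequency) {
        long scaled = Math.round((double) frequency * SCALE);
        if (scaled < 1 || scaled > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("frequency out of range: " + frequency);
        }
        return (int) scaled;
    }

    // Recover the float value from an encoded frequency or an encoded sum.
    static float decode(int encoded) {
        return (float) encoded / SCALE;
    }

    // Same behaviour as the current indexing chain: fail on int overflow.
    static int exactAdd(int length, int encodedFrequency) {
        return Math.addExact(length, encodedFrequency);
    }

    // One possible user-defined sum: clamp at Integer.MAX_VALUE instead of failing.
    static int saturatingAdd(int length, int encodedFrequency) {
        long sum = (long) length + encodedFrequency;
        return (int) Math.min(sum, Integer.MAX_VALUE);
    }
}
```

With this scale, 1.5 and 2.25 encode to 1536 and 2304, and their sum 3840 
decodes back to 3.75. The trade-off is that the usable range shrinks by the 
scale factor, which is exactly where an overridable sum would help.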

> An ability to define the sum method for custom term frequencies
> ---------------------------------------------------------------
>
>                 Key: LUCENE-8501
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8501
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Olli Kuonanoja
>            Priority: Major
>
> Custom term frequencies allow expert users to index and score in custom 
> ways. However, _DefaultIndexingChain_ limits this, because the sum of 
> frequencies must not overflow:
> {code:java}
> try {
>     invertState.length = Math.addExact(invertState.length, invertState.termFreqAttribute.getTermFrequency());
> } catch (ArithmeticException ae) {
>     throw new IllegalArgumentException("too many tokens for field \"" + field.name() + "\"");
> }
> {code}
> This might become an issue if, for example, the frequency data is encoded in 
> a different way, say because a specific scorer works with float frequencies.
> A sum method could be added to _TermFrequencyAttribute_ to get something like
> {code:java}
> invertState.length = invertState.termFreqAttribute.addFrequency(invertState.length);
> {code}
> so users may define the summing method and avoid the overflow exceptions.



