[jira] [Commented] (LUCENE-10048) Bypass total frequency check if field uses custom term frequency

Tony Xu (Jira) Fri, 13 Aug 2021 10:27:06 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398819#comment-17398819
 ]


Tony Xu commented on LUCENE-10048:
----------------------------------

Thanks, Robert and Mike, for the context. I agree that we shouldn't offer any 
option that amounts to "allow me to corrupt my index". 

 

For this discussion I can think of two major kinds of use cases for custom term 
frequencies:

1) Users would like to override the frequency of certain tokens to boost scoring.

2) Users want to encode a generic map of term -> int, where the values may 
have nothing to do with frequency. These users are aware that scoring for this 
field won't make sense.

 

For case 1), we should keep enforcing the checks so that scoring works, which 
means leaving the field length check unchanged. For case 2), instead of 
dangerously bypassing the check, we can use the boolean flag to change field 
length accumulation from adding each term's freq to simply incrementing by one 
(counting terms). I believe this gives users flexibility without risking index 
corruption. 
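
To make the two accumulation modes concrete, here is a minimal standalone sketch. It is not Lucene's indexing chain code: the class name FieldLengthSketch, the method fieldLength, and the countTermsOnly parameter are all hypothetical names chosen for illustration, and the overflow check only models the idea of a per-field length limit.

```java
/** Hypothetical model of per-field length accumulation; NOT Lucene's actual code. */
final class FieldLengthSketch {

    /**
     * Accumulates a field length from the custom term frequencies of its tokens.
     *
     * @param termFreqs      one custom frequency per token (illustrative input)
     * @param countTermsOnly hypothetical flag: if true, each token adds 1;
     *                       if false, each token adds its frequency
     * @throws IllegalArgumentException if the accumulated length overflows an int
     */
    static int fieldLength(int[] termFreqs, boolean countTermsOnly) {
        int length = 0;
        for (int freq : termFreqs) {
            int increment = countTermsOnly ? 1 : freq;
            try {
                // Math.addExact throws ArithmeticException on int overflow.
                length = Math.addExact(length, increment);
            } catch (ArithmeticException e) {
                throw new IllegalArgumentException("field length exceeds the int range", e);
            }
        }
        return length;
    }

    public static void main(String[] args) {
        // Two tokens with very large custom frequencies, as in "foo|<very large number>".
        int[] freqs = {2_000_000_000, 2_000_000_000};

        // Case 2): counting terms stays small regardless of the stored values.
        System.out.println("counting terms: " + fieldLength(freqs, true));

        // Case 1): summing frequencies trips the check after only two tokens.
        try {
            fieldLength(freqs, false);
        } catch (IllegalArgumentException e) {
            System.out.println("summing freqs rejected: " + e.getMessage());
        }
    }
}
```

With the two large values above, the summing mode rejects the field while the counting mode reports a length of 2, which is the behavior I have in mind for case 2).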

> Bypass total frequency check if field uses custom term frequency
> ----------------------------------------------------------------
>
>                 Key: LUCENE-10048
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10048
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Tony Xu
>            Priority: Minor
>
> For all fields whose index option is not *IndexOptions.NONE*, there is a 
> check on the per-field total token count (i.e. field length) to ensure we 
> don't index too many tokens. This is done by accumulating each token's 
> *TermFrequencyAttribute*.
>  
> Given that Lucene currently allows a custom term frequency to be attached to 
> each token, and that the usage of the frequency can be pretty wild, it is 
> possible to have the following case, where the check fails with only a few 
> tokens that have large frequencies. Currently Lucene will skip indexing the 
> whole document.
> *"foo|<very large number> bar|<very large number>"*
>  
> What should be the way to inform the indexing chain not to check the field 
> length?
> A related observation: when custom term frequency is in use, the user is not 
> likely to use the similarity for this field. Maybe we can offer a way to 
> specify that, too?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

