[I] Bypass total frequency check if field uses custom term frequency [LUCENE-10048] [lucene]

via GitHub Fri, 16 Jan 2026 08:30:12 -0800


asfimport opened a new issue, #11086:
URL: https://github.com/apache/lucene/issues/11086


   For every field whose index option is not **IndexOptions.NONE**, there is a 
check on the per-field total token count (i.e. the field length) to ensure we 
don't index too many tokens. This is done by accumulating each token's 
**TermFrequencyAttribute**.
   
    
   
   Lucene currently allows a custom term frequency to be attached to each 
token, and the frequencies used can be arbitrarily large. As a result, the 
check can fail for a field with only a few tokens that carry large 
frequencies, as in the following case. When that happens, Lucene currently 
skips indexing the whole document.
   
   **"foo|<very large number> bar|<very large number>"**
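
To make the failure mode concrete, here is a minimal stand-alone sketch of the kind of accumulation described above. It is not Lucene's code: the class `FieldLengthModel` and the method `fieldLength` are hypothetical names, the `term|frequency` parsing simply mirrors the notation in the example, and the sketch assumes the running field length is a 32-bit `int` that rejects the field on overflow.

```java
// Hypothetical model of a per-field length check; not taken from Lucene.
final class FieldLengthModel {

    // Sums the frequency of every whitespace-separated token in the field.
    // A token written as "term|freq" contributes freq; a bare term contributes 1.
    // Assumption: the running total is an int and overflow rejects the field.
    static int fieldLength(String text) {
        int length = 0;
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty field yields one empty token from split()
            }
            int bar = token.lastIndexOf('|');
            int freq = (bar < 0) ? 1 : Integer.parseInt(token.substring(bar + 1));
            try {
                // addExact throws ArithmeticException instead of wrapping around
                length = Math.addExact(length, freq);
            } catch (ArithmeticException e) {
                throw new IllegalArgumentException("too many tokens for field", e);
            }
        }
        return length;
    }
}
```

Under these assumptions, `fieldLength("foo|3 bar|4")` returns 7, while a field with just two tokens, each carrying a frequency of `Integer.MAX_VALUE`, is rejected even though it contains only two terms.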
   
    
   
   What would be a good way to tell the indexing chain not to check the field 
length?
   
   A related observation: when custom term frequencies are in use, the user is 
unlikely to rely on the similarity for this field. Maybe we can offer a way to 
specify that, too?
   
   
   
   ---
   Migrated from 
[LUCENE-10048](https://issues.apache.org/jira/browse/LUCENE-10048) by Tony Xu 
(@Tony-X), 1 vote, resolved Aug 13 2021
   



