Tony Xu created LUCENE-10048:
--------------------------------

             Summary: Bypass total frequency check if field uses custom term 
frequency
                 Key: LUCENE-10048
                 URL: https://issues.apache.org/jira/browse/LUCENE-10048
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Tony Xu


For all fields whose index option is not *IndexOptions.NONE*, there is a check 
on the per-field total token count (i.e. the field length) to ensure we don't 
index too many tokens. This is done by accumulating the term frequency reported 
by each token's *TermFrequencyAttribute*.

Lucene currently allows a custom term frequency to be attached to each token, 
and applications use these frequencies in very different ways. As a result, the 
check can fail on a field that has only a few tokens, each carrying a large 
frequency, as in the following case:

*"foo|<very large number> bar|<very large number>"*
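To make the failure mode concrete, here is a minimal stand-alone sketch of the accumulation. It is a simplified model, not Lucene's actual indexing-chain code: the class name FieldLengthSketch, the method fieldLength, and the exception message are assumptions made for illustration. It assumes the field length is kept as an int and that an overflow of the running sum is reported as an error.

```java
// Simplified stand-alone model of the per-field length check.
// Hypothetical names; this does not use or reproduce Lucene classes.
final class FieldLengthSketch {

    // Sums the per-token term frequencies into an int field length.
    // Throws IllegalArgumentException when the running total overflows int.
    static int fieldLength(int[] termFreqs) {
        int length = 0;
        for (int freq : termFreqs) {
            try {
                // Math.addExact throws ArithmeticException on int overflow.
                length = Math.addExact(length, freq);
            } catch (ArithmeticException e) {
                throw new IllegalArgumentException("too many tokens for field", e);
            }
        }
        return length;
    }

    public static void main(String[] args) {
        // Default case: every token has frequency 1, so length == token count.
        System.out.println(fieldLength(new int[] {1, 1, 1})); // prints 3

        // Custom frequencies: two tokens are enough to trip the check.
        try {
            fieldLength(new int[] {Integer.MAX_VALUE, Integer.MAX_VALUE});
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this model the second call is rejected even though the field contains only two tokens, which is the situation described above.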

 

What should be the way to tell the indexing chain not to check the field length?

A related observation: when custom term frequencies are in use, the user is 
unlikely to use the similarity for this field. Maybe we can offer a way to 
specify that, too?



