[
https://issues.apache.org/jira/browse/LUCENE-8947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907086#comment-16907086
]
Adrien Grand commented on LUCENE-8947:
--------------------------------------
Changing it to a long might be challenging for norms, since the current
encoding relies on the fact that the length is an integer. Are you using norms?
I guess not. Maybe we could skip computing the field length when norms are
disabled?
> Indexing fails with "too many tokens for field" when using custom term
> frequencies
> ----------------------------------------------------------------------------------
>
> Key: LUCENE-8947
> URL: https://issues.apache.org/jira/browse/LUCENE-8947
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 7.5
> Reporter: Michael McCandless
> Priority: Major
>
> We are using custom term frequencies (LUCENE-7854) to index per-token scoring
> signals, however for one document that had many tokens and those tokens had
> fairly large (~998,000) scoring signals, we hit this exception:
> {noformat}
> 2019-08-05T21:32:37,048 [ERROR] (LuceneIndexing-3-thread-3) com.amazon.lucene.index.IndexGCRDocument: Failed to index doc:
> java.lang.IllegalArgumentException: too many tokens for field "foobar"
>   at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:825)
>   at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
>   at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
>   at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
>   at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
>   at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
>   at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)
> {noformat}
> This is happening in this code in {{DefaultIndexingChain.java}}:
> {noformat}
> try {
>   invertState.length = Math.addExact(invertState.length, invertState.termFreqAttribute.getTermFrequency());
> } catch (ArithmeticException ae) {
>   throw new IllegalArgumentException("too many tokens for field \"" + field.name() + "\"");
> }
> {noformat}
> This is where Lucene accumulates the total length (number of tokens) for the
> field. But total length doesn't really make sense if you are using custom
> term frequencies to hold arbitrary scoring signals. Or maybe it does make
> sense, if the user is using this as simple boosting, but maybe we should
> allow this length to be a {{long}}?
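A minimal standalone sketch (not Lucene code; the field-length values are hypothetical) of why accumulating large custom term frequencies with {{Math.addExact}} on an {{int}} throws the {{ArithmeticException}} above, while the same sum fits comfortably in a {{long}}:

```java
public class TermFreqOverflow {
    public static void main(String[] args) {
        // Hypothetical accumulated field length, close to Integer.MAX_VALUE
        // (2,147,483,647), plus one large custom term frequency like the
        // ~998,000 signals mentioned in the issue.
        int intLength = 2_147_000_000;
        int termFreq = 998_000;

        try {
            // Math.addExact throws ArithmeticException on int overflow,
            // which DefaultIndexingChain rewraps as IllegalArgumentException.
            intLength = Math.addExact(intLength, termFreq);
        } catch (ArithmeticException ae) {
            System.out.println("int overflow: too many tokens");
        }

        // The same accumulation as a long succeeds: 2,147,998,000 is far
        // below Long.MAX_VALUE.
        long longLength = Math.addExact(2_147_000_000L, (long) termFreq);
        System.out.println(longLength); // prints 2147998000
    }
}
```

This only illustrates the overflow mechanics; as the comment above notes, actually widening {{invertState.length}} to a {{long}} would also require changes to the norms encoding, which assumes an integer length.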
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)