Michael McCandless created LUCENE-8947:
------------------------------------------
Summary: Indexing fails with "too many tokens for field" when
using custom term frequencies
Key: LUCENE-8947
URL: https://issues.apache.org/jira/browse/LUCENE-8947
Project: Lucene - Core
Issue Type: Improvement
Affects Versions: 7.5
Reporter: Michael McCandless
We are using custom term frequencies (LUCENE-7854) to index per-token scoring
signals. However, for one document that had many tokens, each carrying a
fairly large (~998,000) scoring signal, we hit this exception:
{noformat}
2019-08-05T21:32:37,048 [ERROR] (LuceneIndexing-3-thread-3) com.amazon.lucene.index.IndexGCRDocument: Failed to index doc:
java.lang.IllegalArgumentException: too many tokens for field "foobar"
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:825)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
    at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
    at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
    at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)
{noformat}
This is happening in this code in {{DefaultIndexingChain.java}}:
{noformat}
try {
  invertState.length = Math.addExact(invertState.length,
                                     invertState.termFreqAttribute.getTermFrequency());
} catch (ArithmeticException ae) {
  throw new IllegalArgumentException("too many tokens for field \"" + field.name() + "\"");
}
{noformat}
This is where Lucene accumulates the total length (number of tokens) for the
field. But does total length really make sense if you are using custom term
frequencies to hold arbitrary scoring signals? Or maybe it does make sense if
the user is using this as simple boosting, in which case maybe we should allow
this length to be a {{long}}?
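
To make the overflow concrete, here is a minimal standalone sketch (not Lucene code) that mimics the accumulation above. The class and method names are hypothetical, and it assumes, purely for illustration, that every token carries the same frequency of 998,000. Under that assumption an {{int}} accumulator overflows on roughly the 2,152nd token, while a {{long}} accumulator has ample headroom:

```java
// Hypothetical illustration only: this mirrors the Math.addExact accumulation
// pattern from DefaultIndexingChain, but it is not part of Lucene.
final class FieldLengthSketch {

    /**
     * Adds termFreq to an int length once per token and returns the 1-based
     * index of the token whose addition overflows, or -1 if maxTokens tokens
     * were added without overflow.
     */
    static int tokensUntilIntOverflow(int termFreq, int maxTokens) {
        int length = 0;
        for (int i = 1; i <= maxTokens; i++) {
            try {
                length = Math.addExact(length, termFreq);
            } catch (ArithmeticException ae) {
                return i; // this is the point where Lucene reports "too many tokens"
            }
        }
        return -1;
    }

    /** Same accumulation, but with a long length; still overflow-checked. */
    static long longLength(int termFreq, int tokens) {
        long length = 0L;
        for (int i = 0; i < tokens; i++) {
            length = Math.addExact(length, (long) termFreq);
        }
        return length;
    }

    public static void main(String[] args) {
        // Assumed uniform per-token frequency, taken from the "~998,000" figure above.
        int termFreq = 998_000;
        System.out.println("int overflows at token #" + tokensUntilIntOverflow(termFreq, 10_000));
        System.out.println("long length after 10,000 tokens: " + longLength(termFreq, 10_000));
    }
}
```

Whether widening the real field length to a {{long}} is appropriate depends on how that length is consumed elsewhere in Lucene, which this sketch does not model.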