Pierre Villard created NIFI-12850:
-------------------------------------
Summary: Failure to index Provenance Events with large attributes
Key: NIFI-12850
URL: https://issues.apache.org/jira/browse/NIFI-12850
Project: Apache NiFi
Issue Type: Improvement
Components: Core Framework
Reporter: Pierre Villard
Assignee: Pierre Villard
{code:java}
ERROR org.apache.nifi.provenance.index.lucene.EventIndexTask: Failed to index Provenance Events
java.lang.IllegalArgumentException: Document contains at least one immense term in field="filename" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[49, 50, 55, 48, 54, 50, 51, 55, 51, 57, 51, 52, 53, 50, 56, 51, 53, 46, 48, 46, 97, 118, 114, 111, 46, 48, 46, 97, 118, 114]...', original message: bytes can be at most 32766 in length; got 74483
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:984)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:527)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:491)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:208)
    at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:415)
    at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1471)
    at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1444)
    at org.apache.nifi.provenance.lucene.LuceneEventIndexWriter.index(LuceneEventIndexWriter.java:70)
    at org.apache.nifi.provenance.index.lucene.EventIndexTask.index(EventIndexTask.java:202)
    at org.apache.nifi.provenance.index.lucene.EventIndexTask.run(EventIndexTask.java:113)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 74483
    at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:281)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:182)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.
{code}
From the code, filename appears to be the only attribute that can be set to
arbitrary values and is not currently protected against overly large values.
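
One possible way to guard against this, shown only as a rough sketch and not as the actual NiFi change: truncate the attribute value so that its UTF-8 encoding stays within the 32766-byte limit reported in the error before the value is handed to the indexer. The class name TermLengthGuard, the method truncateToUtf8Bytes and the constant MAX_TERM_BYTES are hypothetical names made up for this illustration; the sketch uses only the JDK and does not call any NiFi or Lucene API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of NiFi or Lucene.
final class TermLengthGuard {

    // Limit copied from the error message above ("max length 32766").
    static final int MAX_TERM_BYTES = 32766;

    // Returns the longest prefix of value whose UTF-8 encoding fits in maxBytes.
    // Walks whole code points so a surrogate pair is never split.
    static String truncateToUtf8Bytes(String value, int maxBytes) {
        if (value == null) {
            return null;
        }
        int bytes = 0;
        int i = 0;
        while (i < value.length()) {
            int cp = value.codePointAt(i);
            // Size of this code point in UTF-8. An unpaired surrogate is
            // counted as 3 bytes, which can only over-estimate the real size.
            int cpBytes = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + cpBytes > maxBytes) {
                break;
            }
            bytes += cpBytes;
            i += Character.charCount(cp);
        }
        return value.substring(0, i);
    }

    public static void main(String[] args) {
        // Build a 74483-character ASCII value, the size reported in the error.
        String filename = new String(new char[74483]).replace('\0', 'a');
        String safe = truncateToUtf8Bytes(filename, MAX_TERM_BYTES);
        System.out.println(filename.getBytes(StandardCharsets.UTF_8).length);
        System.out.println(safe.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Truncating by encoded bytes rather than by character count matters here because the limit in the error is expressed in UTF-8 bytes, so a value with non-ASCII characters can exceed it with far fewer than 32766 characters. Whether a truncated filename should still be indexed, or skipped entirely, is a separate decision that this sketch does not address.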
--
This message was sent by Atlassian Jira
(v8.20.10#820010)