Timothy Potter created SOLR-7058:
------------------------------------
Summary: Data-driven schema needs to index large text fields as
text and not as string
Key: SOLR-7058
URL: https://issues.apache.org/jira/browse/SOLR-7058
Project: Solr
Issue Type: Improvement
Components: Data-driven Schema
Reporter: Timothy Potter
While using the SimplePostTool to index some Freebase articles into a core that
uses our data-driven configs, I ran into the following gem:
{code}
Caused by: java.lang.IllegalArgumentException: Document contains at least one
immense term in field="xml_data" (whose UTF8 encoding is longer than the max
length 32766), all of which were skipped. Please correct the analyzer to not
produce such terms. The prefix of the first immense term is: '[60, 63, 120,
109, 108, 32, 118, 101, 114, 115, 105, 111, 110, 61, 34, 49, 46, 48, 34, 32,
101, 110, 99, 111, 100, 105, 110, 103, 61, 34]...', original message: bytes can
be at most 32766 in length; got 173684
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:449)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1415)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:242)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
{code}
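For reference, the byte prefix quoted in the message decodes to the start of an
XML declaration, which suggests the entire XML payload was handed to the index
as a single term. A quick sketch to decode it (the class and method names are
only for illustration and are not part of Solr or Lucene):

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper only; not part of Solr or Lucene.
final class ImmenseTermPrefix {
    // Byte values copied verbatim from the exception message.
    static final int[] PREFIX = {
        60, 63, 120, 109, 108, 32, 118, 101, 114, 115, 105, 111, 110, 61, 34,
        49, 46, 48, 34, 32, 101, 110, 99, 111, 100, 105, 110, 103, 61, 34
    };

    // Converts the numeric byte values back into a UTF-8 string.
    static String decode(int[] values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // prints: <?xml version="1.0" encoding="
        System.out.println(decode(PREFIX));
    }
}
```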
Ideally, the data-driven configs would index large text fields containing
multiple whitespace-delimited tokens as text and not as a string. However, this
poses an issue if the first doc has a short text value that looks like a string
and a later doc has a large one for the same field. Not sure what the right
solution looks like yet, but wanted to capture the issue so we can discuss
options.
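
To make the ordering problem concrete, here is a minimal sketch of a
per-value guess. It is not how the data-driven configs work today and not a
proposed fix; the class, the method names, and the "text"/"string" labels are
assumptions made up for illustration. The only value taken from the error
above is the 32766-byte limit.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; none of these names exist in Solr.
final class FieldTypeGuess {
    // Maximum UTF-8 length of a single term, as quoted in the exception.
    static final int MAX_TERM_BYTES = 32766;

    // True when the value is too long to be indexed as one term.
    static boolean exceedsMaxTermLength(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length > MAX_TERM_BYTES;
    }

    // True when the value has more than one whitespace-delimited token.
    static boolean hasMultipleTokens(String value) {
        return value.trim().split("\\s+").length > 1;
    }

    // Guesses "text" only for large, multi-token values; otherwise "string".
    static String guess(String value) {
        if (exceedsMaxTermLength(value) && hasMultipleTokens(value)) {
            return "text";
        }
        return "string";
    }

    public static void main(String[] args) {
        StringBuilder large = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            large.append("word ");
        }
        // The first doc's short value and a later doc's large value disagree.
        System.out.println(guess("short title"));
        System.out.println(guess(large.toString()));
    }
}
```

With a guess made per value like this, the first doc would fix the field as a
string and the later large doc would still hit the immense-term error, which is
exactly the case that needs discussion.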