[
https://issues.apache.org/jira/browse/SOLR-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jan Høydahl resolved SOLR-7058.
-------------------------------
Resolution: Duplicate
Resolving as duplicate of SOLR-6966.
Again, I think this is a bad idea: it's hopeless to detect the difference. We
need to define a sane default and fix the OOTB ability to also search all text.
Once users get past the basics, they'll start customizing the schema through
the API.
> Data-driven schema needs to index large text fields as text and not as string
> -----------------------------------------------------------------------------
>
> Key: SOLR-7058
> URL: https://issues.apache.org/jira/browse/SOLR-7058
> Project: Solr
> Issue Type: Improvement
> Components: Data-driven Schema
> Reporter: Timothy Potter
>
> While using the SimplePostTool to index some freebase articles into a core
> that uses our data-driven configs, I ran into the following gem:
> {code}
> Caused by: java.lang.IllegalArgumentException: Document contains at least one
> immense term in field="xml_data" (whose UTF8 encoding is longer than the max
> length 32766), all of which were skipped. Please correct the analyzer to not
> produce such terms. The prefix of the first immense term is: '[60, 63, 120,
> 109, 108, 32, 118, 101, 114, 115, 105, 111, 110, 61, 34, 49, 46, 48, 34, 32,
> 101, 110, 99, 111, 100, 105, 110, 103, 61, 34]...', original message: bytes
> can be at most 32766 in length; got 173684
> at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
> at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
> at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
> at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
> at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:449)
> at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1415)
> at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:242)
> at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
> {code}
> Ideally, the data-driven configs would index large text fields containing
> multiple whitespace-delimited tokens as text and not as strings. However,
> this obviously poses an issue if the first doc has a short text value that
> looks like a string and then the next doc has a large one. Not sure what the
> right solution looks like yet, but wanted to capture the issue so we can
> discuss options.
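
For readers following the quoted error: the limit is on the UTF-8 byte length of a single term, and an untokenized string field turns its whole value into one term. The following is only a rough sketch of that check, not Solr or Lucene code. The class and method names are hypothetical, and the constant simply copies the 32766 figure from the error message above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ImmenseTermCheck {
    // Assumption: value copied from the quoted error message
    // ("bytes can be at most 32766 in length"); not read from Lucene.
    public static final int MAX_TERM_BYTES = 32766;

    // Length of the value once encoded as UTF-8, which is what the limit counts.
    public static int utf8Length(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if indexing the whole value as a single term would exceed the limit.
    public static boolean exceedsTermLimit(String value) {
        return utf8Length(value) > MAX_TERM_BYTES;
    }

    public static void main(String[] args) {
        String shortValue = "a short title";

        // Build a value the same size as the one reported in the issue.
        char[] chars = new char[173684];
        Arrays.fill(chars, 'x');
        String largeValue = new String(chars);

        System.out.println(exceedsTermLimit(shortValue)); // false
        System.out.println(exceedsTermLimit(largeValue)); // true
    }
}
```

A check like this only shows why the second document fails. It does not address the ordering problem raised above, where the field type has already been guessed from an earlier, shorter value.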