[ 
https://issues.apache.org/jira/browse/SOLR-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl resolved SOLR-7058.
-------------------------------
    Resolution: Duplicate

Resolving as duplicate of SOLR-6966.

Again, I think this is a bad idea: it's hopeless to detect the difference. We 
need to define a sane default and fix the OOTB ability to also search all text. 
Once users get past the basics, they'll start customizing the schema through the API.

> Data-driven schema needs to index large text fields as text and not as string
> -----------------------------------------------------------------------------
>
>                 Key: SOLR-7058
>                 URL: https://issues.apache.org/jira/browse/SOLR-7058
>             Project: Solr
>          Issue Type: Improvement
>          Components: Data-driven Schema
>            Reporter: Timothy Potter
>
> While using the SimplePostTool to index some freebase articles into a core 
> that uses our data-driven configs, I ran into the following gem:
> {code}
> Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="xml_data" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[60, 63, 120, 109, 108, 32, 118, 101, 114, 115, 105, 111, 110, 61, 34, 49, 46, 48, 34, 32, 101, 110, 99, 111, 100, 105, 110, 103, 61, 34]...', original message: bytes can be at most 32766 in length; got 173684
>       at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
>       at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
>       at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
>       at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
>       at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:449)
>       at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1415)
>       at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:242)
>       at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
> {code}
> Ideally, the data-driven configs would index large text fields containing 
> multiple tokens (whitespace delimited) as text and not as a string. However, 
> this obviously poses an issue if the first doc has a short text value that 
> looks like a string and then the next doc has a large one. Not sure what the 
> right solution looks like yet, but wanted to capture the issue so we can 
> discuss options.
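
For discussion, here is a minimal sketch of the kind of value-based guess the description asks for. It is not Solr's actual behavior: the class `FieldTypeGuess`, the method `guessType`, and the `largeBytes` threshold are all hypothetical names introduced for illustration. Only the 32766-byte limit comes from the exception message above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Solr or Lucene.
final class FieldTypeGuess {
    // Maximum UTF-8 length of a single indexed term, as reported in the exception message.
    static final int MAX_TERM_BYTES = 32766;

    // Guesses "text" or "string" from a single field value.
    // largeBytes is an assumed, tunable threshold for what counts as "large".
    static String guessType(String value, int largeBytes) {
        int utf8Length = value.getBytes(StandardCharsets.UTF_8).length;
        if (utf8Length > MAX_TERM_BYTES) {
            // Too long to index as one untokenized term, so it cannot be a string field.
            return "text";
        }
        // "Multiple tokens (whitespace delimited)", as in the issue description.
        boolean multiToken = value.trim().split("\\s+").length > 1;
        return (multiToken && utf8Length >= largeBytes) ? "text" : "string";
    }

    public static void main(String[] args) {
        System.out.println(guessType("SOLR-7058", 256));
        System.out.println(guessType("large text fields containing multiple tokens", 16));
    }
}
```

A guess like this only sees one value at a time, so it does not solve the ordering problem raised above: a short first value can be classified as a string before a later document supplies a large one for the same field.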



