[jira] [Created] (SOLR-7058) Data-driven schema needs to index large text fields as text and not as string

Timothy Potter (JIRA) Wed, 28 Jan 2015 15:32:56 -0800

Timothy Potter created SOLR-7058:
------------------------------------

             Summary: Data-driven schema needs to index large text fields as text and not as string
                 Key: SOLR-7058
                 URL: https://issues.apache.org/jira/browse/SOLR-7058
             Project: Solr
          Issue Type: Improvement
          Components: Data-driven Schema
            Reporter: Timothy Potter



While using the SimplePostTool to index some Freebase articles into a core that 
uses our data-driven configs, I ran into the following gem:

{code}
Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="xml_data" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[60, 63, 120, 109, 108, 32, 118, 101, 114, 115, 105, 111, 110, 61, 34, 49, 46, 48, 34, 32, 101, 110, 99, 111, 100, 105, 110, 103, 61, 34]...', original message: bytes can be at most 32766 in length; got 173684
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:667)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:449)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1415)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:242)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
{code}
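
For context, the byte prefix in the message can be decoded to see what the 
immense term actually is. A minimal Python sketch, assuming the prefix bytes 
are UTF-8 and that the 32766-byte limit quoted in the message is the only 
check; the helper names are hypothetical and not part of Lucene or Solr:

```python
# Limit and prefix copied from the exception message above.
MAX_TERM_BYTES = 32766
PREFIX = [60, 63, 120, 109, 108, 32, 118, 101, 114, 115, 105, 111, 110, 61, 34,
          49, 46, 48, 34, 32, 101, 110, 99, 111, 100, 105, 110, 103, 61, 34]


def decode_prefix(byte_values):
    """Decode the reported term prefix, assuming it is UTF-8 (hypothetical helper)."""
    return bytes(byte_values).decode("utf-8")


def exceeds_term_limit(value, limit=MAX_TERM_BYTES):
    """True if the value, indexed as a single term, is longer than the limit in UTF-8 bytes."""
    return len(value.encode("utf-8")) > limit


print(decode_prefix(PREFIX))  # <?xml version="1.0" encoding="
```

The prefix is the start of an XML declaration, so the whole 173684-byte XML 
payload was handed to the indexer as one term, which is what a string field 
does.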

Ideally, the data-driven configs would index large text fields containing 
multiple whitespace-delimited tokens as text and not as a string. However, 
this poses a problem if the first doc has a short value that looks like a 
string and a later doc has a large one, since the field type is chosen when 
the field is first seen. Not sure what the right solution looks like yet, but 
I wanted to capture the issue so we can discuss options.
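
To make the ordering problem concrete, here is a rough Python sketch of a 
per-value type guess. The function name, the 256-character threshold, and the 
"large and multi-token" rule are all assumptions for illustration, not how the 
data-driven configs currently decide:

```python
def guess_field_type(value, min_text_chars=256):
    """Hypothetical heuristic: 'text' for large multi-token values, else 'string'."""
    has_multiple_tokens = len(value.split()) > 1  # whitespace-delimited tokens
    is_large = len(value) >= min_text_chars       # threshold is an arbitrary assumption
    return "text" if (has_multiple_tokens and is_large) else "string"


first_doc_value = "Solr"                 # short, looks like a string
later_doc_value = "lorem ipsum " * 100   # large, many tokens

print(guess_field_type(first_doc_value))  # string
print(guess_field_type(later_doc_value))  # text
```

If the field is created from the first value, it ends up as a string even 
though a later value for the same field would have been guessed as text.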



