Steve Rowe commented on SOLR-8495:

I looked at [~caomanhdat]'s patch, and I think there's way more machinery there 
than we need to address the problem.  A couple things I noticed:

* ChunkTokenizer splits values at a maximum token length (rather than 
truncating), but I can't think of a good use for that behavior.
* ParseLongStringFieldUpdateProcessorFactory extends 
NumericFieldUpdateProcessorFactory, which doesn't make sense, since there's no 
parsing going on, and LongStringField isn't numeric. 
* ParseLongStringFieldUpdateProcessor.mutateValue() uses 
String.getBytes(Charset.defaultCharset()) to determine a value's length, but 
Lucene uses UTF-8 to encode terms, so UTF-8 should be used when testing 
value lengths. 
* I don't think we need new tokenizers or processors or field types here.
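
To illustrate the third point above, here's a minimal sketch (class and method names are mine, not from the patch) of testing a value's length the way Lucene will actually see it, with an explicit UTF-8 charset instead of the platform default:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    // Lucene's per-term byte limit, as reported in the exception quoted below.
    static final int MAX_TERM_BYTES = 32766;

    // Measure the bytes Lucene will actually write: the UTF-8 encoding,
    // never Charset.defaultCharset() (which varies by platform/locale).
    static int utf8Length(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    static boolean fitsInTerm(String value) {
        return utf8Length(value) <= MAX_TERM_BYTES;
    }
}
```

Note that char count and UTF-8 byte count diverge for non-ASCII input: "é" is one Java char but two UTF-8 bytes, and a CJK char is three, which is exactly why default-charset length checks are unreliable here.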

I agree with [~hossman] that his SOLR-9526 approach is the way to go (including 
his TruncateFieldUpdateProcessorFactory idea mentioned above, to address the 
problem described on this issue). His suggested "10000" limit neatly avoids 
worrying about encoded length issues, since each char can take up at most 3 
UTF-8 encoded bytes, and 3*10000 = 30000 is less than the 32,766 byte limit.
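
The arithmetic behind that limit can be sketched as follows (the class and constant names are mine, for illustration; this is not the actual processor implementation). A Java char is one UTF-16 code unit, which encodes to at most 3 UTF-8 bytes, so truncating by char count never requires an encoding pass:

```java
public class TruncateSketch {
    static final int MAX_CHARS = 10000;            // suggested char limit
    static final int MAX_UTF8_BYTES_PER_CHAR = 3;  // worst case per UTF-16 code unit
    static final int LUCENE_TERM_LIMIT = 32766;    // Lucene's per-term byte limit

    // 10000 chars * 3 bytes/char = 30000 bytes, safely under 32766,
    // so a simple char-based substring is always term-length safe.
    static String truncate(String value) {
        return value.length() <= MAX_CHARS ? value : value.substring(0, MAX_CHARS);
    }
}
```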

bq. Autodetect space-separated text above a (customizable? maybe 256 bytes or 
so by default?) threshold as tokenized text rather than as StrField.
I'm leery of an approach like this, because it would be extremely trappy 
depending on the order docs were indexed.

I agree. Hoss's SOLR-9526 approach will index everything as text_general but 
then add "string" fieldtype copies for values that aren't "too long".

> Schemaless mode cannot index large text fields
> ----------------------------------------------
>                 Key: SOLR-8495
>                 URL: https://issues.apache.org/jira/browse/SOLR-8495
>             Project: Solr
>          Issue Type: Bug
>          Components: Data-driven Schema, Schema and Analysis
>    Affects Versions: 4.10.4, 5.3.1, 5.4
>            Reporter: Shalin Shekhar Mangar
>              Labels: difficulty-easy, impact-medium
>             Fix For: 5.5, 6.0
>         Attachments: SOLR-8495.patch
> The schemaless mode by default indexes all string fields into an indexed 
> StrField which is limited to 32KB text. Anything larger than that leads to an 
> exception during analysis.
> {code}
> Caused by: java.lang.IllegalArgumentException: Document contains at least one 
> immense term in field="text" (whose UTF8 encoding is longer than the max 
> length 32766)
> {code}
