[ https://issues.apache.org/jira/browse/SOLR-8495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511574#comment-15511574 ]
Steve Rowe commented on SOLR-8495:
----------------------------------

I looked at [~caomanhdat]'s patch, and I think there's way more machinery there than we need to address the problem. A few things I noticed:

* ChunkTokenizer splits values at a maximum token length (rather than truncating), but I can't think of a good use for that behavior.
* ParseLongStringFieldUpdateProcessorFactory extends NumericFieldUpdateProcessorFactory, which doesn't make sense: there's no parsing going on, and LongStringField isn't numeric.
* ParseLongStringFieldUpdateProcessor.mutateValue() uses String.getBytes(Charset.defaultCharset()) to determine a value's length, but Lucene uses UTF-8 to encode terms, so UTF-8 should be used when testing value lengths. (Sketch at the bottom of this message.)
* I don't think we need new tokenizers, processors, or field types here. I agree with [~hossman] that his SOLR-9526 approach is the way to go, including his TruncateFieldUpdateProcessorFactory idea mentioned above to address the problem described on this issue. His suggested "10000" limit neatly avoids worrying about encoded length issues: each char can take up at most 3 UTF-8 encoded bytes, and 3 * 10000 = 30000 is less than the 32,766-byte IndexWriter.MAX_TERM_LENGTH. (Example config at the bottom of this message.)

{quote}
bq. Autodetect space-separated text above a (customizable? maybe 256 bytes or so by default?) threshold as tokenized text rather than as StrField.

I'm leery of an approach like this, because it would be extremely trappy depending on the order docs were indexed
{quote}

I agree: Hoss's SOLR-9526 approach will index everything as text_general, but then add "string" fieldtype copies for values that aren't "too long".

> Schemaless mode cannot index large text fields
> ----------------------------------------------
>
>                 Key: SOLR-8495
>                 URL: https://issues.apache.org/jira/browse/SOLR-8495
>             Project: Solr
>          Issue Type: Bug
>          Components: Data-driven Schema, Schema and Analysis
>    Affects Versions: 4.10.4, 5.3.1, 5.4
>            Reporter: Shalin Shekhar Mangar
>              Labels: difficulty-easy, impact-medium
>             Fix For: 5.5, 6.0
>
>         Attachments: SOLR-8495.patch
>
>
> Schemaless mode by default indexes all string fields into an indexed StrField, which is limited to ~32KB of text. Anything larger than that leads to an exception during analysis:
> {code}
> Caused by: java.lang.IllegalArgumentException: Document contains at least one
> immense term in field="text" (whose UTF8 encoding is longer than the max
> length 32766)
> {code}
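
A minimal sketch of the charset point from my third bullet above (the class and helper names are mine, invented for illustration; the only substantive difference is StandardCharsets.UTF_8 vs. Charset.defaultCharset()):

{code}
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TermLengthSketch {
  // Mirrors org.apache.lucene.index.IndexWriter.MAX_TERM_LENGTH: 32,766 bytes.
  static final int MAX_TERM_LENGTH = 32766;

  // What the patch does: platform-dependent, and wrong wherever the
  // default charset isn't UTF-8.
  static int platformByteLength(String value) {
    return value.getBytes(Charset.defaultCharset()).length;
  }

  // What it should do: measure the UTF-8 encoding, since that's what
  // IndexWriter actually checks against MAX_TERM_LENGTH.
  static boolean fitsInOneTerm(String value) {
    return value.getBytes(StandardCharsets.UTF_8).length <= MAX_TERM_LENGTH;
  }

  public static void main(String[] args) {
    // U+20AC (EURO SIGN) is 3 bytes in UTF-8, so 11,000 of them (33,000
    // bytes) blow the limit, while a single-byte default charset would
    // count only 11,000 bytes and wrongly accept the value.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 11000; i++) sb.append('\u20AC');
    String s = sb.toString();
    System.out.println("default-charset bytes: " + platformByteLength(s));
    System.out.println("fits in one term? " + fitsInOneTerm(s));
  }
}
{code}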
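
And the example config from my fourth bullet: the stock TruncateFieldUpdateProcessorFactory can already do the truncation, so something like the following in the add-unknown-fields-to-the-schema chain should suffice (the typeClass selector here is just illustrative; the actual field selection will depend on how SOLR-9526 shakes out):

{code}
<!-- Truncate StrField values to 10,000 chars. Worst case that's
     3 bytes/char * 10,000 = 30,000 UTF-8 bytes, safely under the
     32,766-byte IndexWriter.MAX_TERM_LENGTH. -->
<processor class="solr.TruncateFieldUpdateProcessorFactory">
  <str name="typeClass">solr.StrField</str>
  <int name="maxLength">10000</int>
</processor>
{code}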