Fabrizio Fortino created OAK-9123:
-------------------------------------

             Summary: Error: Document contains at least one immense term
                 Key: OAK-9123
                 URL: https://issues.apache.org/jira/browse/OAK-9123
             Project: Jackrabbit Oak
          Issue Type: Bug
          Components: elastic-search, indexing, search
            Reporter: Fabrizio Fortino
            Assignee: Fabrizio Fortino


{code:java}
11:35:09.400 [I/O dispatcher 1] ERROR o.a.j.o.p.i.e.i.ElasticIndexWriter - Bulk item with id /wikipedia/76/84/National Palace (Mexico) failed
org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=illegal_argument_exception, reason=Document contains at least one immense term in field="text.keyword" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[123, 123, 73, 110, 102, 111, 98, 111, 120, 32, 104, 105, 115, 116, 111, 114, 105, 99, 32, 98, 117, 105, 108, 100, 105, 110, 103, 10, 124, 110]...', original message: bytes can be at most 32766 in length; got 33409]
at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:496)
at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:407)
at org.elasticsearch.action.bulk.BulkItemResponse.fromXContent(BulkItemResponse.java:138)
at org.elasticsearch.action.bulk.BulkResponse.fromXContent(BulkResponse.java:196)
at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:1888)
at org.elasticsearch.client.RestHighLevelClient.lambda$performRequestAsyncAndParseEntity$10(RestHighLevelClient.java:1676)
at org.elasticsearch.client.RestHighLevelClient$1.onSuccess(RestHighLevelClient.java:1758)
at org.elasticsearch.client.RestClient$FailureTrackingResponseListener.onSuccess(RestClient.java:590)
at org.elasticsearch.client.RestClient$1.completed(RestClient.java:333)
at org.elasticsearch.client.RestClient$1.completed(RestClient.java:327)
at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:122)
at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:181)
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:448)
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:338)
at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=max_bytes_length_exceeded_exception, reason=bytes can be at most 32766 in length; got 33409]
at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:496)
at org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:407)
at org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:437)
... 24 common frames omitted{code}

This happens with huge keyword fields: a keyword field indexes its whole value 
as a single term, and Lucene rejects any term whose UTF-8 encoding is longer 
than 32766 bytes.

See 
[https://discuss.elastic.co/t/error-document-contains-at-least-one-immense-term-in-field/66486]
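
To make the limit concrete, here is a minimal sketch of the check that fails for this document. The class and method names are hypothetical and are not part of Oak or Elasticsearch; only the 32766-byte figure comes from the error message above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class ImmenseTermCheck {
    // Maximum term length in bytes, as quoted in the error message.
    static final int MAX_TERM_BYTES = 32766;

    // Length of the value once encoded as UTF-8, which is what Lucene measures.
    static int utf8Length(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if indexing the whole value as one term would exceed the limit.
    static boolean isImmense(String value) {
        return utf8Length(value) > MAX_TERM_BYTES;
    }
}
```

The comparison is on bytes, not characters, so text with many non-ASCII characters reaches the limit with fewer characters than plain ASCII text.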

We decided to always create keyword fields so that index definitions no longer 
need properties like ordered or facet. This way every field can be sorted on 
or used as a facet.

In this specific case the keyword field is not needed at all, but in general 
it is hard to decide when to include it. To solve this we are going to set 
`ignore_above=256`, so that values longer than 256 characters are not indexed 
in the keyword field.
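
As a rough sketch of what such a mapping could look like, the snippet below builds the JSON for a text field with a `keyword` sub-field capped by `ignore_above`. The class and method names are hypothetical and this is not the actual Oak change; it only assumes the standard Elasticsearch multi-field mapping layout, which matches the `text.keyword` field name in the error.

```java
// Hypothetical sketch, for illustration only; not the Oak implementation.
final class KeywordMappingSketch {
    // Value proposed in this issue.
    static final int IGNORE_ABOVE = 256;

    // Returns a mapping fragment: a text field plus a keyword sub-field
    // that skips values longer than IGNORE_ABOVE characters.
    static String fieldMapping(String fieldName) {
        return "{\"properties\":{\"" + fieldName + "\":{"
                + "\"type\":\"text\","
                + "\"fields\":{\"keyword\":{"
                + "\"type\":\"keyword\","
                + "\"ignore_above\":" + IGNORE_ABOVE
                + "}}}}}";
    }
}
```

Note that `ignore_above` counts characters while Lucene counts bytes. Since a UTF-8 character occupies at most 4 bytes, a 256-character value stays well under the 32766-byte limit.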



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
