Hi,

While indexing various Facebook comments, I sometimes get exceptions like:

IllegalArgumentException: Document contains at least one immense term...

Is it possible to sanitize text before indexing it in Elasticsearch so that it 
doesn't throw these exceptions? Is there perhaps a filter that removes overly 
long Unicode terms?
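
What I have in mind is something like the rough Java sketch below (just an 
idea, not tested against my real data): since Lucene rejects any term whose 
UTF-8 encoding is longer than 32766 bytes, I could drop whitespace-separated 
tokens above that limit before indexing. The whitespace splitting is an 
assumption on my part and won't match the analyzer's tokenization exactly:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.stream.Collectors;

public final class TermSanitizer {

    // Lucene rejects any term whose UTF-8 encoding exceeds 32766 bytes;
    // that limit is what produces the "immense term" IllegalArgumentException.
    private static final int MAX_TERM_BYTES = 32766;

    // Drops whitespace-separated tokens whose UTF-8 encoding is too long.
    public static String sanitize(String text) {
        return Arrays.stream(text.split("\\s+"))
                .filter(t -> t.getBytes(StandardCharsets.UTF_8).length <= MAX_TERM_BYTES)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String immense = new String(new char[40000]).replace('\0', 'x');
        String input = "normal words " + immense + " more words";
        // Prints "normal words more words"; the immense token is gone.
        System.out.println(sanitize(input));
    }
}

If Elasticsearch already ships a token filter that does this inside the 
analyzer, that would of course be preferable to pre-processing the strings 
myself.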

For details about the failing documents, see my (unanswered) Stack Overflow 
question: 
http://stackoverflow.com/questions/28941570/remove-long-unicode-terms-from-string-in-java
(I'm afraid of breaking yet another Elasticsearch-based mailing-list crawler, 
so I'd better not paste the failing document text here ;-) )
