Thank you Cédric Hourcade !

Le vendredi 20 juin 2014 15:32:29 UTC+2, Cédric Hourcade a écrit :
>
> If your base64-encoded strings are long, they are going to be split into a lot 
> of tokens by the standard tokenizer. 
>
> These tokens are often going to be a lot longer than standard words, 
> so your nGram filter will generate even more tokens, a lot more than 
> with standard text. That may be your problem there. 
>
> You should really try to strip the encoded images from your documents 
> with a simple regex before indexing them. If you need to keep the 
> source, put the raw text in an unindexed field, and the cleaned one in 
> another. 
>
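The stripping step suggested above can be sketched like this: a minimal Python example that removes inline base64 data URIs with a regex before indexing. The regex and the sample document are assumptions for illustration (they cover the common `data:image/...;base64,...` embedding from HTML bodies), not the exact pattern Cédric had in mind; adapt it to however your documents embed encoded images.

```python
import re

# Assumption: images are embedded as data URIs, e.g. in HTML email bodies.
# Matches things like: data:image/png;base64,iVBORw0KGgo...
DATA_URI_RE = re.compile(r'data:image/[a-z+.-]+;base64,[A-Za-z0-9+/=]+')

def strip_base64_images(text):
    """Remove inline base64-encoded images so the tokenizer and nGram
    filter never see the long encoded blobs."""
    return DATA_URI_RE.sub('', text)

# Hypothetical document with an embedded image:
doc = 'Hello <img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUg==">world'
cleaned = strip_base64_images(doc)
# 'cleaned' keeps the surrounding text but drops the base64 payload.
```

You would then index `cleaned` in your analyzed (nGram) field, and, if you need the original, store `doc` in a second field mapped with `"index": false` so it is kept but never tokenized.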
