Thank you Cédric Hourcade!

On Friday, June 20, 2014 at 15:32:29 UTC+2, Cédric Hourcade wrote:
>
> If your base64 encodings are long, they are going to be split into a
> lot of tokens by the standard tokenizer.
>
> These tokens are often going to be much longer than standard words,
> so your nGram filter will generate even more tokens, far more than it
> would for ordinary text. That may be your problem.
>
> You should really try to strip the encoded images from your documents
> with a simple regex before indexing them. If you need to keep the
> source, put the raw text in an unindexed field and the cleaned one in
> another.
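For anyone who finds this thread later, here is a rough sketch of that idea in Python with the elasticsearch-py client. The specifics are assumptions, not something from Cédric's post: the regex supposes the images appear as data URIs, the index and field names ("docs", "raw", "content") are made up, and "index": "no" is the 1.x-era mapping syntax for storing a field without indexing it.

import re

from elasticsearch import Elasticsearch

# Assumption: embedded images appear as data URIs like
# "data:image/png;base64,iVBORw0...". Adjust the pattern to however
# the encoded images actually appear in your documents.
BASE64_IMAGE = re.compile(r"data:image/[\w+.-]+;base64,[A-Za-z0-9+/=]+")

def strip_images(text):
    """Drop embedded base64 images so they never reach the analyzer."""
    return BASE64_IMAGE.sub("", text)

es = Elasticsearch()

# Hypothetical mapping: the untouched source goes in an unindexed
# field, and only the cleaned text is analyzed (and nGram-filtered).
es.indices.create(index="docs", body={
    "mappings": {
        "doc": {
            "properties": {
                "raw":     {"type": "string", "index": "no"},
                "content": {"type": "string"},
            }
        }
    }
})

raw = open("example.html").read()
es.index(index="docs", doc_type="doc", body={
    "raw": raw,
    "content": strip_images(raw),
})

This way the huge base64 runs never hit the tokenizer at all, but the original document can still be retrieved from the raw field.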
