If your base64-encoded images are long, they are going to be split into a lot of tokens by the standard tokenizer.
These tokens are often much longer than ordinary words, so your nGram filter will generate even more tokens, far more than with standard text. That may well be your problem here. You should really strip the encoded images out of your documents with a simple regex before indexing them. If you need to keep the source, put the raw text in an unindexed field and the cleaned text in another, as sketched below.
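For example, here is a minimal sketch of that cleanup step, assuming the images are embedded as data: URIs (the regex and the field names are just placeholders, adjust them to whatever your documents actually contain):

import re

# Assumption: base64 images appear as "data:image/...;base64,..." blobs.
DATA_URI_RE = re.compile(r'data:image/[\w.+-]+;base64,[A-Za-z0-9+/=\s]+')

def strip_encoded_images(text):
    """Drop inline base64 images so the analyzer / nGram filter never sees them."""
    return DATA_URI_RE.sub('', text)

if __name__ == '__main__':
    original = 'Hello <img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUg=="> world'
    doc = {
        "raw_content": original,                    # stored as-is, not indexed
        "content": strip_encoded_images(original),  # cleaned text that gets analyzed
    }
    print(doc["content"])

On the mapping side, the raw copy can then be stored without being analyzed or indexed (for instance "index": "no" on 1.x mappings, or "index": false on more recent versions), so only the cleaned field ever goes through the nGram filter.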
