Hello Elasticsearch Community,

We index the content of some files, using Apache Tika to extract it. What I'm worried about is that some of the documents contain "junk" content, such as a large number of numeric cells in an Excel file. In that case we pollute the index with many tokens that are of no use at all, since nobody will search for them. The same thing happens if someone pastes binary data into a text file.
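To make the two cases above concrete, here is a naive sketch of the kind of check I have in mind, run on the extracted text before indexing. It is only an illustration: the function name `looks_like_junk` and both thresholds are made up for this example, and it uses no Elasticsearch or Tika API.

```python
def looks_like_junk(text, max_numeric_ratio=0.8, max_control_ratio=0.1):
    """Rough heuristic: flag text dominated by numbers or control characters.

    Both thresholds are arbitrary assumptions and would need tuning
    against real documents.
    """
    if not text:
        return False

    tokens = text.split()

    # A token counts as "numeric" if it has at least one digit and no letters,
    # e.g. "42", "7,000" or "34.5".
    numeric = sum(
        1
        for t in tokens
        if any(c.isdigit() for c in t) and not any(c.isalpha() for c in t)
    )
    numeric_ratio = numeric / len(tokens) if tokens else 0.0

    # Non-printable, non-whitespace characters hint at pasted binary data.
    control = sum(1 for c in text if not c.isprintable() and not c.isspace())
    control_ratio = control / len(text)

    return numeric_ratio > max_numeric_ratio or control_ratio > max_control_ratio


if __name__ == "__main__":
    print(looks_like_junk("12 34.5 600 7,000 42"))
    print(looks_like_junk("Quarterly report for 2014 shows revenue growth"))
    print(looks_like_junk("abc\x00\x01\x02\x03def"))
```

A check like this would treat a whole document as junk or not, which is probably too coarse for a spreadsheet that mixes useful text with numeric columns.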
Is there a good way (in ES or external) to detect whether content may be "junk"?

Thanks,
Igor
