Hello Elasticsearch Community,

We index the content of some files, using Apache Tika to extract it. 
What I'm worried about is that some of the documents contain "junk" 
content, such as long runs of numbers in Excel spreadsheets. In such a case 
we'll pollute the index with many tokens that won't be useful at all, since 
nobody will search for them. The same thing happens if someone pastes binary 
data into a text file.

Is there a good way (in Elasticsearch or external) to detect whether content 
may be "junk"?

Thanks,
Igor
