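Possibly useful: the limit token count token filter, which caps the number of
tokens indexed per document and field:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-limit-token-count-tokenfilter.html#analysis-limit-token-count-tokenfilter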
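
A rough sketch of index settings that wire that filter into a custom
analyzer, in case it helps. The names "limit_tokens" and "capped_text" and
the cap of 10000 are made up for illustration; only the filter type "limit"
and its "max_token_count" parameter come from the linked page. Note that this
only bounds how many tokens a junk document can add; it does not detect junk.

```python
import json

# Assumed cap on tokens per document and field; tune for your own data.
MAX_TOKENS = 10000

# "limit_tokens" and "capped_text" are hypothetical names for this example.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "limit_tokens": {
                    "type": "limit",
                    "max_token_count": MAX_TOKENS,
                }
            },
            "analyzer": {
                "capped_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "limit_tokens"],
                }
            },
        }
    }
}

# Print the JSON body that would be sent when creating the index.
print(json.dumps(settings, indent=2))
```

The field holding the Tika output would then use "capped_text" as its
analyzer in the mapping.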



On Thursday, October 23, 2014 3:58:03 PM UTC+1, Igor Kupczyński wrote:
>
> Hello Elasticsearch Community,
>
> We index the content of some files, using Apache Tika to extract it. 
> What I'm worried about is that some of the documents contain "junk" 
> content, like a lot of numbers in Excel. In such a case we'll pollute the 
> index with many tokens, but they won't be useful at all, as nobody will 
> search for them. The same happens if someone pastes binary data into a 
> text file.
>
> Is there a good way (in ES or external) to detect whether content may be 
> "junk"?
>
> Thanks,
> Igor
>
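
On the external side, one option might be a cheap check on the extracted text
before indexing. The sketch below is only a heuristic under my own
assumptions: the function names and the 0.5 threshold are hypothetical, and
it scores a document by the share of tokens that contain no letters. It
mainly catches number-heavy content such as spreadsheets; pasted binary data
would probably need a different signal, for example the share of
non-printable characters.

```python
import re

# Rough tokenizer: runs of word characters (letters, digits, underscore).
TOKEN_RE = re.compile(r"\w+")


def junk_score(text):
    """Return the fraction of tokens with no alphabetic character (0.0 to 1.0)."""
    tokens = TOKEN_RE.findall(text)
    if not tokens:
        return 1.0  # nothing indexable at all, treat as junk
    non_words = sum(1 for t in tokens if not any(c.isalpha() for c in t))
    return non_words / len(tokens)


def looks_like_junk(text, threshold=0.5):
    """Flag text whose share of letter-free tokens reaches the assumed threshold."""
    return junk_score(text) >= threshold
```

Documents flagged this way could be skipped, or indexed with a tighter token
cap, depending on how much recall matters.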
