Jorge Luis Betancourt Gonzalez created NUTCH-1928:
-----------------------------------------------------
Summary: Indexing filter of documents by the MIME type
Key: NUTCH-1928
URL: https://issues.apache.org/jira/browse/NUTCH-1928
Project: Nutch
Issue Type: Improvement
Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
Fix For: 1.10
This allows filtering the indexed documents by the MIME type property of the
crawled content. It lets you restrict the MIME types of the content stored in
the Solr/Elasticsearch index without restricting the crawling/parsing process,
so there is no need to use the URLFilter plugin family. It also addresses one
particular corner case: some URLs, such as certain RSS feeds
(http://www.awesomesite.com/feed), have no extension or other recognizable
format to filter on, so they end up in your index mixed with all your HTML
content.
A configuration file can be specified in the {{mimetype.filter.file}} property
in {{nutch-site.xml}}. This file uses the same format as the
{{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an
{{allow all}} policy is used instead, so all your crawled documents will be
indexed.
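As a rough sketch of the configuration described above, the property could be
set in {{nutch-site.xml}} along these lines. The file name
{{mimetype-filter.txt}} is only an assumed placeholder, not a name defined by
this issue; the property name is the one given above.

```xml
<!-- Sketch only: the value "mimetype-filter.txt" is a hypothetical file name. -->
<property>
  <name>mimetype.filter.file</name>
  <value>mimetype-filter.txt</value>
  <description>
    Rules file for the MIME type indexing filter, written in the same
    format as the urlfilter-suffix plugin. If this property is not set,
    all crawled documents are indexed.
  </description>
</property>
```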
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)