[ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1928:
----------------------------------------
    Attachment: NUTCH-1928v4.patch

[~jorgelbg] please check out this new patch. It includes all of the necessary 
additions to build.xml as well as default.properties and the plugin build 
configuration.
What is still missing is the configuration key, value and description for 
the mimetype-filter.txt file within nutch-default.xml.
Can you please add that entry?
Once this is done, this patch is ready to go in, IMHO.
Thanks Jorge.

> Indexing filter of documents by the MIME type
> ---------------------------------------------
>
>                 Key: NUTCH-1928
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1928
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, plugin
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Assignee: Jorge Luis Betancourt Gonzalez
>              Labels: filter, mime-type, plugin
>             Fix For: 1.10
>
>         Attachments: NUTCH-1928v4.patch, mimetype-patch-v3.patch
>
>
> This allows filtering the indexed documents by the MIME type property of the 
> crawled content. Basically, it lets you restrict the MIME types of the 
> content that will be stored in the Solr/Elasticsearch index without having 
> to restrict the crawling/parsing process, so there is no need to use the 
> URLFilter plugin family. It also addresses one particular corner case: 
> certain URLs don't have a file suffix to filter on, such as some RSS feeds 
> (http://www.awesomesite.com/feed), so they end up in your index mixed with 
> all your HTML content.
> A configuration file can be specified in the {{mimetype.filter.file}} 
> property in {{nutch-site.xml}}. This file uses the same format as the 
> {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an 
> {{allow all}} policy is used instead, so all your crawled documents will be 
> indexed.
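
The policy described above can be sketched roughly as follows. This is only an illustration in Python, not the plugin's Java code or its real parser. The names `load_rules` and `should_index` are hypothetical, and the rule format is a simplified assumption loosely modelled on urlfilter-suffix: `#` starts a comment, the first remaining line is `+` (index by default) or `-` (skip by default), and every later line is a MIME type treated as an exception to that default. Passing `None` stands in for a missing `mimetype.filter.file` key.

```python
from typing import Iterable, Optional, Set, Tuple

Rules = Tuple[bool, Set[str]]  # (index by default?, MIME types that are exceptions)


def load_rules(lines: Optional[Iterable[str]]) -> Rules:
    """Parse rule lines into (default_allow, exceptions).

    ASSUMPTION: simplified format, not the plugin's actual file syntax.
    None models a missing mimetype.filter.file key: allow everything.
    """
    if lines is None:
        return True, set()
    # Drop blank lines and '#' comments.
    cleaned = [ln.strip() for ln in lines]
    cleaned = [ln for ln in cleaned if ln and not ln.startswith("#")]
    if not cleaned:
        return True, set()
    if cleaned[0] not in ("+", "-"):
        raise ValueError("first rule line must be '+' or '-'")
    default_allow = cleaned[0] == "+"
    return default_allow, {ln.lower() for ln in cleaned[1:]}


def should_index(mime_type: str, rules: Rules) -> bool:
    """Return True if a document with this MIME type should be indexed."""
    default_allow, exceptions = rules
    is_exception = mime_type.lower() in exceptions
    # A listed type flips the default decision; an unlisted type keeps it.
    return default_allow != is_exception


if __name__ == "__main__":
    html_only = load_rules(["# index HTML only", "-", "text/html"])
    print(should_index("text/html", html_only))            # True
    print(should_index("application/rss+xml", html_only))  # False
    # No rules configured: the allow-all policy applies.
    print(should_index("application/rss+xml", load_rules(None)))  # True
```

With the `-` default, an RSS feed served from a suffix-less URL is still crawled and parsed but is kept out of the index, which is the corner case the description mentions.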



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
