[
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301484#comment-14301484
]
Lewis John McGibbney commented on NUTCH-1928:
---------------------------------------------
[~jorgelbg] the patch is looking much better. Thanks, and don't worry about the
coding standard; it is trivial to fix. Thank you for doing so :)
Some more points I would mention:
* Can you ensure that the patch is produced from $NUTCH_HOME and can be applied
from there? You can see how to do so
[here|http://wiki.apache.org/nutch/HowToContribute#Creating_a_patch], e.g. the
git directions.
* You are using a deprecated syntax for the JUnit tests (thank you for
providing unit tests, this is fantastic). Can you please use the imports as per
[here|https://github.com/apache/nutch/blob/trunk/src/test/org/apache/nutch/crawl/TestLinkDbMerger.java#L32-L35]?
Can you also please use the newer JUnit testing annotations as per
[here|https://github.com/apache/nutch/blob/trunk/src/test/org/apache/nutch/crawl/TestLinkDbMerger.java#L68]?
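For reference, a minimal sketch of the annotation-based style, assuming the linked lines point at the {{org.junit}} imports and the {{@Test}} annotation of JUnit 4. The class name {{TestMimeTypeExample}} and the logic under test are placeholders for illustration only; they are not taken from the patch.

```java
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical test class: the name and the set-based check are placeholders,
// not code from the NUTCH-1928 patch.
public class TestMimeTypeExample {

  private Set<String> allowed;

  // JUnit 4: @Before replaces overriding TestCase.setUp() from JUnit 3.
  @Before
  public void setUp() {
    allowed = new HashSet<String>(
        Arrays.asList("text/html", "application/xhtml+xml"));
  }

  // JUnit 4: @Test marks the test method, so the class no longer needs to
  // extend junit.framework.TestCase.
  @Test
  public void testAllowedMimeType() {
    Assert.assertTrue(allowed.contains("text/html"));
    Assert.assertFalse(allowed.contains("application/rss+xml"));
  }
}
```

The main differences from the older style are the {{org.junit.*}} imports in place of {{junit.framework.TestCase}} and the annotations in place of inheritance.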
Thank you so much [~jorgelbg]
> Indexing filter of documents by the MIME type
> ---------------------------------------------
>
> Key: NUTCH-1928
> URL: https://issues.apache.org/jira/browse/NUTCH-1928
> Project: Nutch
> Issue Type: Improvement
> Components: indexer, plugin
> Reporter: Jorge Luis Betancourt Gonzalez
> Assignee: Jorge Luis Betancourt Gonzalez
> Labels: filter, mime-type, plugin
> Fix For: 1.10
>
> Attachments: mimetype-patch-v2.patch
>
>
> This allows filtering the indexed documents by the MIME type property of the
> crawled content. Basically, this lets you restrict the MIME types of the
> content that will be stored in the Solr/Elasticsearch index without the need
> to restrict the crawling/parsing process, so there is no need to use the
> URLFilter plugin family. This also addresses one particular corner case:
> certain URLs, such as some RSS feeds (http://www.awesomesite.com/feed), have
> no format to filter on, so they end up in your index mixed with all your
> HTML content.
> A configuration file can be specified in the {{mimetype.filter.file}} property
> in {{nutch-site.xml}}. This file uses the same format as the
> {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an
> {{allow all}} policy is used instead, so all your crawled documents will be
> indexed.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)