[ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332504#comment-14332504
 ] 

Hudson commented on NUTCH-1928:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #2986 (See 
[https://builds.apache.org/job/Nutch-trunk/2986/])
NUTCH-1928 Indexing filter of documents by the MIME type (jorgelbg: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1661600)
* /nutch/trunk/build.xml
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/default.properties
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/mimetype-filter
* /nutch/trunk/src/plugin/mimetype-filter/build.xml
* /nutch/trunk/src/plugin/mimetype-filter/ivy.xml
* /nutch/trunk/src/plugin/mimetype-filter/plugin.xml
* /nutch/trunk/src/plugin/mimetype-filter/sample
* /nutch/trunk/src/plugin/mimetype-filter/sample/allow-images.txt
* /nutch/trunk/src/plugin/mimetype-filter/sample/block-html.txt
* /nutch/trunk/src/plugin/mimetype-filter/src
* /nutch/trunk/src/plugin/mimetype-filter/src/java
* /nutch/trunk/src/plugin/mimetype-filter/src/java/org
* /nutch/trunk/src/plugin/mimetype-filter/src/java/org/apache
* /nutch/trunk/src/plugin/mimetype-filter/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/mimetype-filter/src/java/org/apache/nutch/indexer
* /nutch/trunk/src/plugin/mimetype-filter/src/java/org/apache/nutch/indexer/filter
* /nutch/trunk/src/plugin/mimetype-filter/src/java/org/apache/nutch/indexer/filter/MimeTypeIndexingFilter.java
* /nutch/trunk/src/plugin/mimetype-filter/src/test
* /nutch/trunk/src/plugin/mimetype-filter/src/test/org
* /nutch/trunk/src/plugin/mimetype-filter/src/test/org/apache
* /nutch/trunk/src/plugin/mimetype-filter/src/test/org/apache/nutch
* /nutch/trunk/src/plugin/mimetype-filter/src/test/org/apache/nutch/indexer
* /nutch/trunk/src/plugin/mimetype-filter/src/test/org/apache/nutch/indexer/filter
* /nutch/trunk/src/plugin/mimetype-filter/src/test/org/apache/nutch/indexer/filter/MimeTypeIndexingFilterTest.java


> Indexing filter of documents by the MIME type
> ---------------------------------------------
>
>                 Key: NUTCH-1928
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1928
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, plugin
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Assignee: Jorge Luis Betancourt Gonzalez
>              Labels: filter, mime-type, plugin
>             Fix For: 1.10
>
>         Attachments: NUTCH-1928v4.patch, NUTCH-1928v5.patch, 
> NUTCH-1928v6.patch, mimetype-patch-v3.patch
>
>
> This allows filtering the indexed documents by the MIME type property of the 
> crawled content. It lets you restrict the MIME types of the content stored in 
> the Solr/Elasticsearch index without restricting the crawling/parsing 
> process, so there is no need to use the URLFilter plugin family. It also 
> addresses a corner case where certain URLs carry no format hint to filter on, 
> such as some RSS feeds (http://www.awesomesite.com/feed), which would 
> otherwise end up in your index mixed with all your HTML content.
> A configuration file can be specified in the {{mimetype.filter.file}} 
> property in {{nutch-site.xml}}. This file uses the same format as the 
> {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an 
> {{allow all}} policy is used instead, so all your crawled documents will be 
> indexed.
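>
> A minimal sketch of that configuration, assuming a hypothetical rules file 
> named {{mimetype-filter.txt}} on the Nutch configuration path. The file name 
> is an illustration only, and the plugin presumably also has to be enabled 
> through {{plugin.includes}}:
> {code:xml}
> <!-- nutch-site.xml -->
> <property>
>   <name>mimetype.filter.file</name>
>   <!-- hypothetical file name; point this at your own rules file -->
>   <value>mimetype-filter.txt</value>
> </property>
> {code}
> Leaving the property out keeps the {{allow all}} policy described above.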



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
