[ 
https://issues.apache.org/jira/browse/NUTCH-3091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903324#comment-17903324
 ] 

Marcos Gomez commented on NUTCH-3091:
-------------------------------------

Hi [~snagel]! Thanks for your feedback!

I created a new patch with the changes that you suggested.

Changes: 
 * Updated the code formatting.
 * In the map stage, if the delete-by-filter option is active, URLs are not 
filtered there; the filtering is done in the reduce stage instead.
 * Described the new configuration option in conf/nutch-default.xml.
 * Added the option as a command-line argument (-filterDelete) in 
IndexingJob.java.
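
The map/reduce behaviour described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attached patch: the class FilterDeleteSketch, the methods passMapStage and reduceAction, the Action enum, and the use of a plain Predicate in place of the URLFilter plugins are all hypothetical names chosen for the example.

```java
import java.util.function.Predicate;

/** Hypothetical sketch of the decision logic; not taken from the patch. */
final class FilterDeleteSketch {

  /** What the reduce stage does with a URL (hypothetical enum). */
  enum Action { INDEX, DELETE }

  /**
   * Map stage: returns true if the URL is passed on to the reduce stage.
   * When filterDelete is active, nothing is filtered here, so rejected
   * URLs still reach the reducer and can be flagged for deletion there.
   */
  static boolean passMapStage(String url, Predicate<String> urlFilter,
      boolean filter, boolean filterDelete) {
    if (filterDelete) {
      return true; // defer filtering to the reduce stage
    }
    // Plain -filter behaviour: rejected URLs are dropped, not deleted.
    return !filter || urlFilter.test(url);
  }

  /**
   * Reduce stage: with filterDelete active, a URL rejected by the filter
   * is flagged for deletion from the index; otherwise it is indexed.
   */
  static Action reduceAction(String url, Predicate<String> urlFilter,
      boolean filterDelete) {
    if (filterDelete && !urlFilter.test(url)) {
      return Action.DELETE;
    }
    return Action.INDEX;
  }

  public static void main(String[] args) {
    // Stand-in for an updated URLFilter configuration (assumption).
    Predicate<String> accept = u -> !u.contains("/private/");
    String url = "http://example.com/private/page.html";
    System.out.println(passMapStage(url, accept, true, true));
    System.out.println(reduceAction(url, accept, true));
  }
}
```

The point of deferring the filter is that a URL dropped in the map stage never reaches the reducer, so no delete request could be sent to the index for it.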

[^patch-NUTCH-3091.diff]

> Allow URL filters to flag an existing URL to delete from index
> --------------------------------------------------------------
>
>                 Key: NUTCH-3091
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3091
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, urlfilter
>    Affects Versions: 1.20
>            Reporter: Marcos Gomez
>            Priority: Major
>         Attachments: patch-NUTCH-3091.diff, patch_delete_by_url_filter.patch
>
>
> When the crawldb already contains URLs that are rejected after the 
> configuration of one of the URLFilter plugins is updated, those URLs are 
> skipped in the index phase but are not removed from the index, as is done 
> for ‘gone’ pages or ‘redirects’.
> Currently there is a ‘-filter’ flag that prevents these URLs from being 
> processed, but they are not removed. It should be possible to enable their 
> removal through a new option or parameter.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
