[
https://issues.apache.org/jira/browse/NUTCH-3091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899729#comment-17899729
]
Marcos Gomez commented on NUTCH-3091:
-------------------------------------
I created a tentative patch for this: [^patch_delete_by_url_filter.patch]
The patch adds a new option, indexer.delete.by.url.filters, with a default
value of false.
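
To make the intended behaviour concrete, here is a minimal sketch of the per-URL decision the option is meant to change. It is not the code from the attached patch and uses no Nutch classes: `UrlFilterDeleteSketch`, `Action`, and `decide` are hypothetical names, a plain `Predicate<String>` stands in for the URL filter chain, and the sketch assumes that the new option only takes effect when filtering (the '-filter' flag) is enabled.

```java
import java.util.function.Predicate;

// Hypothetical sketch only; none of these names are Nutch classes or part of the patch.
class UrlFilterDeleteSketch {

    // What the indexing step would do with a single URL.
    enum Action { INDEX, SKIP, DELETE }

    /**
     * @param url                the URL taken from the crawldb
     * @param urlFilter          stand-in for the URL filter chain; true means "accepted"
     * @param filter             models the existing '-filter' flag
     * @param deleteByUrlFilters models indexer.delete.by.url.filters (default false)
     */
    static Action decide(String url, Predicate<String> urlFilter,
                         boolean filter, boolean deleteByUrlFilters) {
        // Without filtering, or when the filters accept the URL, index as usual.
        if (!filter || urlFilter.test(url)) {
            return Action.INDEX;
        }
        // Rejected URL: skip it (current behaviour) or request deletion (assumed new behaviour).
        return deleteByUrlFilters ? Action.DELETE : Action.SKIP;
    }

    public static void main(String[] args) {
        // Example filter configuration that newly rejects everything under /private/.
        Predicate<String> urlFilter = url -> !url.contains("/private/");
        String rejected = "https://example.org/private/page.html";

        System.out.println(decide(rejected, urlFilter, true, false)); // prints SKIP
        System.out.println(decide(rejected, urlFilter, true, true));  // prints DELETE
    }
}
```

With the option left at its default of false, the sketch reproduces the current behaviour: the rejected URL is skipped and stays in the index.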
> Allow URL filters to flag an existing URL to delete from index
> --------------------------------------------------------------
>
> Key: NUTCH-3091
> URL: https://issues.apache.org/jira/browse/NUTCH-3091
> Project: Nutch
> Issue Type: New Feature
> Components: indexer, urlfilter
> Affects Versions: 1.20
> Reporter: Marcos Gomez
> Priority: Major
> Attachments: patch_delete_by_url_filter.patch
>
>
> When the crawldb already contains URLs that are rejected after the
> configuration of one of the URLFilter plugins is updated, those URLs are
> skipped in the index phase but are not removed from the index, as is done
> for ‘gone’ pages or ‘redirects’.
> Currently there is a ‘-filter’ flag that prevents these URLs from being
> processed, but they are not removed from the index. It should be possible
> to remove them by enabling a new option or parameter.
>