[ https://issues.apache.org/jira/browse/NUTCH-3091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903324#comment-17903324 ]
Marcos Gomez commented on NUTCH-3091:
-------------------------------------

Hi [~snagel]!

Thanks for your feedback! I created a new patch with the changes you suggested.

Changes:
* Updated the code formatting.
* In the map stage, if the filter-to-delete option is active, URLs are not filtered there; the filtering is done in the reduce stage instead.
* Described the new configuration option in conf/nutch-default.xml.
* Added the option as a command-line argument (-filterDelete) in IndexingJob.java.

[^patch-NUTCH-3091.diff]

> Allow URL filters to flag an existing URL to delete from index
> --------------------------------------------------------------
>
>                 Key: NUTCH-3091
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3091
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, urlfilter
>    Affects Versions: 1.20
>            Reporter: Marcos Gomez
>            Priority: Major
>         Attachments: patch-NUTCH-3091.diff, patch_delete_by_url_filter.patch
>
>
> When the configuration of one of the URLFilter plugins is updated so that it
> rejects URLs that are already in the crawldb, those URLs are not removed from
> the index during the indexing phase, unlike ‘gone’ pages or ‘redirects’.
> Currently there is a ‘-filter’ flag that prevents these URLs from being
> processed, but it does not remove them from the index. It should be possible
> to do so through a new option or parameter.
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)