[ https://issues.apache.org/jira/browse/NUTCH-3091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903045#comment-17903045 ]
Sebastian Nagel commented on NUTCH-3091:
----------------------------------------

Hi [~marcos],

thanks for the contribution!

I've tested the patch and got an NPE:
{noformat}
java.lang.Exception: java.lang.NullPointerException
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492) ~[hadoop-mapreduce-client-common-3.3.6.jar:?]
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552) [hadoop-mapreduce-client-common-3.3.6.jar:?]
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.io.Text.encode(Text.java:497) ~[hadoop-common-3.3.6.jar:?]
    at org.apache.hadoop.io.Text.set(Text.java:212) ~[hadoop-common-3.3.6.jar:?]
    at org.apache.nutch.indexer.IndexerMapReduce$IndexerMapper.map(IndexerMapReduce.java:194) ~[apache-nutch-1.21-SNAPSHOT.jar:?]
    at org.apache.nutch.indexer.IndexerMapReduce$IndexerMapper.map(IndexerMapReduce.java:155) ~[apache-nutch-1.21-SNAPSHOT.jar:?]
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146) ~[hadoop-mapreduce-client-core-3.3.6.jar:?]
{noformat}
{code:java}
if (urlString == null && !filterDelete) {
  return;
} else {
  key.set(urlString); // << NPE thrown here if URL is filtered (is null) and filterDelete == true
}
{code}

Could you fix this NPE? And I'd strongly recommend testing the updated patch beforehand. You do not need to set up Solr - there's indexer-dummy, which makes it very easy to verify whether the index additions or deletions are the expected ones. I've just run it on some test data:
{code:bash}
bin/nutch index -Dindexer.delete.by.url.filters=true -Dplugin.includes='indexer-dummy|indexing-basic|urlfilter-regex' crawldb -dir segments
{code}

Further remarks:
- the indentation does not correspond to our [code-formatting template|https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml]. Could you apply the template? Otherwise let us know and we can do it before committing the patch. Thanks!
- some minimal documentation is required.
At least, the new property needs to be described in conf/nutch-default.xml with a default value (false)
- eventually, it might make sense to expose the property as a command-line argument of the "index" job (IndexingJob.java)

> Allow URL filters to flag an existing URL to delete from index
> --------------------------------------------------------------
>
>                 Key: NUTCH-3091
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3091
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, urlfilter
>    Affects Versions: 1.20
>            Reporter: Marcos Gomez
>            Priority: Major
>         Attachments: patch_delete_by_url_filter.patch
>
>
> When the configuration of one of the URLFilter plugins is updated, URLs
> that are already in the crawldb may now be rejected in the index phase,
> but they are not removed from the index, as is done with ‘gone’ pages or
> ‘redirects’.
> Currently there is a ‘-filter’ flag that prevents these URLs from being
> processed, but it does not remove them. It should be possible to request
> their removal via a new option or parameter.
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)