[jira] [Commented] (NUTCH-3091) Allow URL filters to flag an existing URL to delete from index

Sebastian Nagel (Jira) Wed, 04 Dec 2024 08:31:14 -0800


    [ 
https://issues.apache.org/jira/browse/NUTCH-3091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903045#comment-17903045
 ]


Sebastian Nagel commented on NUTCH-3091:
----------------------------------------

Hi [~marcos], thanks for the contribution!

I've tested the patch and got a NPE:
{noformat}
java.lang.Exception: java.lang.NullPointerException
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492) ~[hadoop-mapreduce-client-common-3.3.6.jar:?]
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552) [hadoop-mapreduce-client-common-3.3.6.jar:?]
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.io.Text.encode(Text.java:497) ~[hadoop-common-3.3.6.jar:?]
        at org.apache.hadoop.io.Text.set(Text.java:212) ~[hadoop-common-3.3.6.jar:?]
        at org.apache.nutch.indexer.IndexerMapReduce$IndexerMapper.map(IndexerMapReduce.java:194) ~[apache-nutch-1.21-SNAPSHOT.jar:?]
        at org.apache.nutch.indexer.IndexerMapReduce$IndexerMapper.map(IndexerMapReduce.java:155) ~[apache-nutch-1.21-SNAPSHOT.jar:?]
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146) ~[hadoop-mapreduce-client-core-3.3.6.jar:?]
{noformat}
{code:java}
if (urlString == null && !filterDelete) {
    return;
} else {
    // << NPE thrown here if the URL is filtered (urlString is null) and filterDelete == true
    key.set(urlString);
}
{code}
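One possible shape for the fix, sketched without any Hadoop or Nutch dependencies so it can be run on its own. All names here (FilterDeleteSketch, Decision, decide) are hypothetical and not taken from the patch, and using the original, unfiltered URL as the key of the delete action is an assumption on my side:

```java
import java.util.function.UnaryOperator;

/** Hypothetical sketch, not the Nutch code: decide what to do with a URL after filtering. */
class FilterDeleteSketch {

  enum Action { SKIP, DELETE, INDEX }

  /** The chosen action plus the key to emit (null only for SKIP). */
  static final class Decision {
    final Action action;
    final String key;

    Decision(Action action, String key) {
      this.action = action;
      this.key = key;
    }
  }

  /**
   * @param originalUrl  URL as stored in the crawldb / segment
   * @param urlFilter    returns the (possibly rewritten) URL, or null if the URL is rejected
   * @param filterDelete whether rejected URLs should be flagged for deletion
   */
  static Decision decide(String originalUrl, UnaryOperator<String> urlFilter,
      boolean filterDelete) {
    String filtered = urlFilter.apply(originalUrl);
    if (filtered != null) {
      // URL passed the filters: index it under the filtered form.
      return new Decision(Action.INDEX, filtered);
    }
    if (!filterDelete) {
      // Rejected and deletion not requested: drop the record silently.
      return new Decision(Action.SKIP, null);
    }
    // Rejected and deletion requested: never pass the null result on as key.
    // Assumption: the original URL identifies the document to delete.
    return new Decision(Action.DELETE, originalUrl);
  }

  public static void main(String[] args) {
    UnaryOperator<String> rejectAll = url -> null;
    Decision d = decide("https://example.org/old", rejectAll, true);
    System.out.println(d.action + " " + d.key); // prints "DELETE https://example.org/old"
  }
}
```

The point is only that the null check and the filterDelete check are handled separately, so that a null URL never reaches key.set().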
Could you fix this NPE? And I'd strongly recommend testing the updated patch 
beforehand. You do not need to set up Solr - there's indexer-dummy, which makes 
it very easy to verify whether index additions or deletions are the expected 
ones. I've just run it on some test data:
{code:bash}
bin/nutch index -Dindexer.delete.by.url.filters=true \
  -Dplugin.includes='indexer-dummy|indexing-basic|urlfilter-regex' \
  crawldb -dir segments
{code}

Further remarks:
- the indentation does not correspond to our [code-formatting 
template|https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml]. 
Could you apply the template? Otherwise let us know and we can do it before 
committing the patch. Thanks!
- some minimal documentation is required. At least, the new property needs to 
be described in conf/nutch-default.xml with a default value (false).
- it might also be worth exposing the new property as a command-line 
argument of the "index" job (IndexingJob.java).
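
For the documentation remark, an entry in conf/nutch-default.xml might look roughly like the following. The property name and the default value are the ones used above; the description text is only a suggested wording and an assumption about the intended behavior, not taken from the patch:

```xml
<property>
  <name>indexer.delete.by.url.filters</name>
  <value>false</value>
  <description>Suggested wording only: if true, documents whose URLs are
  rejected by the configured URL filters are deleted from the index during
  the index job.
  </description>
</property>
```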

> Allow URL filters to flag an existing URL to delete from index
> --------------------------------------------------------------
>
>                 Key: NUTCH-3091
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3091
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, urlfilter
>    Affects Versions: 1.20
>            Reporter: Marcos Gomez
>            Priority: Major
>         Attachments: patch_delete_by_url_filter.patch
>
>
> When the configuration of one of the URLFilter plugins is updated so that 
> URLs already in the crawldb are now rejected, these URLs are filtered in the 
> index phase, but they are not removed from the index as is done with ‘gone’ 
> pages or ‘redirects’.
> Currently there is a ‘-filter’ flag that prevents these URLs from being 
> processed, but they are not removed. It should be possible to request their 
> removal with a new option or parameter.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
