[jira] [Commented] (NUTCH-2456) Allow to index pages/URLs not contained in CrawlDb

ASF GitHub Bot (JIRA) Wed, 08 Nov 2017 06:47:14 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244071#comment-16244071 ]


ASF GitHub Bot commented on NUTCH-2456:
---------------------------------------

sebastian-nagel commented on a change in pull request #240: NUTCH-2456 - Redirected documents are not indexed
URL: https://github.com/apache/nutch/pull/240#discussion_r149685844
 
 

 ##########
 File path: src/java/org/apache/nutch/indexer/IndexerMapReduce.java
 ##########
 @@ -256,20 +256,19 @@ public void reduce(Text key, Iterator<NutchWritable> values,
       }
 
 Review comment:
   dbDatum is already used above (line 240 and following) when deleting gone 
pages and redirects. A deletion is sent if either fetchDatum or dbDatum has 
the matching status (gone or redirect, respectively). Why not relax these 
conditions as well, so that dbDatum is optional? fetchDatum should still be 
required; otherwise every index job will send deletions for **all** 
404s/redirects in CrawlDb, including those already deleted in previous rounds.
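
The relaxed deletion check suggested above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the Nutch code: the `Status` enum and the `shouldDelete` helper are hypothetical stand-ins for the CrawlDatum status constants and the checks inside `reduce`, and a `null` argument stands for a missing fetchDatum or dbDatum.

```java
// Hypothetical sketch: a simplified stand-in for the deletion checks in reduce().
final class DeletionSketch {
    // Assumed, simplified statuses; Nutch's CrawlDatum defines its own constants.
    enum Status { FETCH_SUCCESS, FETCH_GONE, FETCH_REDIR, DB_FETCHED, DB_GONE, DB_REDIR }

    // fetchStatus == null means the URL was not fetched in this segment;
    // dbStatus == null means the URL has no CrawlDb entry (e.g. a redirect target).
    static boolean shouldDelete(Status fetchStatus, Status dbStatus) {
        if (fetchStatus == null) {
            // Without a fetch in this round, do not resend old deletions.
            return false;
        }
        // Enum comparison with == is null-safe, so dbStatus may be null here.
        boolean gone = fetchStatus == Status.FETCH_GONE || dbStatus == Status.DB_GONE;
        boolean redirect = fetchStatus == Status.FETCH_REDIR || dbStatus == Status.DB_REDIR;
        return gone || redirect;
    }

    public static void main(String[] args) {
        // Fetched as gone, no CrawlDb entry: a deletion is sent.
        System.out.println(shouldDelete(Status.FETCH_GONE, null));   // prints "true"
        // Gone in CrawlDb but not fetched in this round: no deletion is resent.
        System.out.println(shouldDelete(null, Status.DB_GONE));      // prints "false"
    }
}
```

Requiring a fetch status keeps each index job limited to URLs touched in the current segment, which is the concern raised about repeated deletions.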

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Allow to index pages/URLs not contained in CrawlDb
> --------------------------------------------------
>
>                 Key: NUTCH-2456
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2456
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Critical
>
> If http.redirect.max is set to a positive value, the Fetcher will follow 
> redirects, creating a new CrawlDatum.
> If the redirected URL is fetched and parsed, indexing it hits a special 
> case: dbDatum is null. This means that in 
> [https://github.com/apache/nutch/blob/6199492f5e1e8811022257c88dbf63f1e1c739d0/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L259]
>  the document is not indexed, as it is assumed it only has inlinks (actually 
> it has everything but dbDatum).
> I'm not sure what the correct fix is here. It seems to me the condition 
> should use AND instead of OR anyway, but I may not understand the original 
> intent. It is clear that it is too strict as is.
> However, the code following that line assumes all 4 objects are not null, so 
> a patch would need to change more than just the condition.
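
The guard described in the issue, and one possible relaxation, can be sketched as below. This is only an illustration under stated assumptions, not the patch under review: the class and method names are hypothetical, plain `Object` parameters stand in for the Nutch types, and the assumption that the four objects are fetchDatum, dbDatum, parseText and parseData goes beyond what the issue text names explicitly.

```java
// Hypothetical sketch of the indexing guard; null means "object missing".
final class IndexGuardSketch {
    // Strict guard as described in the issue: any missing object skips the document.
    static boolean strictSkips(Object fetchDatum, Object dbDatum,
                               Object parseText, Object parseData) {
        return fetchDatum == null || dbDatum == null
            || parseText == null || parseData == null;
    }

    // Assumed relaxation: dbDatum is optional, the other three are still required.
    static boolean relaxedSkips(Object fetchDatum, Object parseText, Object parseData) {
        return fetchDatum == null || parseText == null || parseData == null;
    }

    public static void main(String[] args) {
        Object present = new Object();
        // Redirect target: fetched and parsed, but no CrawlDb entry (dbDatum == null).
        System.out.println(strictSkips(present, null, present, present)); // prints "true"
        System.out.println(relaxedSkips(present, present, present));      // prints "false"
    }
}
```

With a relaxed guard of this kind, any later code that reads dbDatum would also need its own null check, as the issue points out.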



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
