Yossi Tamari created NUTCH-2456:
-----------------------------------

             Summary: Redirected documents are not indexed
                 Key: NUTCH-2456
                 URL: https://issues.apache.org/jira/browse/NUTCH-2456
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.13
            Reporter: Yossi Tamari
            Priority: Critical


If http.redirect.max is set to a positive value, the Fetcher follows 
redirects and creates a new CrawlDatum for the redirect target.
When the redirect target is fetched and parsed, indexing it hits a special 
case: dbDatum is null. As a result, in 
[https://github.com/apache/nutch/blob/6199492f5e1e8811022257c88dbf63f1e1c739d0/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L259]
 the document is not indexed, because the code assumes it only has inlinks 
(in fact it has everything except dbDatum).
I'm not sure what the correct fix is. It seems to me the condition should use 
AND instead of OR anyway, but I may not understand the original intent. What 
is clear is that the condition is too strict as it stands.
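
To make the OR versus AND difference concrete, here is a minimal Java sketch. 
It is not the Nutch source: the class and method names are hypothetical, and 
apart from dbDatum the parameter names (fetchDatum, parseText, parseData) are 
assumptions about which values the check at the linked line inspects.

```java
// Hypothetical, simplified stand-in for the null check discussed above.
// Only dbDatum is named in the report; the other parameters are assumed.
class RedirectIndexCheck {

    // Strict variant: skip the document if ANY input is missing (OR).
    static boolean skipIfAnyMissing(Object fetchDatum, Object dbDatum,
                                    Object parseText, Object parseData) {
        return fetchDatum == null || dbDatum == null
            || parseText == null || parseData == null;
    }

    // Lenient variant: skip the document only if ALL inputs are missing (AND).
    static boolean skipIfAllMissing(Object fetchDatum, Object dbDatum,
                                    Object parseText, Object parseData) {
        return fetchDatum == null && dbDatum == null
            && parseText == null && parseData == null;
    }

    public static void main(String[] args) {
        // A redirect target as described: everything present except dbDatum.
        Object fetchDatum = "fetch";
        Object dbDatum = null;
        Object parseText = "text";
        Object parseData = "data";

        System.out.println("OR skips:  "
            + skipIfAnyMissing(fetchDatum, dbDatum, parseText, parseData));
        System.out.println("AND skips: "
            + skipIfAllMissing(fetchDatum, dbDatum, parseText, parseData));
    }
}
```

With only dbDatum missing, the OR variant skips the document and the AND 
variant does not. Whether AND is the right fix for Nutch is a separate 
question, since later indexing code may still need some of these values.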



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
