[ 
https://issues.apache.org/jira/browse/NUTCH-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876031#comment-13876031
 ] 

Sebastian Nagel commented on NUTCH-1706:
----------------------------------------

Hi [~markus17], comparing the patches I found two differences:
# Your patch ({{NUTCH-1706-trunk.patch}}) moves the following check
{code}
if (fetchDatum == null || dbDatum == null
    || parseText == null || parseData == null) {
   return;                                             // only have inlinks
}
{code}
after the deletion of redirects. If a redirect lacks content and parse
data and the check runs first, the redirect may not be removed properly. So
your patch seems right here.
# {{NUTCH-1646-3.patch}} keeps the lines
{code}
// don't index unmodified (empty) pages
if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
  fetchDatum = datum;
{code}
With these lines, fetchDatum is not set (or overwritten) by not-modified
responses to If-Modified-Since requests. Isn't that correct? If we have
multiple segments, we need to pass the CrawlDatum of the "real" fetch to get,
e.g., the correct fetch time. Also, pages whose fetchDatum has a status other
than FETCH_SUCCESS are skipped further down in IndexerMapReduce. If we index
each segment separately it should make no difference, but when indexing
several segments at once we have to take care to pick the right fetchDatum.
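
To make both points concrete, here is a minimal, self-contained sketch of the reduce logic as I understand it. It is not the actual IndexerMapReduce code: the {{Status}} and {{Action}} enums, the {{Datum}} class and the {{reduce}} and {{selectFetchDatum}} methods are hypothetical stand-ins that only mimic the roles of CrawlDatum and its statuses.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified model of the reduce step; not Nutch code. */
class ReducerSketch {

    // Stand-ins for a few CrawlDatum statuses (assumed names, not the real constants).
    enum Status { FETCH_SUCCESS, FETCH_NOTMODIFIED, FETCH_REDIR_TEMP, DB_FETCHED, DB_REDIR_TEMP }

    enum Action { INDEX, DELETE, SKIP }

    // One value seen by the reducer: a status plus the fetch time of its segment.
    static final class Datum {
        final Status status;
        final long fetchTime;

        Datum(Status status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }

        boolean isDb() {
            return status == Status.DB_FETCHED || status == Status.DB_REDIR_TEMP;
        }
    }

    // Point 2: a not-modified response never sets or overwrites fetchDatum,
    // so the datum of the "real" fetch (and its fetch time) survives.
    static Datum selectFetchDatum(List<Datum> values) {
        Datum fetchDatum = null;
        for (Datum datum : values) {
            if (!datum.isDb() && datum.status != Status.FETCH_NOTMODIFIED) {
                fetchDatum = datum;
            }
        }
        return fetchDatum;
    }

    static Action reduce(List<Datum> values, String parseText, String parseData) {
        // Gather all reducer values first.
        Datum dbDatum = null;
        for (Datum datum : values) {
            if (datum.isDb()) {
                dbDatum = datum;
            }
        }
        Datum fetchDatum = selectFetchDatum(values);

        // Point 1: delete redirects BEFORE the completeness check, because a
        // redirect usually has neither parse text nor parse data.
        boolean dbRedirect = dbDatum != null && dbDatum.status == Status.DB_REDIR_TEMP;
        boolean fetchRedirect = fetchDatum != null && fetchDatum.status == Status.FETCH_REDIR_TEMP;
        if (dbRedirect || fetchRedirect) {
            return Action.DELETE;
        }

        // Only now bail out if something is missing (e.g. only inlinks).
        if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
            return Action.SKIP;
        }

        // Anything other than a successful fetch is not indexed.
        return fetchDatum.status == Status.FETCH_SUCCESS ? Action.INDEX : Action.SKIP;
    }

    public static void main(String[] args) {
        List<Datum> redirect = Arrays.asList(
            new Datum(Status.FETCH_REDIR_TEMP, 50L),
            new Datum(Status.DB_REDIR_TEMP, 50L));
        System.out.println(reduce(redirect, null, null)); // prints DELETE

        List<Datum> twoSegments = Arrays.asList(
            new Datum(Status.FETCH_SUCCESS, 100L),      // first segment: real fetch
            new Datum(Status.FETCH_NOTMODIFIED, 200L),  // second segment: not modified
            new Datum(Status.DB_FETCHED, 200L));
        System.out.println(selectFetchDatum(twoSegments).fetchTime); // prints 100
        System.out.println(reduce(twoSegments, "text", "data"));     // prints INDEX
    }
}
```

In this sketch, a redirect without parse data is still deleted, and with two segments the fetch time comes from the segment that actually fetched the page, not from the later not-modified response.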

> IndexerMapReduce does not remove db_redir_temp etc
> --------------------------------------------------
>
>                 Key: NUTCH-1706
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1706
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.7
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Blocker
>             Fix For: 1.8
>
>         Attachments: NUTCH-1706-trunk.patch
>
>
> Code path is wrong in IndexerMapReduce, the delete code should be located 
> after all reducer values have been gathered.



