[ 
https://issues.apache.org/jira/browse/NUTCH-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904106#comment-13904106
 ] 

Markus Jelsma commented on NUTCH-1706:
--------------------------------------

Yes, I see the problem. Similar to the latest patch for the SegmentMerger, 
I should also consider the retry status. Our checkout of SegmentMerger does:

{code}
        } else if (sp.partName.equals(CrawlDatum.FETCH_DIR_NAME)) {
          // only consider fetch status and ignore fetch retry status
          // https://issues.apache.org/jira/browse/NUTCH-1520
          // https://issues.apache.org/jira/browse/NUTCH-1113
          if (CrawlDatum.hasFetchStatus(val)
              && val.getStatus() != CrawlDatum.STATUS_FETCH_RETRY) {
            if (lastF == null) {
              lastF = val;
              lastFname = sp.segmentName;
            } else {
              if (lastFname.compareTo(sp.segmentName) < 0) {
                lastF = val;
                lastFname = sp.segmentName;
              }
            }
          }
        }
{code}
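
For illustration, the selection logic above can be reduced to a standalone sketch that runs without Nutch on the classpath. The Status enum, the Entry class and pickLatest are hypothetical stand-ins and not part of the Nutch API. The sketch also assumes that segment names are timestamp strings, so that comparing them lexicographically orders them by age.

```java
import java.util.Arrays;
import java.util.List;

public class LatestFetchSketch {
  // Hypothetical stand-ins for CrawlDatum status values; not Nutch's real constants.
  enum Status { FETCH_SUCCESS, FETCH_RETRY, FETCH_GONE }

  // Hypothetical pairing of a segment name with the fetch status found in it.
  static final class Entry {
    final String segmentName;
    final Status status;

    Entry(String segmentName, Status status) {
      this.segmentName = segmentName;
      this.status = status;
    }
  }

  // Returns the entry from the lexicographically greatest segment name,
  // skipping retry entries; returns null when no entry qualifies.
  static Entry pickLatest(List<Entry> entries) {
    Entry last = null;
    for (Entry e : entries) {
      if (e.status == Status.FETCH_RETRY) {
        continue; // a retry says nothing final about the page
      }
      if (last == null || last.segmentName.compareTo(e.segmentName) < 0) {
        last = e;
      }
    }
    return last;
  }

  public static void main(String[] args) {
    List<Entry> entries = Arrays.asList(
        new Entry("20140216101500", Status.FETCH_SUCCESS),
        new Entry("20140217101500", Status.FETCH_GONE),
        new Entry("20140218101500", Status.FETCH_RETRY));
    Entry latest = pickLatest(entries);
    // prints "20140217101500 FETCH_GONE"
    System.out.println(latest.segmentName + " " + latest.status);
  }
}
```

The newest segment only holds a retry, so the sketch falls back to the newest segment with a conclusive fetch status, which is the behaviour the SegmentMerger check aims for.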

Do you want to include a fix for indexing multiple segments for this issue? 

> IndexerMapReduce does not remove db_redir_temp etc
> --------------------------------------------------
>
>                 Key: NUTCH-1706
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1706
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.7
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Blocker
>             Fix For: 1.8
>
>         Attachments: NUTCH-1706-trunk-v2.patch, NUTCH-1706-trunk.patch, 
> nutch-1706-testdata.tgz
>
>
> The code path in IndexerMapReduce is wrong: the delete code should be 
> located after all reducer values have been gathered.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
