[
https://issues.apache.org/jira/browse/NUTCH-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904106#comment-13904106
]
Markus Jelsma commented on NUTCH-1706:
--------------------------------------
Yes, I see the problem. Similar to the latest patch for the SegmentMerger,
I should also consider the fetch retry status. Our checkout of SegmentMerger does:
{code}
} else if (sp.partName.equals(CrawlDatum.FETCH_DIR_NAME)) {
  // only consider fetch status and ignore fetch retry status
  // https://issues.apache.org/jira/browse/NUTCH-1520
  // https://issues.apache.org/jira/browse/NUTCH-1113
  if (CrawlDatum.hasFetchStatus(val)
      && val.getStatus() != CrawlDatum.STATUS_FETCH_RETRY) {
    if (lastF == null) {
      lastF = val;
      lastFname = sp.segmentName;
    } else {
      if (lastFname.compareTo(sp.segmentName) < 0) {
        lastF = val;
        lastFname = sp.segmentName;
      }
    }
  }
}
{code}
Do you want to include a fix for indexing multiple segments for this issue?
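
To make the selection rule in the SegmentMerger snippet easier to discuss, here is a minimal standalone sketch of the same idea: among fetch entries from several segments, skip retry entries and keep the one from the lexicographically latest segment name. The Entry class, the status constants and the segment names are hypothetical stand-ins for this sketch, not the real CrawlDatum API or values.

```java
import java.util.Arrays;
import java.util.List;

public class LatestFetchDatum {
    // Hypothetical status codes for this sketch; not the real CrawlDatum constants.
    static final int FETCH_SUCCESS = 1;
    static final int FETCH_RETRY = 2;
    static final int FETCH_GONE = 3;

    /** Minimal stand-in for a fetch entry read from one segment. */
    static final class Entry {
        final String segmentName;
        final int status;

        Entry(String segmentName, int status) {
            this.segmentName = segmentName;
            this.status = status;
        }
    }

    /**
     * Returns the non-retry entry from the segment whose name sorts last,
     * or null if every entry is a retry (or the list is empty).
     */
    static Entry latestNonRetry(List<Entry> entries) {
        Entry last = null;
        for (Entry e : entries) {
            if (e.status == FETCH_RETRY) {
                continue; // ignore fetch retry status, as in the snippet above
            }
            if (last == null || last.segmentName.compareTo(e.segmentName) < 0) {
                last = e;
            }
        }
        return last;
    }

    public static void main(String[] args) {
        // Illustrative timestamp-style segment names.
        List<Entry> entries = Arrays.asList(
            new Entry("20140210093000", FETCH_SUCCESS),
            new Entry("20140215101500", FETCH_RETRY),
            new Entry("20140212120000", FETCH_GONE));
        Entry latest = latestNonRetry(entries);
        // The newest segment only holds a retry, so the entry before it wins.
        System.out.println(latest.segmentName + " " + latest.status);
    }
}
```

With this input the retry entry from the newest segment is skipped, so the entry from segment 20140212120000 is selected. The string comparison only works as "newest wins" under the assumption that segment names sort chronologically.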
> IndexerMapReduce does not remove db_redir_temp etc
> --------------------------------------------------
>
> Key: NUTCH-1706
> URL: https://issues.apache.org/jira/browse/NUTCH-1706
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Blocker
> Fix For: 1.8
>
> Attachments: NUTCH-1706-trunk-v2.patch, NUTCH-1706-trunk.patch,
> nutch-1706-testdata.tgz
>
>
> Code path is wrong in IndexerMapReduce, the delete code should be located
> after all reducer values have been gathered.
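
The ordering problem described in the issue can be sketched outside of Nutch as a two-phase reduce: first gather every value for a key, then decide whether to delete. Everything below is a hypothetical illustration (the Value class, the Kind and Action enums, and the exact set of statuses that trigger a delete are assumptions), not the IndexerMapReduce code or the attached patch.

```java
import java.util.Arrays;
import java.util.List;

public class DeleteAfterGather {
    // Hypothetical value kinds and outcomes for this sketch.
    enum Kind { DB_STATUS, FETCH_STATUS, PARSE_DATA }
    enum Action { INDEX, DELETE, SKIP }

    /** Minimal stand-in for one reducer value belonging to a URL. */
    static final class Value {
        final Kind kind;
        final String status;

        Value(Kind kind, String status) {
            this.kind = kind;
            this.status = status;
        }
    }

    /** Decides what to do with a URL only after all of its values were seen. */
    static Action decide(List<Value> values) {
        String dbStatus = null;
        String fetchStatus = null;
        boolean hasParse = false;

        // Phase 1: gather. No decision is taken here, because the values
        // may arrive in any order.
        for (Value v : values) {
            switch (v.kind) {
                case DB_STATUS:    dbStatus = v.status;    break;
                case FETCH_STATUS: fetchStatus = v.status; break;
                case PARSE_DATA:   hasParse = true;        break;
            }
        }

        // Phase 2: decide. Assumed rule for this sketch: redirects and gone
        // pages are deleted regardless of what else was gathered.
        if ("db_redir_temp".equals(dbStatus)
                || "db_redir_perm".equals(dbStatus)
                || "db_gone".equals(dbStatus)) {
            return Action.DELETE;
        }
        if (fetchStatus == null || !hasParse) {
            return Action.SKIP; // not enough data to build a document
        }
        return Action.INDEX;
    }

    public static void main(String[] args) {
        List<Value> dbFirst = Arrays.asList(
            new Value(Kind.DB_STATUS, "db_redir_temp"),
            new Value(Kind.FETCH_STATUS, "fetch_success"),
            new Value(Kind.PARSE_DATA, null));
        List<Value> dbLast = Arrays.asList(
            new Value(Kind.PARSE_DATA, null),
            new Value(Kind.FETCH_STATUS, "fetch_success"),
            new Value(Kind.DB_STATUS, "db_redir_temp"));
        // Both orderings give the same result because the check runs after the loop.
        System.out.println(decide(dbFirst) + " " + decide(dbLast));
    }
}
```

Because the delete check runs after the loop, the result does not depend on the order in which the values arrive. A check placed inside the loop could miss a status that shows up late.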