[ https://issues.apache.org/jira/browse/NUTCH-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876031#comment-13876031 ]
Sebastian Nagel commented on NUTCH-1706:
----------------------------------------
Hi [~markus17], by comparing the patches I found two differences:
# Your patch ({{NUTCH-1706-trunk.patch}}) moves the following check
{code}
if (fetchDatum == null || dbDatum == null
    || parseText == null || parseData == null) {
  return; // only have inlinks
}
{code}
after the deletion of redirects. If a redirect lacks content and parse data,
running this check first may prevent it from being removed properly. So your
patch seems right here.
# {{NUTCH-1646-3.patch}} keeps the lines
{code}
// don't index unmodified (empty) pages
if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
  fetchDatum = datum;
{code}
With these lines, fetchDatum is not set (or overwritten) by if-not-modified-since
responses. Isn't that correct? If we have multiple segments, we need to pass the
CrawlDatum of the "real" fetch to get, e.g., the correct fetch time. Also, pages
whose fetchDatum has a status other than FETCH_SUCCESS are skipped further down
in IndexerMapReduce. If we index each segment separately it should make no
difference, but when indexing several segments at once we have to take care to
pick the right fetchDatum.
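
To make the ordering in both points concrete, here is a minimal, self-contained sketch. It does not use the Nutch classes: {{ReducerSketch}}, {{Datum}}, {{Status}}, {{Action}}, {{selectFetchDatum}} and {{decide}} are hypothetical stand-ins, and the status names only mirror the CrawlDatum constants. It assumes redirects are deleted unconditionally, which simplifies whatever the real reducer does.

```java
import java.util.List;

/** Simplified model of the reducer ordering discussed above; NOT the Nutch API. */
final class ReducerSketch {

    /** Hypothetical subset of statuses, named after the CrawlDatum constants. */
    enum Status {
        DB_FETCHED, DB_REDIR_TEMP, DB_REDIR_PERM,
        FETCH_SUCCESS, FETCH_NOTMODIFIED, FETCH_REDIR_TEMP, FETCH_REDIR_PERM
    }

    enum Action { DELETE, SKIP, INDEX }

    /** Stand-in for a CrawlDatum: only a status and a fetch time. */
    static final class Datum {
        final Status status;
        final long fetchTime;

        Datum(Status status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }

        boolean hasDbStatus() { return status.name().startsWith("DB_"); }

        boolean isRedirect() { return status.name().contains("REDIR"); }
    }

    /** Point 2: a not-modified response never sets or overwrites fetchDatum. */
    static Datum selectFetchDatum(List<Datum> values) {
        Datum fetchDatum = null;
        for (Datum d : values) {
            if (!d.hasDbStatus() && d.status != Status.FETCH_NOTMODIFIED) {
                fetchDatum = d;
            }
        }
        return fetchDatum;
    }

    static Action decide(List<Datum> values, boolean hasParseText, boolean hasParseData) {
        // Gather all reducer values before taking any decision.
        Datum dbDatum = null;
        for (Datum d : values) {
            if (d.hasDbStatus()) {
                dbDatum = d;
            }
        }
        Datum fetchDatum = selectFetchDatum(values);

        // Point 1: handle redirects BEFORE the completeness check, so a
        // redirect without content or parse data is still removed.
        if ((dbDatum != null && dbDatum.isRedirect())
                || (fetchDatum != null && fetchDatum.isRedirect())) {
            return Action.DELETE;
        }

        // Only now bail out if something is missing (e.g. only inlinks).
        if (fetchDatum == null || dbDatum == null || !hasParseText || !hasParseData) {
            return Action.SKIP;
        }

        // Anything that was not fetched successfully is not indexed.
        if (fetchDatum.status != Status.FETCH_SUCCESS) {
            return Action.SKIP;
        }
        return Action.INDEX;
    }
}
```

In this sketch, a lone {{DB_REDIR_TEMP}} datum without parse data yields {{DELETE}}, and a {{FETCH_SUCCESS}} followed by a {{FETCH_NOTMODIFIED}} from a later segment keeps the fetch time of the successful fetch.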
> IndexerMapReduce does not remove db_redir_temp etc
> --------------------------------------------------
>
> Key: NUTCH-1706
> URL: https://issues.apache.org/jira/browse/NUTCH-1706
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Blocker
> Fix For: 1.8
>
> Attachments: NUTCH-1706-trunk.patch
>
>
> Code path is wrong in IndexerMapReduce, the delete code should be located
> after all reducer values have been gathered.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)