[ https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509596#comment-16509596 ]
Sebastian Nagel commented on NUTCH-2565:
----------------------------------------

My first thought was to make the condition in calculateLastFetchTime(datum) more strict:
{code}
if (datum.getStatus() == CrawlDatum.STATUS_DB_UNFETCHED
    && datum.getRetriesSinceFetch() == 0) {
  return 0L;
}
{code}
This guarantees that we do not prefer an older DB_FETCHED over a newer DB_UNFETCHED with a "transient" failure. If two DB_UNFETCHED records with retries > 0 are to be merged, it is important that
# the fetch time is the latest one (for scheduling)
# yes, we could sum the retry counts, but then we would also need to trigger a status change if retries > db.fetch.retry.max. We would also have to make sure not to cause a retry counter overflow (it is only a signed byte) if many CrawlDbs are merged.
In short, this looks too complex to me. What do you think?

> MergeDB incorrectly handles unfetched CrawlDatums
> -------------------------------------------------
>
>                 Key: NUTCH-2565
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2565
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Jurian Broertjes
>            Priority: Minor
>
> I ran into this issue when merging a CrawlDb originating from sitemaps into
> our normal CrawlDb. CrawlDatums are merged based on the output of
> AbstractFetchSchedule::calculateLastFetchTime(). When CrawlDatums are
> unfetched, this can overwrite the fetchTime or other fields.
> I assume this is a bug and have a simple fix for it that checks whether the
> CrawlDatum has status db_unfetched.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
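Editor's note: as an illustration of the overflow concern raised in the comment above, here is a minimal, hypothetical sketch (not Nutch code; the helper names and the RETRY_MAX stand-in for db.fetch.retry.max are assumptions) of how summed retry counters could be saturated so the signed-byte field in CrawlDatum cannot wrap around when many CrawlDbs are merged:

```java
public class RetryMergeSketch {

    // Stand-in for the db.fetch.retry.max setting (assumed value).
    static final int RETRY_MAX = 3;

    // Sum two non-negative retry counters, saturating at Byte.MAX_VALUE
    // so the signed-byte retry field cannot overflow, no matter how many
    // CrawlDbs contribute to the merge.
    static byte mergeRetries(byte a, byte b) {
        int sum = a + b;
        return (byte) Math.min(sum, Byte.MAX_VALUE);
    }

    // A merged datum whose retry count exceeds the configured maximum
    // would additionally need its status changed (retries exhausted),
    // which is part of what makes the summing approach complex.
    static boolean retriesExhausted(byte retries) {
        return retries > RETRY_MAX;
    }

    public static void main(String[] args) {
        // Saturates instead of wrapping to a negative value:
        System.out.println(mergeRetries((byte) 100, (byte) 100)); // 127
        System.out.println(retriesExhausted(mergeRetries((byte) 2, (byte) 2))); // true
    }
}
```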