[ https://issues.apache.org/jira/browse/NUTCH-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280072#comment-15280072 ]
ASF GitHub Bot commented on NUTCH-2164:
---------------------------------------
GitHub user sebastian-nagel opened a pull request:
https://github.com/apache/nutch/pull/108
NUTCH-2164 NUTCH-2242 Inconsistent 'Modified Time' in crawl db / last…
…Modified not always set
- set modified time (time of last successful fetch) by
  DefaultFetchSchedule and AdaptiveFetchSchedule
  but only if the document is actually modified
- update unit tests to check whether modification time is properly set
- set modified time (sent by responding server in HTTP header) in
  ProtocolOutput:
  FetchSchedule implementations can access the HTTP modified time from
  CrawlDatum's metadata (PROTO_STATUS_KEY = "_pst_")
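
The first bullet above can be read as the following rule. This is only a minimal sketch, not the code from the pull request: the class name, the State enum and the method resolveModifiedTime are hypothetical stand-ins and are not part of the Nutch API. It assumes the schedule is told whether the document changed and stores the fetch time only in that case.

```java
public class ModifiedTimeSketch {

    /** Hypothetical stand-in for the page state passed to a fetch schedule. */
    public enum State { UNKNOWN, MODIFIED, NOT_MODIFIED }

    /**
     * Returns the modified time to store after a successful fetch.
     * Assumption: only a document known to be modified gets a new value;
     * in every other case the previously stored value is kept.
     */
    public static long resolveModifiedTime(State state,
                                           long prevModifiedTime,
                                           long fetchTime) {
        if (state == State.MODIFIED) {
            // Document changed: record the time of this successful fetch.
            return fetchTime;
        }
        // Unchanged or unknown: keep whatever was stored before.
        return prevModifiedTime;
    }

    public static void main(String[] args) {
        long prev = 1_000L;
        long now = 2_000L;
        System.out.println(resolveModifiedTime(State.MODIFIED, prev, now));
        System.out.println(resolveModifiedTime(State.NOT_MODIFIED, prev, now));
    }
}
```

Treating UNKNOWN like NOT_MODIFIED is a choice made for this sketch; the actual schedules may handle that state differently.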
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sebastian-nagel/nutch NUTCH-2164
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/nutch/pull/108.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #108
----
commit b0c2969e47a3129a0abd0f98b736616ebaf5b540
Author: Sebastian Nagel <[email protected]>
Date: 2016-03-11T21:55:24Z
NUTCH-2164 NUTCH-2242 Inconsistent 'Modified Time' in crawl db /
lastModified not always set
- set modified time (time of last successful fetch) by
  DefaultFetchSchedule and AdaptiveFetchSchedule
  but only if the document is actually modified
- update unit tests to check whether modification time is properly set
- set modified time (sent by responding server in HTTP header) in
  ProtocolOutput:
  FetchSchedule implementations can access the HTTP modified time from
  CrawlDatum's metadata (PROTO_STATUS_KEY = "_pst_")
----
> Inconsistent 'Modified Time' in crawl db
> ----------------------------------------
>
> Key: NUTCH-2164
> URL: https://issues.apache.org/jira/browse/NUTCH-2164
> Project: Nutch
> Issue Type: Improvement
> Components: crawldb, fetcher
> Affects Versions: 1.11
> Reporter: Thamme Gowda N
> Priority: Minor
>
> The 'Modified time' in the crawldb is invalid. It is set to (0 - timezone
> difference).
> *How to verify/reproduce:*
> Run 'nutch readdb /path/to/crawldb -dump yy' and then inspect the content of
> 'yy'.
> The following improvements can be made:
> 1. Set modified time by DefaultFetchSchedule
> 2. Set ProtocolStatus.lastModified if modified time is available in protocol
> response headers
> This issue is also discussed on the dev mailing list:
> http://www.mail-archive.com/[email protected]/msg19803.html#
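
One plausible reading of the "(0 - timezone difference)" value in the issue description is that the stored modified time is 0 (the Unix epoch) and a dump prints it in the local time zone. The sketch below only illustrates that effect with java.time; the class and method names are hypothetical and nothing here is taken from Nutch itself.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class EpochZeroSketch {

    /** Formats epoch millisecond 0 as seen from the given time zone. */
    public static String formatEpochZero(String zoneId) {
        ZonedDateTime t = Instant.ofEpochMilli(0L).atZone(ZoneId.of(zoneId));
        return DateTimeFormatter.ISO_OFFSET_DATE_TIME.format(t);
    }

    public static void main(String[] args) {
        // Same instant, two zones: the second is shifted by the zone offset.
        System.out.println(formatEpochZero("UTC"));                 // 1970-01-01T00:00:00Z
        System.out.println(formatEpochZero("America/Los_Angeles")); // 1969-12-31T16:00:00-08:00
    }
}
```

A modified time that prints as a date at the end of 1969 or the start of 1970 is therefore a hint that the field was never set, rather than a real timestamp.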
--
This message was sent by Atlassian JIRA (v6.3.4#6332)