[ 
https://issues.apache.org/jira/browse/NUTCH-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286583#comment-15286583
 ] 

Sebastian Nagel commented on NUTCH-2164:
----------------------------------------

Hi [~markus17], [~jurian], [~thammegowda],
I've checked whether the modification time is set properly. For example, this 
record is from a test crawl in which I used 
[FreeGenerator|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/tools/FreeGenerator.java]
 to emulate a re-fetch:
{noformat}
http://lucene.apache.org/       Version: 7
Status: 6 (db_notmodified)
Fetch time: Thu Jun 16 14:54:29 CEST 2016
Modified time: Tue May 17 14:33:28 CEST 2016
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.014409536
Signature: d453859bc8339e45dcb1dd4979f8155c
Metadata: 
        _depth_=2
        _pst_=success(1), lastModified=1462516125000
        _rs_=66
        Content-Type=text/html
        _maxdepth_=1000
        nutch.protocol.code=200
{noformat}
The modification time shown is the time of the first fetch, not the time 
reported by the server. If you trust the modification time sent by the server 
(the lastModified value in the _pst_ metadata), it would be easy to read it 
from the metadata and store it instead of the time of the last successful 
fetch. For lucene.apache.org, as expected, the server-supplied time (in 
milliseconds since the epoch) is reasonable:
{noformat}
% date --date=@$((1462516125000/1000))
Fri May  6 08:28:45 CEST 2016
{noformat}
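To illustrate reading the server-supplied time from the metadata, here is a 
minimal Python sketch that extracts the lastModified value from the _pst_ line 
of the dump above and converts it from milliseconds to a date. The function 
name server_last_modified and the regular expression are illustrative only and 
not part of Nutch; treating a value of 0 as "not set" is an assumption.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Matches the "lastModified=<millis>" entry as it appears in the _pst_
# metadata line of the readdb dump (pattern is an illustration, not Nutch code).
LAST_MODIFIED = re.compile(r"lastModified=(\d+)")


def server_last_modified(dump_line: str) -> Optional[datetime]:
    """Return the server-reported modification time in UTC, or None if absent."""
    match = LAST_MODIFIED.search(dump_line)
    if match is None:
        return None
    millis = int(match.group(1))
    if millis <= 0:
        # Assumption: 0 means the server sent no modification time.
        return None
    # The value is in milliseconds since the epoch; convert to seconds.
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


line = "_pst_=success(1), lastModified=1462516125000"
print(server_last_modified(line))  # 2016-05-06 06:28:45+00:00
```

The result is in UTC; 06:28:45 UTC corresponds to the 08:28:45 CEST printed by 
the date command above.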

If there are no objections, I'll commit this so that it is included in Nutch 1.12.

> Inconsistent 'Modified Time' in crawl db
> ----------------------------------------
>
>                 Key: NUTCH-2164
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2164
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb, fetcher
>    Affects Versions: 1.11
>            Reporter: Thamme Gowda N
>            Priority: Minor
>
> The 'Modified time' in crawldb is invalid. It is set to (0-Timezone 
> Difference)
> *How to verify/reproduce:*
>   Run 'nutch readdb /path/to/crawldb -dump yy' and then inspect content of 
> 'yy'
> The following improvements can be done:
> 1. Set modified time by DefaultFetchSchedule
> 2. Set ProtocolStatus.lastModified if modified time is available in protocol 
> response headers
> This issue is also discussed in dev mailing lists: 
> http://www.mail-archive.com/[email protected]/msg19803.html#



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
