that might look strange but it's not a bug.
The behaviour could be improved (see below), simply because
it's not obvious. I also stumbled over this
point some time ago. It also pops up from time
to time on the mailing lists, see references below.
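
As a side note on why the dump shows 1969: a modified time of 0 is
simply "never set", and 0 milliseconds since the epoch rendered in
Pacific time falls on the evening of Dec 31, 1969. A minimal sketch
(plain JDK only; the class and method names are made up for
illustration and are not part of Nutch):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical demo class, not Nutch code.
class EpochZeroDemo {

    // Formats a millisecond timestamp with the same field layout as
    // Date.toString(), but in an explicitly chosen time zone.
    static String format(long millis, String zoneId) {
        SimpleDateFormat fmt =
            new SimpleDateFormat("EEE MMM dd HH:mm:ss zzz yyyy", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone(zoneId));
        return fmt.format(new Date(millis));
    }

    public static void main(String[] args) {
        // An unset modified time (0) seen from Los Angeles:
        // 16:00 on Dec 31, 1969, i.e. 8 hours behind UTC.
        System.out.println(format(0L, "America/Los_Angeles"));
        // The same instant in UTC is midnight on Jan 1, 1970.
        System.out.println(format(0L, "UTC"));
    }
}
```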

- when indexing, the index-more plugin uses the modified
  time sent by the server, which it takes from the
  Content class

- the "modified time" stored in the CrawlDb is not the
  modified time sent by the server but the time of the
  last "real" fetch, excluding fetches which returned
  an unmodified document, either by if-modified-since
  HTTP requests or by a signature comparison.
  See also NUTCH-933.

- it is set by setFetchSchedule(...) but only by
  AdaptiveFetchSchedule, not by DefaultFetchSchedule.
  The latter does not use it, while the former "adapts"
  the re-fetch interval depending on the change frequency.

- the lastModified field in ProtocolStatus shown by toString()
    _pst_: success(1), lastModified=0
  was obviously never used. It's probably just a relic.
  If you remove it, CrawlDbs become incompatible. But it
  could be filled with the modified time returned by the
  server (or, e.g., the file system for protocol-file).
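
The CrawlDb behaviour described in the points above can be sketched
roughly as follows. This is a simplified model only: Entry, Schedule,
DEFAULT_LIKE and ADAPTIVE_LIKE are hypothetical stand-ins and not the
real CrawlDatum or FetchSchedule classes, whose signatures differ.

```java
// Simplified model, not Nutch code: all names here are hypothetical.
class ModifiedTimeSketch {

    enum FetchResult { MODIFIED, NOT_MODIFIED }

    // Stand-in for a CrawlDb entry.
    static class Entry {
        long fetchTime;     // time of the most recent fetch
        long modifiedTime;  // 0 means "never set"
    }

    interface Schedule {
        void update(Entry e, long fetchTime, FetchResult result);
    }

    // Models the default schedule as described: it never touches
    // the modified time, so it stays at 0.
    static final Schedule DEFAULT_LIKE =
        (e, fetchTime, result) -> e.fetchTime = fetchTime;

    // Models the adaptive schedule as described: the modified time
    // becomes the time of the last fetch that found changed content.
    static final Schedule ADAPTIVE_LIKE = (e, fetchTime, result) -> {
        e.fetchTime = fetchTime;
        if (result == FetchResult.MODIFIED) {
            e.modifiedTime = fetchTime;
        }
    };

    public static void main(String[] args) {
        Entry a = new Entry();
        Entry d = new Entry();
        long[] times = {1000L, 2000L, 3000L};
        FetchResult[] results = {
            FetchResult.MODIFIED, FetchResult.MODIFIED, FetchResult.NOT_MODIFIED
        };
        for (int i = 0; i < times.length; i++) {
            ADAPTIVE_LIKE.update(a, times[i], results[i]);
            DEFAULT_LIKE.update(d, times[i], results[i]);
        }
        System.out.println("adaptive-like modifiedTime = " + a.modifiedTime);
        System.out.println("default-like modifiedTime  = " + d.modifiedTime);
    }
}
```

In this model the unmodified third fetch advances the fetch time but
leaves the adaptive-like modified time at the second fetch, while the
default-like entry keeps a modified time of 0 throughout.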

As said, these could be improvements:
1. also set the modified time in DefaultFetchSchedule
2. set ProtocolStatus.lastModified if a modified time is available

Please, feel free to open Jira issues for these.
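
For the second improvement, the server's Last-Modified header would
first have to be converted to epoch milliseconds. A rough sketch using
only the JDK; the class and method names are hypothetical, and how the
value would be wired into ProtocolStatus is left open here:

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not part of Nutch.
class LastModifiedParser {

    // Converts an HTTP Last-Modified header value (RFC 1123 date) to
    // epoch milliseconds. Returns 0 ("not set") when the header is
    // missing or cannot be parsed.
    static long toEpochMillis(String headerValue) {
        if (headerValue == null) {
            return 0L;
        }
        try {
            return ZonedDateTime
                .parse(headerValue.trim(), DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant()
                .toEpochMilli();
        } catch (DateTimeParseException ex) {
            return 0L;
        }
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillis("Thu, 01 Jan 1970 00:00:01 GMT"));
        System.out.println(toEpochMillis("not a date"));
    }
}
```

Returning 0 for a missing or unparsable header keeps the current
"not set" meaning of lastModified=0.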



On 11/06/2015 01:18 AM, Thamme Gowda N. wrote:
> Hello,
> I found a strange issue with 'Modified time' in nutch crawldb.
> I dumped the crawldb using the command
>     nutch readdb xx -dump yy
> And inspected the 'Modified time' in the dumped content.
> Surprisingly, the 'Modified time' is invalid. All the pages have 'Modified 
> time: Wed Dec 31 16:00:00
> PST 1969' (That is 0 - 8Hrs). It is worth noting that 'lastModified=0' in 
> Metadata.
>
> But, I see actual value in the response header.
> I am using Nutch 1.11, can you verify whether this functionality is broken? 
> -- 
> Regards,
> Thamme Gowda N
