Hi,

that might look strange, but it's not a bug. It could be improved (see below), simply because the behavior is not obvious. I also stumbled over this point some time ago. It also comes up from time to time on the mailing lists; see the references below.

- When indexing the modified time (as sent by the server), the index-more
  plugin uses the value from the Content class:
    content.getMetadata().get(Response.LAST_MODIFIED)

- The "modified time" stored in the CrawlDb is not the modified time sent by
  the server but the time of the last "real" fetch, excluding fetches which
  returned an unmodified document, detected either by an if-modified-since
  HTTP request or by a signature comparison. See also NUTCH-933.

- It is set by setFetchSchedule(...), but only by AdaptiveFetchSchedule, not
  by DefaultFetchSchedule. The latter does not use it, while the former
  "adapts" the re-fetch interval depending on the change frequency.

- The lastModified field in ProtocolStatus, shown by toString() as
    _pst_: success(1), lastModified=0
  was obviously never used. It's probably just a relic. If you remove it,
  CrawlDbs become incompatible. But it could be filled with the modified time
  returned by the server (or, e.g., by the file system for protocol-file).

As said, these could be improvements:
1 also set the modified time in DefaultFetchSchedule
2 set ProtocolStatus.lastModified if a modified time is available

Please feel free to open Jira issues for these.

Thanks,
Sebastian

References:
https://issues.apache.org/jira/browse/NUTCH-933
http://lucene.472066.n3.nabble.com/setting-modifiedTime-in-DefaultFetchSchedule-td4020457.html
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg15056.html

On 11/06/2015 01:18 AM, Thamme Gowda N. wrote:
> Hello,
>
> I found a strange issue with 'Modified time' in nutch crawldb.
>
> I dumped the crawldb using the command
> / nutch readdb xx -dump yy/
>
> And inspected the 'Modified time' in the dumped content.
>
> Surprisingly, the 'Modified time' is invalid. All the pages have 'Modified
> time: Wed Dec 31 16:00:00 PST 1969' (That is 0 - 8Hrs). It is worth noting
> that 'lastModified=0' in Metadata.
>
> But, I see actual value in the response header.
>
> I am using Nutch 1.11, can you verify whether this functionality is broken?
>
> --
> Regards,
> Thamme Gowda N
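
P.S. The 1969 date in the dump is most likely nothing more than an unset modified time of 0 (the Unix epoch) printed in the Pacific time zone. Here is a minimal sketch to illustrate that, assuming the dump formats the value the way java.util.Date.toString() does. The class EpochZeroDemo and its format helper are hypothetical names made up for this example; they are not part of Nutch.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class EpochZeroDemo {

    // Hypothetical helper (not Nutch code): render epoch milliseconds in the
    // same layout as java.util.Date.toString(), but in an explicit time zone
    // so the result does not depend on the machine's default zone.
    static String format(long epochMillis, String zoneId) {
        SimpleDateFormat fmt =
            new SimpleDateFormat("EEE MMM dd HH:mm:ss zzz yyyy", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone(zoneId));
        return fmt.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        // A modified time of 0 means "never set".
        // Prints: Wed Dec 31 16:00:00 PST 1969
        System.out.println(format(0L, "America/Los_Angeles"));
    }
}
```

So the value is not corrupted; the field simply was never filled, which fits DefaultFetchSchedule not setting it.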